Research Article

An Overview of Speech-to-Speech Translation Framework and its Modules

by Nasrin Ehassan, Cosimo Ieracitano, Mandar Gogate, Kia Dashtipour, Amir Hussain
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 185 - Number 26
Year of Publication: 2023
Authors: Nasrin Ehassan, Cosimo Ieracitano, Mandar Gogate, Kia Dashtipour, Amir Hussain
DOI: 10.5120/ijca2023922989

Nasrin Ehassan, Cosimo Ieracitano, Mandar Gogate, Kia Dashtipour, Amir Hussain. An Overview of Speech-to-Speech Translation Framework and its Modules. International Journal of Computer Applications 185, 26 (Aug 2023), 16-26. DOI=10.5120/ijca2023922989

@article{10.5120/ijca2023922989,
  author     = {Nasrin Ehassan and Cosimo Ieracitano and Mandar Gogate and Kia Dashtipour and Amir Hussain},
  title      = {An Overview of Speech-to-Speech Translation Framework and its Modules},
  journal    = {International Journal of Computer Applications},
  issue_date = {Aug 2023},
  volume     = {185},
  number     = {26},
  month      = {Aug},
  year       = {2023},
  issn       = {0975-8887},
  pages      = {16--26},
  numpages   = {9},
  url        = {https://ijcaonline.org/archives/volume185/number26/32851-2023922989/},
  doi        = {10.5120/ijca2023922989},
  publisher  = {Foundation of Computer Science (FCS), NY, USA},
  address    = {New York, USA}
}
%0 Journal Article
%A Nasrin Ehassan
%A Cosimo Ieracitano
%A Mandar Gogate
%A Kia Dashtipour
%A Amir Hussain
%T An Overview of Speech-to-Speech Translation Framework and its Modules
%J International Journal of Computer Applications
%@ 0975-8887
%V 185
%N 26
%P 16-26
%D 2023
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Speech is the most natural form of human communication and arguably the most efficient method of exchanging information. However, communication between people who speak different languages remains a significant challenge. Speech-to-Speech Translation (S2ST) aims to overcome this barrier, making it one of the most promising research domains in speech processing and Natural Language Processing (NLP). This article reviews recent S2ST systems employed for different languages in terms of their constituent modules, namely Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-To-Speech (TTS). Furthermore, it critically examines the main advantages and disadvantages of state-of-the-art S2ST techniques in order to give researchers an up-to-date picture of current systems and potential directions for future work.
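
To make the module chaining concrete, the sketch below shows the classic cascaded S2ST design in Python: input audio is transcribed by ASR, the transcript is translated by MT, and the translation is re-synthesized by TTS. This is a minimal illustration under stated assumptions, not code from the paper; the Audio container and the ASRModule, MTModule, TTSModule, and CascadedS2ST names are hypothetical interfaces invented for this sketch, and any concrete recognizer, translator, or synthesizer could implement them.

# Minimal sketch of a cascaded S2ST pipeline (hypothetical interfaces,
# not an API defined in the paper): ASR -> MT -> TTS chained in sequence.
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Audio:
    # Raw waveform samples plus their sampling rate.
    samples: List[float]
    sample_rate: int

class ASRModule(Protocol):
    # Speech recognition: waveform in, source-language text out.
    def transcribe(self, audio: Audio) -> str: ...

class MTModule(Protocol):
    # Machine translation: source-language text to target-language text.
    def translate(self, text: str, src_lang: str, tgt_lang: str) -> str: ...

class TTSModule(Protocol):
    # Speech synthesis: target-language text back to a waveform.
    def synthesize(self, text: str, lang: str) -> Audio: ...

class CascadedS2ST:
    # Chains the three modules; swapping any backend leaves the others untouched.
    def __init__(self, asr: ASRModule, mt: MTModule, tts: TTSModule):
        self.asr, self.mt, self.tts = asr, mt, tts

    def translate_speech(self, audio: Audio, src_lang: str, tgt_lang: str) -> Audio:
        source_text = self.asr.transcribe(audio)                          # speech -> source text
        target_text = self.mt.translate(source_text, src_lang, tgt_lang)  # source -> target text
        return self.tts.synthesize(target_text, tgt_lang)                 # target text -> speech

The design trade-off this highlights is the one the review is organized around: because the modules communicate only through text, each stage can be developed and evaluated independently, at the cost of recognition errors propagating from ASR into MT and TTS.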

References
  1. W. Wahlster, Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media, 2013.
  2. R. Zbib et al., “Machine translation of Arabic dialects,” in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012, pp. 49–59.
  3. M. A. M. Abushariah, R. N. Ainon, R. Zainuddin, M. Elshafei, and O. O. Khalifa, “Natural speaker-independent Arabic speech recognition system based on Hidden Markov Models using Sphinx tools,” in International Conference on Computer and Communication Engineering (ICCCE’10), 2010, pp. 1–6.
  4. M. Gogate, K. Dashtipour, P. Bell, and A. Hussain, “Deep neural network driven binaural audio visual speech separation,” in 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–7.
  5. M. Gogate, K. Dashtipour, and A. Hussain, “Visual Speech In Real Noisy Environments (VISION): A Novel Benchmark Dataset and Deep Learning-Based Baseline System,” in Interspeech, 2020, pp. 4521–4525.
  6. Y. Bar-Hillel, “The present status of automatic translation of languages,” Advances in computers, vol. 1, pp. 91–163, 1960.
  7. S. Rajpirathap, S. Sheeyam, K. Umasuthan, and A. Chelvarajah, “Real-time direct translation system for Sinhala and Tamil languages,” in 2015 Federated Conference on Computer Science and Information Systems (FedCSIS), 2015, pp. 1437–1443.
  8. S. Poria, O. Y. Soon, B. Liu, and L. Bing, “Affect recognition for multimodal natural language processing,” Cognit Comput, vol. 13, no. 2, pp. 229–230, 2021.
  9. C. Ieracitano, A. Adeel, F. C. Morabito, and A. Hussain, “A novel statistical analysis and autoencoder driven intelligent intrusion detection approach,” Neurocomputing, vol. 387, pp. 51–62, 2020.
  10. C. Ieracitano, A. Paviglianiti, M. Campolo, A. Hussain, E. Pasero, and F. C. Morabito, “A novel automatic classification system based on hybrid unsupervised and supervised machine learning for electrospun nanofibers,” IEEE/CAA Journal of Automatica Sinica, vol. 8, no. 1, pp. 64–76, 2020.
  11. Z. Cai and L. Shao, “Rgb-d scene classification via multi-modal feature learning,” Cognit Comput, vol. 11, no. 6, pp. 825–840, 2019.
  12. K. Dashtipour, M. Gogate, J. Li, F. Jiang, B. Kong, and A. Hussain, “A hybrid Persian sentiment analysis framework: Integrating dependency grammar based rules and deep neural networks,” Neurocomputing, vol. 380, pp. 1–10, 2020.
  13. H. U. Mullah, “A Comparative Study of Different Text-to-Speech Synthesis Techniques,” Int J Sci Eng Res, vol. 6, no. 6, 2015.
  14. D. Moussallem, M. Wauer, and A.-C. N. Ngomo, “Machine translation using semantic web technologies: A survey,” Journal of Web Semantics, vol. 51, pp. 1–19, 2018.
  15. D. W. Lonsdale, A. Franz, and J. R. R. Leavitt, “Large-Scale Machine Translation: An Interlingua Approach,” in IEA/AIE, 1994, pp. 525–530.
  16. M. Madankar, M. B. Chandak, and N. Chavhan, “Information retrieval system and machine translation: a review,” Procedia Comput Sci, vol. 78, pp. 845–850, 2016.
  17. M. D. Okpor, “Machine translation approaches: issues and challenges,” International Journal of Computer Science Issues (IJCSI), vol. 11, no. 5, p. 159, 2014.
  18. M. Rushdi-Saleh, M. T. Martín-Valdivia, L. A. U. Lopez, and J. M. Perea-Ortega, “Bilingual experiments with an Arabic-English corpus for opinion mining,” in Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, 2011, pp. 740–745.
  19. B. A. Abdulsalami and B. J. Akinsanya, “Review of different approaches for machine translations,” International Journal of Mathematics Trends and Technology (IJMTT), vol. 48, no. 3, pp. 197–202, 2017.
  20. S. Dubey, “Survey of Machine Translation Techniques,” International Journal of Advance Research in Computer Science and Management Studies, Special Issue, vol. 5, no. 2, pp. 39–51, 2017.
  21. I. Isewon, J. Oyelade, and O. Oladipupo, “Design and implementation of text to speech conversion for visually impaired people,” Int J Appl Inf Syst, vol. 7, no. 2, pp. 25–30, 2014.
  22. A. Zhang et al., “Clustering of remote sensing imagery using a social recognition-based multi-objective gravitational search algorithm,” Cognit Comput, vol. 11, no. 6, pp. 789–798, 2019.
  23. M. Gogate, A. Adeel, R. Marxer, J. Barker, and A. Hussain, “DNN driven speaker independent audio-visual mask estimation for speech separation,” arXiv preprint arXiv:1808.00060, 2018.
  24. M. Gogate, A. Adeel, and A. Hussain, “Deep learning driven multimodal fusion for automated deception detection,” in 2017 IEEE Symposium Series on Computational Intelligence (SSCI), 2017, pp. 1–6.
  25. X. Yang, K. Huang, R. Zhang, and J. Y. Goulermas, “A novel deep density model for unsupervised learning,” Cognit Comput, vol. 11, no. 6, pp. 778–788, 2019.
  26. M. Dureja and S. Gautam, “Speech-to-Speech Translation: A Review,” Int J Comput Appl, vol. 129, no. 13, pp. 28–30, 2015.
  27. A. Katyal, A. Kaur, and J. Gill, “Automatic speech recognition: a review,” International Journal of Engineering and Advanced Technology (IJEAT), vol. 3, no. 3, pp. 71–74, 2014.
  28. S. K. Gaikwad, B. W. Gawali, and P. Yannawar, “A review on speech recognition technique,” Int J Comput Appl, vol. 10, no. 3, pp. 16–24, 2010.
  29. S. Preeti and K. Parneet, “Automatic speech recognition: A review,” International Journal of Engineering Trends and Technology, vol. 4, no. 2, p. 2013, 2013.
  30. J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,” IEEE/ACM Trans Audio Speech Lang Process, vol. 22, no. 4, pp. 745–777, 2014.
  31. S. J. Arora and R. P. Singh, “Automatic speech recognition: a review,” Int J Comput Appl, vol. 60, no. 9, 2012.
  32. M. Benzeghiba et al., “Automatic speech recognition and speech variability: A review,” Speech Commun, vol. 49, no. 10–11, pp. 763–786, 2007.
  33. A. Alqudsi, N. Omar, and K. Shaker, “Arabic machine translation: a survey,” Artif Intell Rev, vol. 42, no. 4, pp. 549–572, 2014.
  34. M. R. Costa-Jussa and J. A. R. Fonollosa, “Latest trends in hybrid machine translation and its applications,” Comput Speech Lang, vol. 32, no. 1, pp. 3–10, 2015.
  35. F. Gaspari, H. Almaghout, and S. Doherty, “A survey of machine translation competences: Insights for translation technology educators and practitioners,” Perspectives (Montclair), vol. 23, no. 3, pp. 333–358, 2015.
  36. N. J. Khan, W. Anwar, and N. Durrani, “Machine translation approaches and survey for Indian languages,” arXiv preprint arXiv:1701.04290, 2017.
  37. R. K. Chakrawarti and P. Bansal, “Approaches for improving Hindi to English machine translation system,” Indian J Sci Technol, vol. 10, no. 16, pp. 1–8, 2017.
  38. R. K. Chakrawarti, H. Mishra, and P. Bansal, “Review of machine translation techniques for idea of Hindi to English idiom translation,” International Journal of Computational Intelligence Research, vol. 13, no. 5, pp. 1059–1071, 2017.
  39. M. Z. Rashad, H. M. El-Bakry, I. R. Isma’il, and N. Mastorakis, “An overview of text-to-speech synthesis techniques,” Latest Trends on Communications and Information Technology, pp. 84–89, 2010.
  40. W. Mattheyses and W. Verhelst, “Audiovisual speech synthesis: An overview of the state-of-the-art,” Speech Commun, vol. 66, pp. 182–217, 2015.
  41. S. Matsuda et al., “Multilingual speech-to-speech translation system: VoiceTra,” in 2013 IEEE 14th International Conference on Mobile Data Management, 2013, pp. 229–233.
  42. S. Yun, Y.-J. Lee, and S.-H. Kim, “Multilingual speech-to-speech translation system for mobile consumer devices,” IEEE Transactions on Consumer Electronics, vol. 60, no. 3, pp. 508–516, 2014.
  43. J. Chen, S. Wen, V. K. R. Sridhar, and S. Bangalore, “Multilingual web conferencing using speech-to-speech translation,” in INTERSPEECH, 2013, pp. 1861–1863.
  44. A. Abdelali, A. Ali, F. Guzmán, F. Stahlberg, S. Vogel, and Y. Zhang, “QAT2—The QCRI Advanced Transcription and Translation System,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  45. M. D. F. Ansari, R. S. Shaji, T. J. SivaKarthick, S. Vivek, and A. Aravind, “Multilingual speech to speech translation system in bluetooth environment,” in 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2014, pp. 1055–1058.
  46. F. Calefato, F. Lanubile, D. Romita, R. Prikladnicki, and J. H. S. Pinto, “Mobile speech translation for multilingual requirements meetings: A preliminary study,” in 2014 IEEE 9th International Conference on Global Software Engineering, 2014, pp. 145–152.
  47. A. Gopi, T. Sajini, J. Stephen, V. K. Bhadhran, et al., “Multilingual Speech to Speech MT based chat system,” in 2015 International Conference on Computing and Network Communications (CoCoNet), 2015, pp. 771–776.
  48. J. Stephen, M. Anjali, and V. K. Bhadran, “Voice enabled multilingual newspaper reading system,” in 2013 IEEE Global Humanitarian Technology Conference: South Asia Satellite (GHTC-SAS), 2013, pp. 317–320.
  49. S. Nakamura, “Towards real-time multilingual multimodal speech-to-speech translation,” in Spoken Language Technologies for Under-Resourced Languages, 2014.
  50. D. Kamińska, T. Sapiński, and G. Anbarjafari, “Efficiency of chosen speech descriptors in relation to emotion recognition,” EURASIP J Audio Speech Music Process, vol. 2017, no. 1, pp. 1–9, 2017.
  51. K. M. O. Nahar, M. Abu Shquier, W. G. Al-Khatib, H. Al-Muhtaseb, and M. Elshafei, “Arabic phonemes recognition using hybrid LVQ/HMM model for continuous speech recognition,” Int J Speech Technol, vol. 19, no. 3, pp. 495–508, 2016.
  52. A. Alshutayri, E. Atwell, A. Alosaimy, J. Dickins, M. Ingleby, and J. Watson, “Arabic language WEKA-based dialect classifier for Arabic automatic speech recognition transcripts,” in Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), 2016, pp. 204–211.
  53. K. Bahreini, R. Nadolski, and W. Westera, “Towards real-time speech emotion recognition for affective e-learning,” Educ Inf Technol (Dordr), vol. 21, no. 5, pp. 1367–1386, 2016.
  54. K. Han, D. Yu, and I. Tashev, “Speech emotion recognition using deep neural network and extreme learning machine,” in Interspeech 2014, 2014.
  55. S. Sarma and A. Barman, “Multilingual speech identification using artificial neural network,” International Journal of Information Technology Convergence and Services (IJITCS), vol. 5, pp. 1–6.
  56. H. B. Sailor and H. A. Patil, “Filterbank learning using convolutional restricted Boltzmann machine for speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5895–5899.
  57. J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, p. 27403, 1993.
  58. D. B. Paul and J. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992, 1992.
  59. J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual speech enhancement using multimodal deep convolutional neural networks,” IEEE Trans Emerg Top Comput Intell, vol. 2, no. 2, pp. 117–128, 2018.
  60. E. Gauthier, L. Besacier, and S. Voisin, “Automatic speech recognition for African languages with vowel length contrast,” Procedia Comput Sci, vol. 81, pp. 136–143, 2016.
  61. M. A. Menacer, O. Mella, D. Fohr, D. Jouvet, D. Langlois, and K. Smaili, “An enhanced automatic speech recognition system for Arabic,” in The third Arabic Natural Language Processing Workshop-EACL 2017, 2017.
  62. S. Agarwalla and K. K. Sarma, “Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech,” Neural Networks, vol. 78, pp. 97–111, 2016.
  63. H. B. Sailor and H. A. Patil, “Filterbank learning using convolutional restricted Boltzmann machine for speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5895–5899.
  64. A. Revathi, N. Sasikaladevi, and C. Jeyalakshmi, “Digital speech watermarking to enhance the security using speech as a biometric for person authentication,” Int J Speech Technol, vol. 21, pp. 1021–1031, 2018.
  65. K. Huang, Y. Liu, and Y. Hong, “Reduction of residual noise based on eigencomponent filtering for speech enhancement,” Int J Speech Technol, vol. 21, pp. 877–886, 2018.
  66. Q. T. Do, S. Sakti, G. Neubig, and S. Nakamura, “Transferring Emphasis in Speech Translation Using Hard-Attentional Neural Network Models,” in INTERSPEECH, 2016, pp. 2533–2537.
  67. J. Nair, K. A. Krishnan, and R. Deetha, “An efficient English to Hindi machine translation system using hybrid mechanism,” in 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2016, pp. 2109–2113.
  68. M. Ma, D. Li, K. Zhao, and L. Huang, “Osu multimodal machine translation system report,” arXiv preprint arXiv:1710.02718, 2017.
  69. O. Firat, K. Cho, and Y. Bengio, “Multi-way, multilingual neural machine translation with a shared attention mechanism,” arXiv preprint arXiv:1601.01073, 2016.
  70. M. M. A. Shquier and K. M. Alhawiti, “Fully automated Arabic to English machine translation system: transfer-based approach of AE-TBMT,” International Journal of Information and Communication Technology, vol. 10, no. 4, pp. 376–391, 2017.
  71. A. Almahairi, K. Cho, N. Habash, and A. Courville, “First result on Arabic neural machine translation,” arXiv preprint arXiv:1606.02680, 2016.
  72. Q. T. Do, S. Sakti, and S. Nakamura, “Sequence-to-sequence models for emphasis speech translation,” IEEE/ACM Trans Audio Speech Lang Process, vol. 26, no. 10, pp. 1873–1883, 2018.
  73. J. Su, J. Zeng, D. Xiong, Y. Liu, M. Wang, and J. Xie, “A hierarchy-to-sequence attentional neural machine translation model,” IEEE/ACM Trans Audio Speech Lang Process, vol. 26, no. 3, pp. 623–632, 2018.
  74. H. AlRouqi et al., “Evaluating Arabic Text-to-Speech synthesizers for mobile phones,” in 2015 Tenth International Conference on Digital Information Management (ICDIM), 2015, pp. 89–94.
  75. M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
  76. K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
  77. O. Firat, K. Cho, and Y. Bengio, “Multi-way, multilingual neural machine translation with a shared attention mechanism,” arXiv preprint arXiv:1601.01073, 2016.
  78. M. M. A. Shquier and K. M. Alhawiti, “Fully automated Arabic to English machine translation system: transfer-based approach of AE-TBMT,” International Journal of Information and Communication Technology, vol. 10, no. 4, pp. 376–391, 2017.
  79. A. Almahairi, K. Cho, N. Habash, and A. Courville, “First result on Arabic neural machine translation,” arXiv preprint arXiv:1606.02680, 2016.
  80. Q. T. Do, S. Sakti, and S. Nakamura, “Sequence-to-sequence models for emphasis speech translation,” IEEE/ACM Trans Audio Speech Lang Process, vol. 26, no. 10, pp. 1873–1883, 2018.
  81. I. Rebai and Y. BenAyed, “Text-to-speech synthesis system with Arabic diacritic recognition system,” Comput Speech Lang, vol. 34, no. 1, pp. 43–60, 2015.
  82. I. Rebai and Y. BenAyed, “Arabic speech synthesis and diacritic recognition,” Int J Speech Technol, vol. 19, no. 3, pp. 485–494, 2016.
  83. S. M. Abu-Soud, “ILATalk: a new multilingual text-to-speech synthesizer with machine learning,” Int J Speech Technol, vol. 19, no. 1, pp. 55–64, 2016.
  84. F. Araújo, A. Klautau, et al., “Genetic algorithm to estimate the input parameters of Klatt and HLSyn formant-based speech synthesizers,” Biosystems, vol. 150, pp. 190–193, 2016.
  85. P. Birkholz, L. Martin, Y. Xu, S. Scherbaum, and C. Neuschaefer-Rube, “Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis,” Comput Speech Lang, vol. 41, pp. 116–127, 2017.
  86. I. Rebai and Y. BenAyed, “Text-to-speech synthesis system with Arabic diacritic recognition system,” Comput Speech Lang, vol. 34, no. 1, pp. 43–60, 2015.
  87. I. Rebai and Y. BenAyed, “Arabic speech synthesis and diacritic recognition,” Int J Speech Technol, vol. 19, pp. 485–494, 2016.
  88. I. Abu Doush, F. Alkhatib, and A. A. R. Bsoul, “What we have and what is needed, how to evaluate Arabic Speech Synthesizer?,” Int J Speech Technol, vol. 19, pp. 415–432, 2016.
  89. A. Jafri, I. Sobh, and A. Alkhairy, “Statistical formant speech synthesis for Arabic,” Arab J Sci Eng, vol. 40, pp. 3151–3159, 2015.
Index Terms

Computer Science
Information Sciences

Keywords

Automatic Speech Recognition, Machine Translation, Speech-to-Speech Translation, Text-to-Speech