Research Article

Subtitle Generating Media Player using Mozilla DeepSpeech Model

by Waat Perera, B. Hettige
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 185 - Number 28
Year of Publication: 2023
Authors: Waat Perera, B. Hettige
10.5120/ijca2023923033

Waat Perera, B. Hettige. Subtitle Generating Media Player using Mozilla DeepSpeech Model. International Journal of Computer Applications 185, 28 (Aug 2023), 34-42. DOI=10.5120/ijca2023923033

@article{ 10.5120/ijca2023923033,
author = { Waat Perera, B. Hettige },
title = { Subtitle Generating Media Player using Mozilla DeepSpeech Model },
journal = { International Journal of Computer Applications },
issue_date = { Aug 2023 },
volume = { 185 },
number = { 28 },
month = { Aug },
year = { 2023 },
issn = { 0975-8887 },
pages = { 34-42 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume185/number28/32871-2023923033/ },
doi = { 10.5120/ijca2023923033 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
Abstract

Subtitles play a major role in consuming media. Most media content either comes without any subtitles or with only basic subtitles in its native language, so finding subtitles in a language other than the native one, or creating subtitles for new media content, is not an easy task. Popular films, TV shows, and sometimes songs may have subtitles available in more than one language, but the majority of content is never exposed to the internet. To address this issue, this paper proposes a method to generate real-time subtitles in a selected language from English-language media files using the existing Mozilla DeepSpeech model and the Google Cloud Platform Translation API. The proposed system takes any English media content in .mp4 file format as input and generates subtitles in the user's desired language as a .srt output. Further, this paper also gives an overview of existing methods for speech-to-text conversion and their advantages and disadvantages compared with the Mozilla DeepSpeech model. The system has been tested with human evaluation methods as well as an automated evaluation method, namely BLEU.
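The final stage of the pipeline described above writes the (translated) transcript out as a .srt file. The sketch below, a minimal stdlib-only Python illustration and not the paper's actual implementation, shows the SubRip formatting step: each subtitle block gets a sequence number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, and the text. The `segments` input, a list of `(start, end, text)` tuples, is an assumption standing in for whatever timing metadata the speech-to-text stage produces.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render a list of (start_sec, end_sec, text) tuples as an SRT file body.

    Each block is numbered sequentially and separated by a blank line,
    per the SubRip convention.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Example: two recognized-and-translated segments become a two-block .srt body.
body = to_srt([(0.0, 2.5, "Hello there."), (2.5, 5.0, "How are you?")])
```

In the full system, the text of each segment would come from DeepSpeech's transcription after being passed through the translation API; only the formatting logic is shown here.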

References
  1. A. Ramani, A. Rao, V. Vidya, and V. B. Prasad, “Automatic Subtitle Generation for Videos,” in 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), Mar. 2020, pp. 132–135. doi: 10.1109/ICACCS48705.2020.9074180.
  2. N. Radha and R. Pradeep, “Automated subtitle generation,” vol. 10, pp. 24741–24746, Jan. 2015.
  3. B. Xu, C. Tao, Z. Feng, Y. Raqui, and S. Ranwez, A Benchmarking on Cloud based Speech-To-Text Services for French Speech and Background Noise Effect. 2021.
  4. P. R. Hjulström, “Evaluation of a speech recognition system,” 2015. https://www.semanticscholar.org/paper/Evaluation-of-a-speech-recognition-system-Hjulstr%C3%B6m/49c1997d54811c7eb79463260f3513c2a89b7235 (accessed Oct. 10, 2022).
  5. J. Huang et al., “The IBM Rich Transcription Spring 2006 Speech-to-Text System for Lecture Meetings,” May 2006, pp. 432–443. doi: 10.1007/11965152_38.
  6. R. D. Sharp et al., “The Watson speech recognition engine,” in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Apr. 1997, pp. 4065–4068 vol.5. doi: 10.1109/ICASSP.1997.604839.
  7. F. Filippidou and L. Moussiades, “A Benchmarking of IBM, Google and Wit Automatic Speech Recognition Systems,” Artificial Intelligence Applications and Innovations, vol. 583, pp. 73–82, May 2020, doi: 10.1007/978-3-030-49161-1_7.
  8. M. Stenman, “Automatic speech recognition: An evaluation of Google Speech,” 2015, Accessed: Oct. 10, 2022. [Online]. Available: https://www.semanticscholar.org/paper/Automatic-speech-recognition-An-evaluation-of-Stenman/69dab8bf2f729ed94f53a2dd5df03799258b34a8
  9. N. Anggraini, A. Kuniawan, L. Wardhani, and N. Hakiem, “Speech Recognition Application for the Speech Impaired using the Android-based Google Cloud Speech API,” Telkomnika (Telecommunication Computing Electronics and Control), vol. 16, pp. 2733–2739, Dec. 2018, doi: 10.12928/TELKOMNIKA.v16i6.9638.
  10. J. Y. Chan and H. H. Wang, “Speech Recorder and Translator using Google Cloud Speech-to-Text and Translation,” Journal of IT in Asia, Dec. 2021, Accessed: Oct. 10, 2022. [Online]. Available: https://publisher.unimas.my/ojs/index.php/JITA/article/view/2815
  11. A. Agarwal and T. Zesch, Robustness of end-to-end Automatic Speech Recognition Models -- A Case Study using Mozilla DeepSpeech. 2021.
  12. A. Agarwal and T. Zesch, “LTL-UDE at Low-Resource Speech-to-Text Shared Task: Investigating Mozilla DeepSpeech in a low-resource setting,” p. 5.
  13. E. Nacimiento-García, C. S. González-González, and F. L. Gutiérrez-Vela, “Automatic captions on video calls, a must for the elderly: Using Mozilla DeepSpeech for the STT,” in Proceedings of the XXI International Conference on Human Computer Interaction, in Interacción ’21. New York, NY, USA: Association for Computing Machinery, Sep. 2021, pp. 1–7. doi: 10.1145/3471391.3471392.
  14. A. Sherstinsky, “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, Mar. 2020, doi: 10.1016/j.physd.2019.132306.
  15. H. Sak, A. Senior, and F. Beaufays, “Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition.” arXiv, Feb. 05, 2014. doi: 10.48550/arXiv.1402.1128.
  16. G. E. Dahl, Dong Yu, Li Deng, and A. Acero, “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition,” IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012, doi: 10.1109/TASL.2011.2134090.
  17. A. Amberkar, P. Awasarmol, G. Deshmukh, and P. Dave, “Speech Recognition using Recurrent Neural Networks,” Mar. 2018, pp. 1–4. doi: 10.1109/ICCTCT.2018.8551185.
  18. A. F. Agarap, “Deep Learning using Rectified Linear Units (ReLU).” arXiv, Feb. 07, 2019. Accessed: Oct. 10, 2022. [Online]. Available: http://arxiv.org/abs/1803.08375
  19. C. K. On, P. M. Pandiyan, S. Yaacob, and A. Saudi, “Mel-frequency cepstral coefficient analysis in speech recognition,” in 2006 International Conference on Computing & Informatics, Jun. 2006, pp. 1–5. doi: 10.1109/ICOCI.2006.5276486.
  20. L. Muda, M. Begam, and I. Elamvazuthi, “Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques.” arXiv, Mar. 22, 2010. doi: 10.48550/arXiv.1003.4083.
  21. R. Vergin, D. O’Shaughnessy, and A. Farhat, “Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 525–532, Sep. 1999, doi: 10.1109/89.784104.
  22. A. Graves, “Connectionist Temporal Classification,” in Supervised Sequence Labelling with Recurrent Neural Networks, A. Graves, Ed., in Studies in Computational Intelligence. Berlin, Heidelberg: Springer, 2012, pp. 61–93. doi: 10.1007/978-3-642-24797-2_7.
  23. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” presented at the ICML 2006 - Proceedings of the 23rd International Conference on Machine Learning, Jan. 2006, pp. 369–376. doi: 10.1145/1143844.1143891.
  24. H. Scheidl, S. Fiel, and R. Sablatnig, “Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm,” in 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Aug. 2018, pp. 253–258. doi: 10.1109/ICFHR-2018.2018.00052.
  25. A. Hannun, “Sequence Modeling with CTC,” Distill, vol. 2, no. 11, p. e8, Nov. 2017, doi: 10.23915/distill.00008.
  26. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a Method for Automatic Evaluation of Machine Translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, Jul. 2002, pp. 311–318. doi: 10.3115/1073083.1073135.
  27. D. Dmello, “donnabelldmello/nlp-bleu.” Nov. 17, 2019. Accessed: Oct. 22, 2022. [Online]. Available: https://github.com/donnabelldmello/nlp-bleu
Index Terms

Computer Science
Information Sciences

Keywords

Deep Learning, DeepSpeech, Language Translation, Media Player, Mozilla, Speech Recognition, Speech to Text, Subtitle Generation.