Research Article

Building Parallel Corpora for SMT System: A Case Study of English-Manipuri

by Thoudam Doren Singh
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 52 - Number 14
Year of Publication: 2012
Authors: Thoudam Doren Singh

Thoudam Doren Singh. Building Parallel Corpora for SMT System: A Case Study of English-Manipuri. International Journal of Computer Applications. 52, 14 (August 2012), 47-51. DOI=10.5120/8274-1876

@article{ 10.5120/8274-1876,
author = { Thoudam Doren Singh },
title = { Building Parallel Corpora for SMT System: A Case Study of English-Manipuri },
journal = { International Journal of Computer Applications },
issue_date = { August 2012 },
volume = { 52 },
number = { 14 },
month = { August },
year = { 2012 },
issn = { 0975-8887 },
pages = { 47-51 },
numpages = {9},
url = { },
doi = { 10.5120/8274-1876 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:52:16.442712+05:30
%A Thoudam Doren Singh
%T Building Parallel Corpora for SMT System: A Case Study of English-Manipuri
%J International Journal of Computer Applications
%@ 0975-8887
%V 52
%N 14
%P 47-51
%D 2012
%I Foundation of Computer Science (FCS), NY, USA

Statistical Machine Translation (SMT) systems are developed using sentence-aligned parallel corpora. The difficulty is that parallel corpora of the required size do not exist for many language pairs. Preparing a large-scale parallel corpus takes time and demands linguistic skill. The present work discusses the various issues involved in building a quality parallel corpus and develops a technique that extracts a parallel corpus between Manipuri, a morphologically rich and resource-constrained Indian language, and English from web-based comparable news corpora. We explore the key aspects of the parallel corpora that bear on improving translation quality through linguistic factors for this language pair.

Index Terms

Computer Science
Information Sciences


Sentence alignment, Precision, Recall, English-Manipuri, Agglutinative, Morphology