Research Article

Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields

by Manikrao L Dhore, Shantanu K Dixit, Tushar D Sonwalkar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 48 - Number 23
Year of Publication: 2012
Authors: Manikrao L Dhore, Shantanu K Dixit, Tushar D Sonwalkar
10.5120/7522-0624

Manikrao L Dhore, Shantanu K Dixit, Tushar D Sonwalkar. Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields. International Journal of Computer Applications. 48, 23 (June 2012), 31-37. DOI=10.5120/7522-0624

@article{ 10.5120/7522-0624,
author = { Manikrao L Dhore, Shantanu K Dixit, Tushar D Sonwalkar },
title = { Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields },
journal = { International Journal of Computer Applications },
issue_date = { June 2012 },
volume = { 48 },
number = { 23 },
month = { June },
year = { 2012 },
issn = { 0975-8887 },
pages = { 31-37 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume48/number23/7522-0624/ },
doi = { 10.5120/7522-0624 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:44:52.479265+05:30
%A Manikrao L Dhore
%A Shantanu K Dixit
%A Tushar D Sonwalkar
%T Hindi to English Machine Transliteration of Named Entities using Conditional Random Fields
%J International Journal of Computer Applications
%@ 0975-8887
%V 48
%N 23
%P 31-37
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Machine transliteration has received significant research attention in recent years. In most cases, the source language has been English and the target language an Asian language. This paper focuses on Hindi to English machine transliteration of Indian named entities such as proper nouns, place names and organization names using conditional random fields (CRF). Hindi is the national language of India and is spoken by more than 500 million Indians. Hindi is the world's fourth most commonly used language after Chinese, English and Spanish. The system takes an Indian place name as input in Hindi, written in Devanagari script, and transliterates it into English. The input is provided in syllabified form so that n-gram techniques can be applied. As more than 50% of named entities are formed from two or three syllabic units, an n-gram approach with unigrams, bigrams and trigrams of Hindi is used for training. The system gives satisfactory performance with trigrams compared with unigrams and bigrams.
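
The n-gram step described in the abstract can be illustrated with a minimal sketch. The function name `syllable_ngrams` and the example segmentation are assumptions made for illustration only; the abstract does not give the paper's syllabification rules or its CRF feature templates, so this shows only how unigrams, bigrams and trigrams can be formed over an already syllabified name.

```python
from typing import List, Tuple


def syllable_ngrams(units: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams over a sequence of syllabic units."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields no n-grams (empty range).
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


# Hypothetical syllabified input for a place name (assumed segmentation,
# not taken from the paper).
units = ["मुं", "ब", "ई"]

for n in (1, 2, 3):
    # Print the unigrams, bigrams and trigrams of the syllabic units.
    print(n, syllable_ngrams(units, n))
```

In a CRF setting, n-grams like these would typically serve as observation features for each syllabic unit, with the English transliteration units as the labels to be predicted.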

Index Terms

Computer Science
Information Sciences

Keywords

Bigram, Conditional Random Fields, Trigram, Transliteration, Syllabification