Research Article

Penn Treebank-Based Syntactic Parsers for South Dravidian Languages using a Machine Learning Approach

by Antony P J, Nandini. J. Warrier, Dr. Soman K P
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 7 - Number 8
Year of Publication: 2010
DOI: 10.5120/1272-1789

Antony P J, Nandini. J. Warrier, Dr. Soman K P. Penn Treebank-Based Syntactic Parsers for South Dravidian Languages using a Machine Learning Approach. International Journal of Computer Applications. 7, 8 (October 2010), 14-21. DOI=10.5120/1272-1789

@article{ 10.5120/1272-1789,
author = { Antony P J and Nandini. J. Warrier and Dr. Soman K P },
title = { Penn Treebank-Based Syntactic Parsers for South Dravidian Languages using a Machine Learning Approach },
journal = { International Journal of Computer Applications },
issue_date = { October 2010 },
volume = { 7 },
number = { 8 },
month = { October },
year = { 2010 },
issn = { 0975-8887 },
pages = { 14-21 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume7/number8/1272-1789/ },
doi = { 10.5120/1272-1789 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T19:55:47.093442+05:30
%A Antony P J
%A Nandini. J. Warrier
%A Dr. Soman K P
%T Penn Treebank-Based Syntactic Parsers for South Dravidian Languages using a Machine Learning Approach
%J International Journal of Computer Applications
%@ 0975-8887
%V 7
%N 8
%P 14-21
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

With only limited electronic resources available, developing a syntactic parser that handles all sentence forms is a challenging and demanding task for any natural language. This paper presents the development of Penn Treebank-based statistical syntactic parsers for two South Dravidian languages, Kannada and Malayalam. Syntactic parsing is the task of recognizing a sentence and assigning a syntactic structure to it. A syntactic parser is an essential tool for many natural language processing (NLP) applications and for natural language understanding. The well-known Penn Treebank structure was used as the grammar formalism to create the corpus for the proposed statistical syntactic parsers. Both parsing systems were trained on a Treebank-based corpus consisting of 1,000 carefully constructed Kannada and Malayalam sentences. The corpus was already annotated with correct segmentation and Part-Of-Speech (POS) information. We used our own POS tagger generator to assign a tag to every word in the training and test sentences. The proposed syntactic parsers were implemented using supervised machine learning and probabilistic context-free grammar (PCFG) approaches. Training, testing and evaluation were done with support vector method (SVM) algorithms. In our experiments, both systems performed well and achieved very competitive accuracy.
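To make the PCFG part of the abstract concrete, here is a minimal sketch of how rule probabilities can be estimated by relative frequency from Penn Treebank-style bracketed trees. It is not the authors' implementation and does not cover the SVM-based training the abstract mentions. The function names (`parse_bracketed`, `count_rules`, `estimate_pcfg`), the tag labels, and the placeholder tokens `w1`, `w2`, `w3` in `TOY_TREES` are all hypothetical and stand in for real annotated Kannada or Malayalam sentences.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a Penn Treebank-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    return read_node()


def count_rules(tree, counts):
    """Count every rule LHS -> RHS that appears in the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def estimate_pcfg(bracketed_trees):
    """Relative-frequency estimate: P(LHS -> RHS) = count(rule) / count(LHS)."""
    counts = Counter()
    for text in bracketed_trees:
        count_rules(parse_bracketed(text), counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical toy treebank with placeholder tokens (verb-final order).
TOY_TREES = [
    "(S (NP (NN w1)) (VP (NP (NN w2)) (VB w3)))",
    "(S (NP (NN w1)) (VP (VB w3)))",
]

if __name__ == "__main__":
    for (lhs, rhs), p in sorted(estimate_pcfg(TOY_TREES).items()):
        print(f"{lhs} -> {' '.join(rhs)}  [{p:.3f}]")
```

In this toy treebank the two expansions of `VP` each occur once, so each receives probability 0.5. A real system would also need smoothing for unseen rules and a parsing algorithm that uses the estimated probabilities.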

Index Terms

Computer Science
Information Sciences

Keywords

Penn Treebank, Dravidian Languages, Syntactic Parser, Part-Of-Speech, Support Vector Methods