Research Article

Improving Unsupervised Stemming by Fusing Partial Lemmatization Coupled with

by Deepa Gupta, Rahul Kumar Yadav, Nidhi Sajan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 38 - Number 8
Year of Publication: 2012
Authors: Deepa Gupta, Rahul Kumar Yadav, Nidhi Sajan
DOI: 10.5120/4705-6867

Deepa Gupta, Rahul Kumar Yadav, Nidhi Sajan. Improving Unsupervised Stemming by Fusing Partial Lemmatization Coupled with. International Journal of Computer Applications. 38, 8 (January 2012), 1-8. DOI=10.5120/4705-6867

@article{ 10.5120/4705-6867,
author = { Deepa Gupta and Rahul Kumar Yadav and Nidhi Sajan },
title = { Improving Unsupervised Stemming by Fusing Partial Lemmatization Coupled with },
journal = { International Journal of Computer Applications },
issue_date = { January 2012 },
volume = { 38 },
number = { 8 },
month = { January },
year = { 2012 },
issn = { 0975-8887 },
pages = { 1-8 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume38/number8/4705-6867/ },
doi = { 10.5120/4705-6867 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:24:19.346633+05:30
%A Deepa Gupta
%A Rahul Kumar Yadav
%A Nidhi Sajan
%T Improving Unsupervised Stemming by Fusing Partial Lemmatization Coupled with
%J International Journal of Computer Applications
%@ 0975-8887
%V 38
%N 8
%P 1-8
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Stemming and lemmatization are two important natural language processing techniques, widely used in Information Retrieval (IR) for query processing and in Machine Translation (MT) for reducing data sparseness. Both reduce the inflectional forms, and sometimes the derivationally related forms, of a word to a common base form. Most existing stemming and lemmatization work relies either on language-dependent rules, which require the supervision of a language expert, or on a probabilistic approach that needs a vast monolingual corpus; in both cases, the stemming and lemmatization algorithms are developed independently of each other. In our work, we propose an unsupervised stemmer for Hindi that is hybridized with partial lemmatization. The proposed stemmer is unique in that it exploits a novel grouping criterion; it aims to improve unsupervised stemming and, most importantly, to avoid over-stemming, a common problem in stemming. The latter is tackled by introducing lemmas. We incorporate lemmatization based on data heuristics obtained from the corpus, without using word class information. Applying this concept to unsupervised stemming yielded significant improvements in the desired results compared with other prevailing approaches of its genre.
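To make the over-stemming problem concrete, the following is a minimal Python sketch, not the authors' algorithm. It uses English toy words for readability, and the suffix list, the `naive_stem` and `guarded_stem` helpers, and the "accept a stem only if it occurs in the corpus vocabulary" check are all assumptions introduced for illustration; the paper's actual grouping criterion and data heuristics are not given in the abstract.

```python
# Hypothetical suffix list; a real stemmer would learn or curate this.
SUFFIXES = ("ing", "ity", "es", "s", "e")


def naive_stem(word, min_stem=3):
    """Strip the longest matching suffix, with no check on the result."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def guarded_stem(word, vocabulary, min_stem=3):
    """Accept the stripped form only if it is itself an attested word.

    This stands in for a corpus-derived lemma check (an assumption):
    a stem that never occurs in the corpus is rejected and the word
    is left unchanged.
    """
    candidate = naive_stem(word, min_stem)
    return candidate if candidate in vocabulary else word


# Toy corpus vocabulary (assumed data).
vocabulary = {"walk", "walks", "walking", "universe", "university"}

for w in ("walks", "walking", "universe", "university"):
    print(w, naive_stem(w), guarded_stem(w, vocabulary))
```

Here the naive stripper maps both "universe" and "university" to the same non-word stem, which is the kind of conflation of unrelated words that over-stemming refers to, while the guarded version keeps them apart and still reduces "walks" and "walking" to "walk".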

Index Terms

Computer Science
Information Sciences

Keywords

Stemming, Lemmatization, Hindi, Over-stemming, Under-stemming, Clustering, Data-based heuristics