Improving Unsupervised Stemming by Fusing Partial Lemmatization Coupled with

Deepa Gupta; Rahul Kumar Yadav; Nidhi Sajan

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 38 - Number 8

Year of Publication: 2012

10.5120/4705-6867

Deepa Gupta, Rahul Kumar Yadav, Nidhi Sajan. Improving Unsupervised Stemming by Fusing Partial Lemmatization Coupled with. International Journal of Computer Applications. 38, 8 (January 2012), 1-8. DOI=10.5120/4705-6867

@article{10.5120/4705-6867,
  author = {Deepa Gupta and Rahul Kumar Yadav and Nidhi Sajan},
  title = {Improving Unsupervised Stemming by Fusing Partial Lemmatization Coupled with},
  journal = {International Journal of Computer Applications},
  issue_date = {January 2012},
  volume = {38},
  number = {8},
  month = {January},
  year = {2012},
  issn = {0975-8887},
  pages = {1-8},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume38/number8/4705-6867/},
  doi = {10.5120/4705-6867},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

Abstract

Stemming and lemmatization are two important natural language processing techniques, widely used in Information Retrieval (IR) for query processing and in Machine Translation (MT) for reducing data sparseness. Both reduce the inflectional forms, and sometimes the derivationally related forms, of a word to a common base form. Most existing stemming and lemmatization work relies either on language-dependent rules, which require the supervision of a language expert, or on a probabilistic approach, which needs a vast monolingual corpus; in both cases the stemming and lemmatization algorithms are developed independently of each other. In this work, we propose an unsupervised stemmer for Hindi that is hybridized with partial lemmatization. The proposed stemmer is unique in that it exploits a novel grouping criterion that aims to improve unsupervised stemming and, most importantly, to avoid over-stemming, a common problem in stemming. The latter is tackled by introducing lemmas. We incorporate lemmatization based on data heuristics obtained from the corpus, without using word-class information. Applying this concept to unsupervised stemming yielded significant improvements in the desired results compared with other prevailing approaches of its kind.
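The over-stemming problem mentioned in the abstract, and the general idea of letting a lemma take priority over a stripped stem, can be illustrated with a minimal sketch. This is not the authors' algorithm: the suffix list, the hand-written lemma table, and the function names `strip_suffix` and `hybrid_stem` are all hypothetical, and English words are used only for readability, whereas the paper targets Hindi and derives its lemma information from corpus heuristics.

```python
# Toy illustration only. SUFFIXES and PARTIAL_LEMMAS are assumed values,
# not taken from the paper.
SUFFIXES = ("ing", "ed", "s")  # checked in this order, longest first

# Hypothetical partial lemma table: covers only words known to be over-stemmed.
PARTIAL_LEMMAS = {"news": "news", "sing": "sing"}


def strip_suffix(word: str) -> str:
    """Naive stemmer: remove the first matching suffix, if any."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def hybrid_stem(word: str) -> str:
    """Prefer a known lemma; fall back to suffix stripping otherwise."""
    if word in PARTIAL_LEMMAS:
        return PARTIAL_LEMMAS[word]
    return strip_suffix(word)


if __name__ == "__main__":
    for w in ("walking", "walked", "walks", "news", "sing"):
        print(w, strip_suffix(w), hybrid_stem(w))
```

With these assumed tables, the naive stemmer maps "news" to "new" and "sing" to "s", both over-stemmed, while the hybrid version leaves them intact and still reduces "walking", "walked", and "walks" to "walk". The lemma table is deliberately partial: it only needs entries for words where stripping goes wrong.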

Index Terms

Computer Science

Information Sciences

Keywords

Stemming, Lemmatization, Hindi, Over-stemming, Under-stemming, Clustering, Data-based heuristics