Research Article

Text Clustering using a WordNet-based Knowledge-Base and the Lesk Algorithm

by Jyotirmayee Choudhury, Deepesh Kumar Kimtani, Alok Chakrabarty
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 48 - Number 21
Year of Publication: 2012
DOI: 10.5120/7480-0545

Jyotirmayee Choudhury, Deepesh Kumar Kimtani, Alok Chakrabarty. Text Clustering using a WordNet-based Knowledge-Base and the Lesk Algorithm. International Journal of Computer Applications. 48, 21 (June 2012), 20-24. DOI=10.5120/7480-0545

@article{ 10.5120/7480-0545,
author = { Jyotirmayee Choudhury and Deepesh Kumar Kimtani and Alok Chakrabarty },
title = { Text Clustering using a WordNet-based Knowledge-Base and the Lesk Algorithm },
journal = { International Journal of Computer Applications },
issue_date = { June 2012 },
volume = { 48 },
number = { 21 },
month = { June },
year = { 2012 },
issn = { 0975-8887 },
pages = { 20-24 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume48/number21/7480-0545/ },
doi = { 10.5120/7480-0545 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:45:03.050938+05:30
%A Jyotirmayee Choudhury
%A Deepesh Kumar Kimtani
%A Alok Chakrabarty
%T Text Clustering using a WordNet-based Knowledge-Base and the Lesk Algorithm
%J International Journal of Computer Applications
%@ 0975-8887
%V 48
%N 21
%P 20-24
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper proposes a text clustering method based on the Lesk algorithm, a well-known Word Sense Disambiguation (WSD) algorithm, which is used to cluster textual data by accurately disambiguating word senses. The clustering is thus based primarily on the context, or meaning, of the words used for clustering. The Lesk algorithm returns sense identifiers for the words used to group the text files by looking up each word's senses in a knowledge base similar to the English WordNet, enriched with more informative columns (fields) for each synset (synonym set) of the English WordNet database. This enrichment greatly increases the chances of contextual overlap and hence the accuracy with which the proper sense, or context, of each word is identified. The proposed scheme has been tested on a number of heterogeneous text document datasets. Its clustering results and accuracies have been compared with those obtained by applying the K-means clustering algorithm to the Vector Space Models (VSM) generated for the same datasets. Experimental results show that the proposed algorithm performs much better than the VSM and K-means based approach. The technique should therefore help users search for meaningful contextual information in a highly diversified collection of textual information, a key task in addressing the information overload problem.
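As a rough illustration of the gloss-overlap idea behind the Lesk algorithm, the sketch below implements a simplified Lesk variant in Python and groups documents by the sense it assigns to a target word. It is not the authors' implementation: the sense inventory `SENSES`, its sense identifiers, the stopword list, and the helper names `tokens`, `simplified_lesk`, and `cluster_by_sense` are all hypothetical stand-ins, and the toy glosses do not reproduce the enriched WordNet-like knowledge base described in the abstract.

```python
# Toy sense inventory: word -> {sense_id: gloss}. Every entry is a
# hypothetical stand-in for a WordNet-like knowledge base.
SENSES = {
    "bank": {
        "bank.finance": "financial institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a river or stream of water",
    },
}

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "in", "on", "at", "by", "is", "for"}


def tokens(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def simplified_lesk(word, context):
    """Return the sense id whose gloss shares the most words with the context.

    Returns None when the word is unknown or no gloss overlaps the context.
    """
    context_words = tokens(context) - {word}
    best_sense, best_overlap = None, 0
    for sense_id, gloss in SENSES.get(word, {}).items():
        overlap = len(tokens(gloss) & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


def cluster_by_sense(word, documents):
    """Group documents by the sense assigned to `word` in each document."""
    clusters = {}
    for doc in documents:
        clusters.setdefault(simplified_lesk(word, doc), []).append(doc)
    return clusters


docs = [
    "She went to the bank to deposit money.",
    "They sat on the bank of the river.",
]

for sense, members in cluster_by_sense("bank", docs).items():
    print(sense, members)
```

In this toy run each document overlaps its gloss on a single word, and "deposit" fails to match "deposits" because no stemming is applied. That fragility suggests why adding more descriptive fields per synset, as the abstract proposes, would be expected to raise the chance of a contextual overlap.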

Index Terms

Computer Science
Information Sciences

Keywords

Lesk, Synset, WSD, Knowledge Base, K-means, Vector Space Model, Context, WordNet