Research Article

Semantic based Document Clustering: A Detailed Review

by Neepa Shah, Sunita Mahajan
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 52 - Number 5
Year of Publication: 2012
Authors: Neepa Shah, Sunita Mahajan
10.5120/8202-1598

Neepa Shah, Sunita Mahajan. Semantic based Document Clustering: A Detailed Review. International Journal of Computer Applications. 52, 5 (August 2012), 42-52. DOI=10.5120/8202-1598

@article{ 10.5120/8202-1598,
author = { Neepa Shah and Sunita Mahajan },
title = { Semantic based Document Clustering: A Detailed Review },
journal = { International Journal of Computer Applications },
issue_date = { August 2012 },
volume = { 52 },
number = { 5 },
month = { August },
year = { 2012 },
issn = { 0975-8887 },
pages = { 42-52 },
numpages = {11},
url = { https://ijcaonline.org/archives/volume52/number5/8202-1598/ },
doi = { 10.5120/8202-1598 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:51:31.784942+05:30
%A Neepa Shah
%A Sunita Mahajan
%T Semantic based Document Clustering: A Detailed Review
%J International Journal of Computer Applications
%@ 0975-8887
%V 52
%N 5
%P 42-52
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Document clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm in which clustering methods try to identify inherent groupings of text documents, producing a set of clusters that exhibit high intra-cluster similarity and low inter-cluster similarity. The importance of document clustering stems from the massive volume of textual documents being created. Although numerous document clustering methods have been studied extensively in recent years, several challenges remain in improving clustering quality. In particular, most current document clustering algorithms do not consider semantic relationships, which leads to unsatisfactory clustering results. Over the last three to four years, efforts have been made to apply semantics to document clustering. Here, an exhaustive and detailed review of more than thirty semantics-driven document clustering methods is presented. After an introduction to document clustering and its basic requirements for improvement, traditional algorithms are surveyed. Semantic similarity measures are also explained. The article then discusses algorithms that interpret documents semantically for clustering. For each approach, the semantic technique applied, datasets used, evaluation parameters, limitations, and future work are presented in tabular format for easy and quick interpretation.
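
The clustering criterion stated in the abstract (high intra-cluster similarity, low inter-cluster similarity) can be illustrated with a minimal sketch. This example is not taken from the article: the toy documents, the cluster labels, and the helper names `tf_vector`, `cosine`, and `mean_similarities` are all assumptions made for illustration, and plain term-frequency cosine similarity stands in for whatever measures the reviewed methods use.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def _mean(values):
    """Arithmetic mean, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def mean_similarities(docs, labels):
    """Return (mean intra-cluster, mean inter-cluster) pairwise cosine similarity."""
    vecs = [tf_vector(d) for d in docs]
    intra, inter = [], []
    for i, j in combinations(range(len(docs)), 2):
        sim = cosine(vecs[i], vecs[j])
        if labels[i] == labels[j]:
            intra.append(sim)
        else:
            inter.append(sim)
    return _mean(intra), _mean(inter)


# Toy corpus and a hand-assigned clustering (both assumed for illustration).
docs = [
    "the car engine needs repair",
    "the automobile motor needs fixing",
    "car engine repair shop",
    "stock market prices fell",
    "market prices for stock rose",
]
labels = [0, 0, 0, 1, 1]

intra, inter = mean_similarities(docs, labels)
print(f"mean intra-cluster similarity: {intra:.3f}")
print(f"mean inter-cluster similarity: {inter:.3f}")

# Purely lexical matching scores these two related documents as unrelated,
# because "automobile motor" and "car engine" share no surface terms.
print(cosine(tf_vector(docs[1]), tf_vector(docs[2])))
```

In this toy setup the second and third documents describe the same topic but share no words, so a term-based measure gives them zero similarity. That gap is the kind of limitation the abstract attributes to algorithms that ignore semantic relationships.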

Index Terms

Computer Science
Information Sciences

Keywords

Document clustering, semantic based document clustering, requirements of document clustering, semantic similarity for document clustering