Semantic based Document Clustering: A Detailed Review

Neepa Shah; Sunita Mahajan

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 52 - Number 5

Year of Publication: 2012

Authors: Neepa Shah, Sunita Mahajan

10.5120/8202-1598

Neepa Shah, Sunita Mahajan. Semantic based Document Clustering: A Detailed Review. International Journal of Computer Applications. 52, 5 (August 2012), 42-52. DOI=10.5120/8202-1598

@article{ 10.5120/8202-1598,

author = { Neepa Shah and Sunita Mahajan },

title = { Semantic based Document Clustering: A Detailed Review },

journal = { International Journal of Computer Applications },

issue_date = { August 2012 },

volume = { 52 },

number = { 5 },

month = { August },

year = { 2012 },

issn = { 0975-8887 },

pages = { 42-52 },

numpages = {9},

url = { https://ijcaonline.org/archives/volume52/number5/8202-1598/ },

doi = { 10.5120/8202-1598 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-06T20:51:31.784942+05:30

%A Neepa Shah

%A Sunita Mahajan

%T Semantic based Document Clustering: A Detailed Review

%J International Journal of Computer Applications

%@ 0975-8887

%V 52

%N 5

%P 42-52

%D 2012

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Document clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm in which clustering methods try to identify inherent groupings of text documents, producing a set of clusters with high intra-cluster similarity and low inter-cluster similarity. The importance of document clustering stems from the massive volumes of textual documents being created. Although numerous document clustering methods have been studied extensively in recent years, several challenges remain in improving clustering quality. In particular, most current document clustering algorithms do not consider semantic relationships, which leads to unsatisfactory clustering results. Over the last three to four years, efforts have been made to apply semantics to document clustering. Here, an exhaustive and detailed review of more than thirty semantics-driven document clustering methods is presented. After an introduction to document clustering and its basic requirements for improvement, traditional algorithms are overviewed, and semantic similarity measures are explained. The article then discusses algorithms that interpret documents semantically for clustering. For each approach, the semantic technique applied, datasets used, evaluation parameters, limitations, and future work are presented in tabular format for easy and quick interpretation.
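
The clustering objective stated in the abstract (high intra-cluster similarity, low inter-cluster similarity) can be illustrated with a minimal sketch. This is not a method from the reviewed paper: it assumes plain term-frequency vectors and cosine similarity, and the toy documents, cluster labels, and function names are hypothetical. The final lines hint at why purely lexical measures fall short, since two synonyms share no terms and score zero.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def intra_cluster_similarity(clusters):
    """Mean pairwise similarity of documents within the same cluster."""
    sims = [cosine(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(sims) / len(sims) if sims else 0.0


def inter_cluster_similarity(clusters):
    """Mean pairwise similarity of documents in different clusters."""
    sims = [
        cosine(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical toy corpus, already grouped into two clusters.
docs = {
    "sports": [
        "the team won the football match",
        "the football team lost the match",
    ],
    "finance": [
        "the bank raised interest rates",
        "interest rates at the bank fell",
    ],
}
clusters = [[tf_vector(d) for d in group] for group in docs.values()]

intra = intra_cluster_similarity(clusters)
inter = inter_cluster_similarity(clusters)
print(f"intra-cluster: {intra:.3f}, inter-cluster: {inter:.3f}")

# Lexical matching misses synonymy: no shared term, so similarity is zero.
synonym_sim = cosine(tf_vector("car"), tf_vector("automobile"))
print(f"car vs automobile: {synonym_sim}")
```

A good clustering of this toy corpus gives an intra-cluster score well above the inter-cluster score. The zero score for the synonym pair is the kind of gap that semantic approaches aim to close.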

Index Terms

Computer Science

Information Sciences

Keywords

Document clustering; semantic based document clustering; requirements of document clustering; semantic similarity for document clustering