International Journal of Computer Applications |
Foundation of Computer Science (FCS), NY, USA |
Volume 52 - Number 5 |
Year of Publication: 2012 |
Authors: Neepa Shah, Sunita Mahajan |
10.5120/8202-1598 |
Neepa Shah, Sunita Mahajan . Semantic based Document Clustering: A Detailed Review. International Journal of Computer Applications. 52, 5 ( August 2012), 42-52. DOI=10.5120/8202-1598
Document clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm where clustering methods try to identify inherent groupings of the text documents, so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity. The importance of document clustering emerges from the massive volumes of textual documents created. Although numerous document clustering methods have been extensively studied in these years, there still exist several challenges for increasing the clustering quality. Particularly, most of the current document clustering algorithms does not consider the semantic relationships which produce unsatisfactory clustering results. Since last three-four years efforts have been seen in applying semantics to document clustering. Here, an exhaustive and detailed review of more than thirty semantic driven document clustering methods is presented. After an introduction to the document clustering and its basic requirements for improvement, traditional algorithms are overviewed. Also, semantic similarity measures are explained. The article then discusses algorithms that make semantic interpretation of documents for clustering. The semantic approach applied, datasets used, evaluation parameters applied, limitations and future work of all these approaches is presented in tabular format for easy and quick interpretation.