Article:Document Clustering based on Topic Maps

Muhammad Rafi; M. Shahid Shaikh; Amir Farooq

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 12 - Number 1

Year of Publication: 2010

10.5120/1640-2204

Muhammad Rafi, M. Shahid Shaikh, Amir Farooq. Article: Document Clustering based on Topic Maps. International Journal of Computer Applications. 12, 1 (December 2010), 32-36. DOI=10.5120/1640-2204

@article{ 10.5120/1640-2204,

author = { Muhammad Rafi and M. Shahid Shaikh and Amir Farooq },

title = { Article:Document Clustering based on Topic Maps },

journal = { International Journal of Computer Applications },

issue_date = { December 2010 },

volume = { 12 },

number = { 1 },

month = { December },

year = { 2010 },

issn = { 0975-8887 },

pages = { 32-36 },

numpages = {9},

url = { https://ijcaonline.org/archives/volume12/number1/1640-2204/ },

doi = { 10.5120/1640-2204 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-06T20:00:35.208329+05:30

%A Muhammad Rafi

%A M. Shahid Shaikh

%A Amir Farooq

%T Article:Document Clustering based on Topic Maps

%J International Journal of Computer Applications

%@ 0975-8887

%V 12

%N 1

%P 32-36

%D 2010

%I Foundation of Computer Science (FCS), NY, USA

Abstract

The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large document collections such as the World Wide Web (WWW). The next challenge lies in clustering documents according to their semantic content. The problem of document clustering has two main components: (1) representing each document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document; and (2) defining a similarity measure over this semantic representation that assigns higher numerical values to document pairs with a stronger semantic relationship. The feature space of documents can be very challenging for clustering: a document may contain multiple topics, a large set of class-independent general words, and only a handful of class-specific core words. Given these characteristics, traditional agglomerative clustering algorithms, which are based on either the Document Vector model (DVM) or the Suffix Tree model (STC), are less efficient at producing results with high cluster quality. This paper introduces a new approach to document clustering based on the Topic Map representation of documents. Each document is transformed into a compact form, and a similarity measure is proposed based on the information inferred from topic map data and structures. The suggested method is implemented using agglomerative hierarchical clustering and tested on standard information retrieval (IR) datasets. The comparative experiment reveals that the proposed approach is effective in improving cluster quality.
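
The two components named in the abstract, a compact document representation and a similarity measure over it, feed directly into agglomerative hierarchical clustering. The sketch below illustrates that pipeline only in outline. It is not the authors' method: each document is reduced to a plain set of topic names as a stand-in for a topic map, Jaccard overlap stands in for the proposed similarity measure, and average linkage is an assumed merge criterion. The function names (`jaccard`, `average_link`, `agglomerative`) and the sample documents are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two topic sets (assumed stand-in similarity)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def average_link(c1, c2, docs):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(jaccard(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(docs, k):
    """Start with one cluster per document and repeatedly merge the
    most similar pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        # combinations() yields index pairs with a < b, so deleting
        # clusters[b] after the merge never shifts clusters[a].
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_link(clusters[p[0]], clusters[p[1]], docs),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical documents, each already reduced to a set of topic names.
docs = [
    {"clustering", "similarity", "hierarchy"},
    {"clustering", "similarity", "topic map"},
    {"genome", "protein", "sequence"},
    {"genome", "protein", "enzyme"},
]

result = agglomerative(docs, 2)
print(result)
```

Representing documents as small topic sets rather than full term vectors is what makes the pairwise comparison cheap here; a richer measure that also uses associations between topics would replace `jaccard` without changing the merge loop.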


Index Terms

Computer Science

Information Sciences

Keywords

Text Document Document Clustering Algorithm Performance measure