Research Article

Graph based Text Document Clustering by Detecting Initial Centroids for k-Means

by Vikas Kumar Sihag, Subhash Kumar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 62 - Number 19
Year of Publication: 2013
Authors: Vikas Kumar Sihag, Subhash Kumar
DOI: 10.5120/10185-5005

Vikas Kumar Sihag, Subhash Kumar. Graph based Text Document Clustering by Detecting Initial Centroids for k-Means. International Journal of Computer Applications. 62, 19 (January 2013), 1-4. DOI=10.5120/10185-5005

@article{ 10.5120/10185-5005,
author = { Vikas Kumar Sihag and Subhash Kumar },
title = { Graph based Text Document Clustering by Detecting Initial Centroids for k-Means },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 62 },
number = { 19 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 1-4 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume62/number19/10185-5005/ },
doi = { 10.5120/10185-5005 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:12:13.051736+05:30
%A Vikas Kumar Sihag
%A Subhash Kumar
%T Graph based Text Document Clustering by Detecting Initial Centroids for k-Means
%J International Journal of Computer Applications
%@ 0975-8887
%V 62
%N 19
%P 1-4
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Document clustering is used in information retrieval to organize a large collection of text documents into meaningful clusters. The k-means algorithm, a partitional clustering method, performs well on document clustering. It organizes a large collection of items into k clusters so that a criterion function is optimized. Because k-means is sensitive to the initial values of the cluster centroids, this paper proposes a graph-based method to calculate appropriate initial centroids. The document collection is represented as a graph in which a node represents a document and an edge represents the similarity between two documents. To calculate the initial centroids, the community structure present in the graph is detected using an edge deletion technique. Using the community structure, the centrality of each node is calculated. The centrality value of a node represents its candidacy for being a cluster centroid. Using the community structure ensures that the calculated centroids have a sufficient number of topically related documents and are well separated from each other. k-means with these initial centroids provides a significant improvement over simple k-means for text document clustering.
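
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the edge deletion rule or the centrality measure, so the sketch assumes whitespace tokenization with raw term frequencies, deletes the weakest-similarity edge until k communities appear, and uses the sum of a node's remaining edge weights as its centrality. The function names `cosine`, `components`, and `initial_centroids` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def components(n, edges):
    """Connected components of an undirected graph on nodes 0..n-1."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return comps


def initial_centroids(docs, k):
    """Return (seed document indices, communities) for seeding k-means."""
    # Assumption: whitespace tokens and raw term counts as document vectors.
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    # Node = document, edge weight = cosine similarity (zero-weight edges omitted).
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine(vecs[i], vecs[j])
            if w > 0:
                edges[(i, j)] = w
    # Assumed edge deletion rule: remove the weakest edge until k communities exist.
    comms = components(n, edges)
    while len(comms) < k and edges:
        del edges[min(edges, key=edges.get)]
        comms = components(n, edges)
    # If the graph splits into more than k parts, keep the k largest.
    comms = sorted(comms, key=len, reverse=True)[:k]

    def centrality(u):
        # Assumed centrality: total weight of the edges still attached to u.
        return sum(w for (i, j), w in edges.items() if u in (i, j))

    seeds = [max(comm, key=centrality) for comm in comms]
    return seeds, comms


docs = [
    "graph community detection network",
    "graph network edge community",
    "network graph community team",
    "football match goal team",
    "team football goal score",
    "goal team match football",
]
seeds, comms = initial_centroids(docs, k=2)
```

In this toy corpus the only links between the two topics are the weak edges created by the shared word "team", so they are deleted first and each topic ends up as its own community. The vectors of the returned seed documents would then be passed to k-means as its initial centroids in place of random ones.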

Index Terms

Computer Science
Information Sciences

Keywords

Text mining, Document clustering, Cosine similarity, k-means