Graph based Text Document Clustering by Detecting Initial Centroids for k-Means

Vikas Kumar Sihag; Subhash Kumar

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 62 - Number 19

Year of Publication: 2013

Authors: Vikas Kumar Sihag, Subhash Kumar

DOI: 10.5120/10185-5005

Vikas Kumar Sihag, Subhash Kumar. Graph based Text Document Clustering by Detecting Initial Centroids for k-Means. International Journal of Computer Applications. 62, 19 (January 2013), 1-4. DOI=10.5120/10185-5005

@article{10.5120/10185-5005,
  author = {Vikas Kumar Sihag and Subhash Kumar},
  title = {Graph based Text Document Clustering by Detecting Initial Centroids for k-Means},
  journal = {International Journal of Computer Applications},
  issue_date = {January 2013},
  volume = {62},
  number = {19},
  month = {January},
  year = {2013},
  issn = {0975-8887},
  pages = {1-4},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume62/number19/10185-5005/},
  doi = {10.5120/10185-5005},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

Abstract

Document clustering is used in information retrieval to organize a large collection of text documents into meaningful clusters. The k-means algorithm, which belongs to the partitional category of clustering methods, performs well on document clustering. It organizes a large collection of items into k clusters so that a criterion function is optimized. Because k-means is sensitive to the initial values of the cluster centroids, this paper proposes a graph based method to calculate appropriate initial centroids. The document collection is represented as a graph in which a node represents a document and an edge represents the similarity between two documents. To calculate the initial centroids, the community structure present in the graph is detected using an edge deletion technique. Using this community structure, the centrality of each node is calculated. The centrality value of a node represents its candidacy for being a cluster centroid. The use of community structure ensures that each calculated centroid has a sufficient number of topically related documents and that the centroids are well separated from each other. k-means with these initial centroids provides a significant improvement over simple k-means for text document clustering.
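
The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the edge deletion criterion or the centrality measure, so this sketch makes two labeled assumptions: the weakest-similarity edge is deleted repeatedly until the graph splits into k communities, and centrality is taken as a node's total cosine similarity to the other members of its community. The function names (`cosine`, `components`, `initial_centroids`) and the toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def components(n, edges):
    """Connected components of an undirected graph on nodes 0..n-1."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u] - seen:
                seen.add(v)
                stack.append(v)
        comps.append(sorted(comp))
    return comps


def initial_centroids(docs, k):
    """Return indices of k seed documents, one per detected community."""
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    # Document graph: node = document, edge weight = cosine similarity.
    weights = {(i, j): cosine(vecs[i], vecs[j])
               for i, j in combinations(range(n), 2)}
    edges = {e: w for e, w in weights.items() if w > 0}
    # Edge deletion (ASSUMED criterion): remove the weakest edge
    # until the graph has at least k communities.
    comps = components(n, edges)
    for e in sorted(edges, key=edges.get):
        if len(comps) >= k:
            break
        del edges[e]
        comps = components(n, edges)

    # Centrality (ASSUMED measure): total similarity to community members.
    def centrality(u, comp):
        return sum(weights[min(u, v), max(u, v)] for v in comp if v != u)

    # Keep the k largest communities and take the most central node of each.
    comps = sorted(comps, key=len, reverse=True)[:k]
    return [max(comp, key=lambda u: centrality(u, comp)) for comp in comps]


# Toy collection (hypothetical): two topics linked by the shared word "score".
docs = [
    "graph node",
    "graph node edge cluster",
    "edge cluster score",
    "football goal",
    "football goal match team",
    "match team score",
]
seeds = initial_centroids(docs, 2)
print(seeds)  # prints [1, 4]
```

In this toy run the only edge between the two topics (documents 2 and 5, similarity 1/3) is the weakest and is deleted first, leaving two communities whose most central documents are 1 and 4. The vectors of the selected documents would then be passed to k-means as its starting centroids in place of random initialization.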

Index Terms

Computer Science

Information Sciences

Keywords

Text mining, Document clustering, Cosine similarity, k-means