International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 62 - Number 19
Year of Publication: 2013
Authors: Vikas Kumar Sihag, Subhash Kumar
DOI: 10.5120/10185-5005
Vikas Kumar Sihag, Subhash Kumar. Graph based Text Document Clustering by Detecting Initial Centroids for k-Means. International Journal of Computer Applications. 62, 19 (January 2013), 1-4. DOI=10.5120/10185-5005
Document clustering is used in information retrieval to organize a large collection of text documents into meaningful clusters. The k-means algorithm, a partitional clustering method, performs well on document clustering: it organizes a large collection of items into k clusters so that a criterion function is optimized. Because k-means is sensitive to the initial values of the cluster centroids, this paper proposes a graph-based method to calculate appropriate initial centroids. The document collection is represented as a graph in which each node represents a document and each edge represents the similarity between two documents. To calculate the initial centroids, the community structure present in the graph is detected using an edge deletion technique. Using this community structure, the centrality of each node is calculated; the centrality value of a node represents its candidacy for being a cluster centroid. The use of community structure ensures that the calculated centroids have a sufficient number of topically related documents and are well separated from each other. k-means with these initial centroids provides a significant improvement over simple k-means for text document clustering.
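The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the similarity measure, the edge deletion criterion, or the centrality measure, so everything concrete here is an assumption made for illustration: cosine similarity over raw term counts, deletion of the weakest edge first until the graph splits into at least k connected components, and weighted degree within a community as the centrality score. The names `cosine`, `components`, and `initial_centroids` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def components(n, edges):
    """Connected components of an undirected graph on nodes 0..n-1."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node] - seen:
                seen.add(nb)
                stack.append(nb)
        comps.append(sorted(comp))
    return comps


def initial_centroids(docs, k):
    """Return indices of k documents to use as initial k-means centroids."""
    n = len(docs)
    vecs = [Counter(d.lower().split()) for d in docs]
    sim = {(i, j): cosine(vecs[i], vecs[j]) for i, j in combinations(range(n), 2)}

    # Similarity graph: one edge per document pair with non-zero similarity,
    # ordered weakest first (assumed deletion order, not the paper's criterion).
    edges = sorted((e for e in sim if sim[e] > 0), key=lambda e: sim[e])

    # Edge deletion: drop the weakest edge until at least k communities appear.
    comps = components(n, edges)
    while len(comps) < k and edges:
        edges.pop(0)
        comps = components(n, edges)

    # Keep the k largest communities so each seed has related documents nearby.
    comps.sort(key=len, reverse=True)

    def centrality(node):
        # Assumed centrality: total similarity to neighbours in the same community.
        return sum(sim[e] for e in edges if node in e)

    return [max(comp, key=centrality) for comp in comps[:k]]


docs = [
    "graph cluster centroid node",
    "graph cluster centroid edge",
    "graph cluster node team",
    "soccer goal match team",
    "soccer goal match league",
    "soccer goal team player",
]
seeds = initial_centroids(docs, 2)
print(seeds)
```

The returned indices identify the documents whose vectors would seed k-means in place of random initialization. Because each seed comes from a different community, the seeds are drawn from different topical groups, which mirrors the separation property the abstract claims for the proposed method.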