A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm

Poonam Ghuli; Maanas Prabhakar; Rajashree Shettar

Call for Paper

July Edition

IJCA solicits high quality original research papers for the upcoming July edition of the journal. The last date of research paper submission is 22 June 2026

Submit your paper

Know more

The week's pick

CAD-Genesis: An Open-Source AI-Powered Add-in for Natural Language-Driven Parametric CAD Modeling and Cross-Platform Integration in SolidWorks and Fusion 360

Anil Mandloi Prakhi Mandloi

Random Articles

Multiport Memory Design in Quantum Dot Cellular Automata Platform

Oct

2019

SysRisk ñA Decisional Framework to Measure System Dimensions of Legacy Application for Rejuvenation through Reengineering

February

2011

Open Reviewing for Imparted Information to Effective Client Denial in the Cloud

June

2015

Enabling Effective Personalized Learning: Determinants for Knowledge based Web Information Retrieval Systems

April

2015

Reseach Article

A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm

by Poonam Ghuli, Maanas Prabhakar, Rajashree Shettar

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 125 - Number 5

Year of Publication: 2015

Authors: Poonam Ghuli, Maanas Prabhakar, Rajashree Shettar

10.5120/ijca2015905919

Poonam Ghuli, Maanas Prabhakar, Rajashree Shettar . A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm. International Journal of Computer Applications. 125, 5 ( September 2015), 36-42. DOI=10.5120/ijca2015905919

@article{ 10.5120/ijca2015905919,

author = { Poonam Ghuli, Maanas Prabhakar, Rajashree Shettar },

title = { A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm },

journal = { International Journal of Computer Applications },

issue_date = { September 2015 },

volume = { 125 },

number = { 5 },

month = { September },

year = { 2015 },

issn = { 0975-8887 },

pages = { 36-42 },

numpages = {9},

url = { https://ijcaonline.org/archives/volume125/number5/22432-2015905919/ },

doi = { 10.5120/ijca2015905919 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-06T23:15:16.313597+05:30

%A Poonam Ghuli

%A Maanas Prabhakar

%A Rajashree Shettar

%T A Comprehensive Survey on Centroid Selection Strategies for Distributed K-means Clustering Algorithm

%J International Journal of Computer Applications

%@ 0975-8887

%V 125

%N 5

%P 36-42

%D 2015

%I Foundation of Computer Science (FCS), NY, USA

Abstract

Extremely large data sets often known as ‘Big Data’ are analyzed for interesting patterns, trends, and associations, especially those relating to human behavior and interactions. Extraction of meaningful and useful information needs to be done in parallel using advanced clustering algorithms. In this paper, effort has been made to tweak in changes to the existing K-means algorithm so as to work in parallel using MapReduce paradigm. K-means due to its gradient descent nature is highly sensitive to the initial placement of the cluster centers. This random initialization of cluster centers results in empty clusters and slower convergence. In this paper, an overview of existing methods with emphasis on computational efficiency is presented. Comparison of three well known linear time complexity initialization methods has been presented here. These methods are analyzed on two different data sets. The experimental results are recorded and presented with insights on different initialization methods for practitioners.

References

Min Chen, Shiwen Mao, Yunhao Liu, "Big Data: A Survey", Mobile Networks and Applications, Springer publication, Volume 19, (2), April 2014, pp 171-209
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM, Vol 51, no. 1 (2008): 107-113.
Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." In Sixth Symposium on Operating System Design and Implementation (OSDI 2004), Dec 2004, pp. 137-150.
Mahmud, M. S., Rahman, M. M., & Akhtar, M. N. (2012, December). Improvement of K-means clustering algorithm with better initial centroids based on weighted average. In Electrical & Computer Engineering (ICECE), 2012 7th International Conference on (pp. 647-650). IEEE.
Ding, C., & He, X. (2004, July). K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning (p. 29). ACM.
Ding, C. H., & He, X. (2004, April). Principal Component Analysis and Effective K-Means Clustering. In SDM (pp. 497-501).
D.Napoleon, S.Pavalakodi “A New Method for Dimensionality Reduction using KMeans Clustering Algorithm for High Dimensional Data Set” International Journal of Computer Applications , Vol. 13, (7), January 2011
Nazeer, K., Kumar, S. M., & Sebastian, M. P. (2011, February). Enhancing the k-means clustering algorithm by using a O (n logn) heuristic method for finding better initial centroids. In Emerging Applications of Information Technology (EAIT), 2011 Second International Conference on (pp. 261-264). IEEE.
J. M. Pena, J. A. Lozano, P. Larranaga, 1999, An Empirical Comparison of Four Initialization Methods for the K-Means Algorithm, Pattern Recognition Letters 20 (10) (1999) 1027–1040.
J. He, M. Lan, C. L. Tan, S. Y. Sung, H. B. Low 2004 Initialization of Cluster Refinement Algorithms: A Review and Comparative Study, in: Proc. of the 2004 IEEE Int. Joint Conf. on Neural Networks, pp. 297–302.
A. K. Jain, M. N. Murty, P. J. Flynn 1999 Data Clustering: A Review, ACM Computing Surveys, 31 (3), pp 264–323.
A. K. Jain 2010 Data Clustering: 50 Years Beyond K-means, Pattern Recognition Letters 31 (8) 651–666.
S. Z. Selim, M. A. Ismail, K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality, IEEE Trans. on Pattern Analysis and Machine Intelligence 6 (1) (1984) 81–87.
L. Bottou, Y. Bengio, Advances in Neural Information Processing Systems 7, MIT Press, 1995, Ch. Convergence Properties of the K-Means Algorithms, pp. 585–592.
Celebi, M. Emre, Hassan A. Kingravi, and Patricio A. Vela. "A comparative study of efficient initialization methods for the k-means clustering algorithm." Elsevier, Expert Systems with Applications, 40, no. 1 (2013): 200-210.
E. Dede, Z. Fadika, M. Govindaraju, and L. Ramakrishnan, “Benchmarking MapReduce implementations under different application scenarios,” in Future Generation Computer Systems, Special Section: eScience Infrastructure and Applications, Vol. 36. IEEE Computer Society Washington, DC: Elsevier, 2014, pp. 389-399.
S. Sakr, A. Liu, and A. G. Fayoumi. “The family of MapReduce and large-scale data processing systems,” ACM Comput. Surv. (CSUR), Vol. 46, no. 1, Oct. 2013, Article no. 11.
M. Yoon, H.-il Kim, D. H. Choi, H. Jo, and J.-w. Chang, “Performance analysis of MapReduce-based distributed systems for iterative data processing applications,” Mobile Ubiquit. Intell. Comput., Vol. 274, no. 1, pp. 293_299, Mar. 2014.
Ghuli, P., Shukla, A., Kiran, R., Jason, S., & Shettar, R. (2015). Multidimensional Canopy Clustering on Iterative MapReduce Framework Using Elefig Tool. IETE Journal of Research, 61(1), 14-21.

Index Terms

Computer Science

Information Sciences

Keywords

K-means Clustering Algorithm Hadoop MapReduce PCA HDFS.