Research Article

Study and Implementing K-mean Clustering Algorithm on English Text and Techniques to Find the Optimal Value of K

by Sajid Naeem, Aishan Wumaier
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 182 - Number 31
Year of Publication: 2018
Authors: Sajid Naeem, Aishan Wumaier
10.5120/ijca2018918234

Sajid Naeem, Aishan Wumaier. Study and Implementing K-mean Clustering Algorithm on English Text and Techniques to Find the Optimal Value of K. International Journal of Computer Applications. 182, 31 (Dec 2018), 7-14. DOI=10.5120/ijca2018918234

@article{ 10.5120/ijca2018918234,
author = { Sajid Naeem and Aishan Wumaier },
title = { Study and Implementing K-mean Clustering Algorithm on English Text and Techniques to Find the Optimal Value of K },
journal = { International Journal of Computer Applications },
issue_date = { Dec 2018 },
volume = { 182 },
number = { 31 },
month = { Dec },
year = { 2018 },
issn = { 0975-8887 },
pages = { 7-14 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume182/number31/30224-2018918234/ },
doi = { 10.5120/ijca2018918234 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:12:58.494899+05:30
%A Sajid Naeem
%A Aishan Wumaier
%T Study and Implementing K-mean Clustering Algorithm on English Text and Techniques to Find the Optimal Value of K
%J International Journal of Computer Applications
%@ 0975-8887
%V 182
%N 31
%P 7-14
%D 2018
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In data mining, clustering is the process of assigning a set of items to groups of similar items, called clusters. Document clustering has been a rapidly developing research area for decades and is considered a vital task in text mining because of the exceptional growth in the number of documents online. It makes it possible to organize a large amount of scattered text into meaningful clusters and lays the foundation for smooth descriptive browsing and navigation systems. One of the most widely used partitioning algorithms is k-means, which is frequently applied to text clustering because it converges to a local optimum even on very large sparse matrices. Its objective is to make the distances between items (data points) belonging to the same cluster as small as possible. This paper explores how partitional (k-means) clustering works for text documents and, in particular, examines one of the basic disadvantages of k-means: the need to determine the true value of K. The true value of K is easy to understand conceptually, but automatically selecting a suitable value is a hard algorithmic problem. The true K tells us how many clusters should be formed in the dataset, but it is often ambiguous and the question has no single answer, although many variants of k-means have been proposed to estimate it. Besides these variants, researchers have proposed a range of other techniques for determining it. This paper explains how to apply some of these techniques to find the true value of K in a text dataset.
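
The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes plain term-count vectors, squared Euclidean distance, a deterministic farthest-first seeding, and the elbow heuristic (largest second difference of the within-cluster sum of squared errors) as one possible way to pick K. The toy corpus and the helper names `vectorize`, `kmeans`, and `elbow` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's k-means; returns (labels, SSE). Seeding is an assumption:
    farthest-first from the first point, so runs are reproducible."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: a local optimum
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, sse


def elbow(sse_by_k):
    """Pick the K where the SSE curve bends most (largest second difference).
    Assumes consecutive integer values of K with at least three entries."""
    ks = sorted(sse_by_k)
    return max(
        ks[1:-1],
        key=lambda k: sse_by_k[k - 1] - 2 * sse_by_k[k] + sse_by_k[k + 1],
    )


# Hypothetical toy corpus with three obvious topics.
docs = [
    "cat dog cat",
    "dog cat dog",
    "stock market stock",
    "market stock market",
    "goal match goal",
    "match goal match",
]
vectors = vectorize(docs)
sse_by_k = {k: kmeans(vectors, k)[1] for k in range(1, 5)}
best_k = elbow(sse_by_k)
labels, _ = kmeans(vectors, best_k)
print(sse_by_k, best_k, labels)
```

On this toy corpus the error drops sharply up to K = 3 and flattens afterwards, which is the bend the elbow heuristic looks for. On real text the bend is usually far less clear, which is the ambiguity the abstract describes.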

Index Terms

Computer Science
Information Sciences

Keywords

K-Means Clustering, Unsupervised Learning, Pre-processing