Research Article

A Simple Yet Fast Clustering Approach for Categorical Data

by Garima Khandelwal, Rakesh Sharma
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 120 - Number 17
Year of Publication: 2015
Authors: Garima Khandelwal, Rakesh Sharma
DOI: 10.5120/21321-4341

Garima Khandelwal, Rakesh Sharma. A Simple Yet Fast Clustering Approach for Categorical Data. International Journal of Computer Applications. 120, 17 (June 2015), 25-30. DOI=10.5120/21321-4341

@article{ 10.5120/21321-4341,
author = { Garima Khandelwal and Rakesh Sharma },
title = { A Simple Yet Fast Clustering Approach for Categorical Data },
journal = { International Journal of Computer Applications },
issue_date = { June 2015 },
volume = { 120 },
number = { 17 },
month = { June },
year = { 2015 },
issn = { 0975-8887 },
pages = { 25-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume120/number17/21321-4341/ },
doi = { 10.5120/21321-4341 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:06:29.772011+05:30
%A Garima Khandelwal
%A Rakesh Sharma
%T A Simple Yet Fast Clustering Approach for Categorical Data
%J International Journal of Computer Applications
%@ 0975-8887
%V 120
%N 17
%P 25-30
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Categorical data has always posed a challenge for clustering-based data analysis. With growing interest in big data analysis, the need for better clustering methods for categorical and mixed data has grown. Prevailing clustering algorithms are unsuitable for categorical data mainly because the distance functions used for continuous data do not apply to it. Recent research has explored several approaches to clustering categorical data. However, the complexity of these methods makes them unsuitable for big data, where faster algorithms are needed. This paper proposes a simple, fast method derived from statistics for clustering categorical data. Results on popular datasets are encouraging.
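
The point about distance functions can be illustrated with a small sketch. It is not the method proposed in the paper, which the abstract does not describe; it only shows why a numeric distance gives arbitrary results on categorical values. The function names, the colour values, and the integer codings are all hypothetical, and simple matching (counting mismatched attributes) is swapped in as one common categorical alternative.

```python
import math


def euclidean_on_codes(a, b, codes):
    """Euclidean distance after mapping each category to an integer code.

    The coding is arbitrary, so the resulting distance is arbitrary too.
    """
    return math.sqrt(sum((codes[x] - codes[y]) ** 2 for x, y in zip(a, b)))


def simple_matching(a, b):
    """Number of attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical single-attribute records.
red, green, blue = ("red",), ("green",), ("blue",)

# Two equally valid integer codings of the same three categories.
codes_1 = {"red": 0, "green": 1, "blue": 2}
codes_2 = {"red": 0, "green": 2, "blue": 1}

# The numeric distance between red and blue changes with the coding.
d1 = euclidean_on_codes(red, blue, codes_1)
d2 = euclidean_on_codes(red, blue, codes_2)

# Simple matching treats every pair of distinct categories alike.
m_rb = simple_matching(red, blue)
m_rg = simple_matching(red, green)

print(d1, d2, m_rb, m_rg)
```

Under the first coding, red appears twice as far from blue as under the second, although nothing about the data has changed. The mismatch count depends only on whether the values differ.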

Index Terms

Computer Science
Information Sciences

Keywords

Clustering, categorical data, big data, k-means