Research Article

Article: Comparing K-Value Estimation for Categorical and Numeric Data Clustering

by K. Arunprabha, V. Bhuvaneswari
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 11 - Number 3
Year of Publication: 2010
Authors: K. Arunprabha, V. Bhuvaneswari
DOI: 10.5120/1565-1875

K. Arunprabha, V. Bhuvaneswari. Comparing K-Value Estimation for Categorical and Numeric Data Clustering. International Journal of Computer Applications. 11, 3 (December 2010), 4-7. DOI=10.5120/1565-1875

@article{ 10.5120/1565-1875,
author = { K. Arunprabha and V. Bhuvaneswari },
title = { Comparing K-Value Estimation for Categorical and Numeric Data Clustering },
journal = { International Journal of Computer Applications },
issue_date = { December 2010 },
volume = { 11 },
number = { 3 },
month = { December },
year = { 2010 },
issn = { 0975-8887 },
pages = { 4-7 },
numpages = { 4 },
url = { https://ijcaonline.org/archives/volume11/number3/1565-1875/ },
doi = { 10.5120/1565-1875 },
publisher = { Foundation of Computer Science (FCS), NY, USA },
address = { New York, USA }
}
%0 Journal Article
%A K. Arunprabha
%A V. Bhuvaneswari
%T Comparing K-Value Estimation for Categorical and Numeric Data Clustering
%J International Journal of Computer Applications
%@ 0975-8887
%V 11
%N 3
%P 4-7
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In data mining, clustering is one of the major tasks and aims at grouping data objects into meaningful classes (clusters) such that the similarity of objects within a cluster is maximized and the similarity of objects in different clusters is minimized. When clustering a dataset, the right number of clusters k is often not obvious, and choosing k automatically is a hard algorithmic problem. We present an improved algorithm for learning k while clustering categorical data. The algorithm, Gaussian means (G-means), works in the k-means paradigm and is applied to categorical features by first converting the categorical dataset into a numeric one. In this paper we present heuristic techniques for this conversion and compare the clustering of the categorical data with that of the numeric data. The G-means algorithm is based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. G-means requires only one intuitive parameter, the standard statistical significance level α.
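
To make the loop described above concrete, the following is a minimal Python sketch of the G-means idea, assuming NumPy, SciPy, and scikit-learn are available. The helper names (g_means, cluster_is_gaussian) are illustrative only and do not reproduce the authors' implementation: the original procedure splits individual centers hierarchically and takes the significance level α as a parameter, whereas this sketch simply grows k globally and fixes the Anderson-Darling test at SciPy's strictest (1%) critical value.

    import numpy as np
    from scipy.stats import anderson
    from sklearn.cluster import KMeans

    def cluster_is_gaussian(points):
        """Anderson-Darling normality test on the cluster's principal axis."""
        centered = points - points.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        projection = centered @ vt[0]            # reduce the cluster to one dimension
        result = anderson(projection, dist="norm")
        # Accept "Gaussian" if the statistic is below the strictest (1%) critical value.
        return result.statistic < result.critical_values[-1]

    def g_means(data, max_k=20):
        """Grow k until every cluster passes the Gaussian test (or max_k is reached)."""
        km = None
        k = 1
        while k <= max_k:
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
            labels = km.labels_
            all_gaussian = all(
                cluster_is_gaussian(data[labels == c])
                for c in range(k)
                if np.sum(labels == c) > 7       # the test needs more than a handful of points
            )
            if all_gaussian:
                return k, km
            k += 1
        return max_k, km

Calling g_means(X) on a numeric matrix X (for example, a categorical dataset after conversion to numeric form) returns the estimated number of clusters and the fitted k-means model.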

References
  1. "Anderson-Darling: A Goodness of Fit Test for Small Samples Assumptions", START, Vol. 10, No. 5.
  2. Ahmed M. Sultan and Hala Mahmoud Khaleel, "A new modified goodness of fit tests for type 2 censored sample from normal population".
  3. Blake, C. L. and Merz, C. J., "UCI repository of machine learning databases", 1998.
  4. Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon, "Adaptive dimension reduction for clustering high dimensional data", In Proceedings of the 2nd IEEE International Conference on Data Mining, 2002.
  5. Dongmin Cai and Stephen S-T Yau, "Categorical Clustering by Converting Associated Information", International Journal of Computer Science, 1:1, 2006.
  6. Greg Hamerly and Charles Elkan, "Learning the k in k-means".
  7. Gregory James Hamerly, "Learning structure and concepts in data through data clustering", 2001.
  8. Jain, A. K., Murty, M. N., and Flynn, P. J., "Data clustering: a review", ACM Computing Surveys, 1999.
  9. Stephens, M. A., "EDF statistics for goodness of fit and some comparisons", Journal of the American Statistical Association, September 1974.
  10. Zhang, Y., Fu, A., Cai, C., and Heng, P., "Clustering categorical data", 2000.
  11. Zhexue Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values", 1998.
Index Terms

Computer Science
Information Sciences

Keywords

Data mining, Clustering algorithm, Categorical data, Gaussian distribution