International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 183 - Number 11
Year of Publication: 2021
Authors: Noor Basha, Ashok Kumar P.S.
DOI: 10.5120/ijca2021921415
Noor Basha, Ashok Kumar P.S. Distance-based K-Means Clustering Algorithm for Anomaly Detection in Categorical Datasets. International Journal of Computer Applications. 183, 11 (Jun 2021), 9-14. DOI=10.5120/ijca2021921415
Real-world datasets provide knowledge in an unsupervised manner, with distinct and complementary aspects. A number of algorithms have recently emerged in the field of cluster analysis, and it is difficult for a user to determine a priori which algorithm will be most suitable for a given dataset. Graph-based algorithms give good results for this task. Such algorithms are, however, vulnerable to outliers and noise, because they split a dataset using only the minimal edge information found in the tree. The need for better clustering algorithms is therefore growing in several fields, and the use of robust, dynamic algorithms to improve and simplify the whole process of data clustering has become an important research area. In this paper, a novel distance-based clustering algorithm, the entropic distance based K-means clustering algorithm (EDBK), is proposed to remove outliers effectively. The algorithm depends on the entropic distance between the attributes of data points and on some basic statistical operations. Experiments conducted on UCI datasets show that the EDBK method outperforms existing methods such as Artificial Bee Colony (ABC) and k-means. EDBK achieved 80.71% recall, 79.81% precision and 75.82% F-measure. The results show that the EDBK method not only improves clustering accuracy (to nearly 92%) but also greatly reduces the interference of outliers with the clustering results.
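The abstract does not spell out how the entropic distance or the outlier rule is defined, so the following is only a minimal illustrative sketch of the general idea, not the authors' EDBK algorithm. It assumes (hypothetically) that each categorical attribute is weighted by its Shannon entropy, that the distance between two records is the sum of the weights of the attributes on which they disagree, and that a record is flagged as an outlier when its distance to the nearest cluster center exceeds the mean distance plus `z` standard deviations. The function names, the toy data, the fixed centers and the cutoff `z=2.0` are all assumptions made for illustration.

```python
import math
from collections import Counter


def attribute_entropies(rows):
    """Shannon entropy (in bits) of each categorical attribute (column)."""
    n = len(rows)
    entropies = []
    for column in zip(*rows):
        counts = Counter(column)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    return entropies


def weighted_mismatch(a, b, weights):
    """Assumed distance: sum of attribute weights where two records disagree."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def flag_outliers(rows, centers, weights, z=2.0):
    """Flag rows whose distance to the nearest center exceeds mean + z * std."""
    dists = [min(weighted_mismatch(r, c, weights) for c in centers) for r in rows]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [d > mean + z * std for d in dists]


# Toy categorical data (hypothetical): two tight groups and one odd record.
rows = [
    ("red", "small"), ("red", "small"), ("red", "small"),
    ("blue", "large"), ("blue", "large"), ("blue", "large"),
    ("green", "medium"),
]
# Cluster centers (modes) are fixed by hand here; a full k-means-style loop
# would re-estimate them after each assignment step.
centers = [("red", "small"), ("blue", "large")]

weights = attribute_entropies(rows)
flags = flag_outliers(rows, centers, weights)
outliers = [row for row, is_outlier in zip(rows, flags) if is_outlier]
print(outliers)
```

In this sketch the flagged records would be dropped before the cluster centers are recomputed, which is one simple way to keep outliers from distorting the clusters; the paper itself should be consulted for the actual EDBK definitions.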