Research Article

Improved Cluster Partition in Principal Component Analysis Guided Clustering

by S. M. Shaharudin, N. Ahmad, F. Yusof
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 75 - Number 11
Year of Publication: 2013
Authors: S. M. Shaharudin, N. Ahmad, F. Yusof
DOI: 10.5120/13156-0839

S. M. Shaharudin, N. Ahmad, F. Yusof. Improved Cluster Partition in Principal Component Analysis Guided Clustering. International Journal of Computer Applications. 75, 11 (August 2013), 22-25. DOI=10.5120/13156-0839

@article{ 10.5120/13156-0839,
author = { S. M. Shaharudin and N. Ahmad and F. Yusof },
title = { Improved Cluster Partition in Principal Component Analysis Guided Clustering },
journal = { International Journal of Computer Applications },
issue_date = { August 2013 },
volume = { 75 },
number = { 11 },
month = { August },
year = { 2013 },
issn = { 0975-8887 },
pages = { 22-25 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume75/number11/13156-0839/ },
doi = { 10.5120/13156-0839 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:44:01.266555+05:30
%A S. M. Shaharudin
%A N. Ahmad
%A F. Yusof
%T Improved Cluster Partition in Principal Component Analysis Guided Clustering
%J International Journal of Computer Applications
%@ 0975-8887
%V 75
%N 11
%P 22-25
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The principal component analysis (PCA) guided clustering approach is widely used on high-dimensional data to improve the efficiency of K-means cluster solutions. Typically, Pearson correlation is used in PCA to provide an eigen-analysis that yields the components accounting for most of the variation in the data. However, PCA based on Pearson correlation can be sensitive to non-Gaussian data that involve skewed observations such as outlying values. Thus, applying Pearson-based PCA to such data could affect cluster partitions and generate extremely imbalanced clusters in a high-dimensional space. In this study, Tukey's biweight correlation, based on the M-estimate approach, is used in PCA as an alternative to Pearson correlation. This approach is more resistant to outlying values because it examines each observation and down-weights those that lie far from the center of the data. In particular, two major features are highlighted: (1) fewer components are retained, and imbalanced clusters at the recommended cumulative percentage of variation threshold are avoided; (2) the cluster quality with respect to external, internal and relative criteria, as shown by the Rand, Silhouette and Davies-Bouldin indices, outperforms that of the clusters from PCA based on Pearson correlation.
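
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it uses a one-step biweight midcorrelation as a simplified stand-in for the iterative M-estimate, and the function names, the tuning constant c = 9, the 0.7 cumulative-variance threshold, and the plain Lloyd-style K-means are all assumptions made for the example.

```python
import numpy as np

def biweight_corr(X, c=9.0):
    """One-step biweight midcorrelation between the columns of X (n x p).

    Simplified stand-in for Tukey's biweight M-estimate; c = 9 is an assumed
    tuning constant. Each column is assumed to have a nonzero MAD.
    """
    med = np.median(X, axis=0)
    dev = X - med
    mad = np.median(np.abs(dev), axis=0)          # raw median absolute deviation
    u = dev / (c * mad)
    # Biweight: observations far from the center get weight 0.
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    z = dev * w
    z /= np.sqrt((z**2).sum(axis=0))              # unit-length weighted columns
    return z.T @ z                                # Gram matrix, unit diagonal

def pca_scores(X, R, threshold=0.7):
    """Project robustly scaled data onto the leading eigenvectors of R.

    Keeps the smallest number of components whose cumulative share of the
    eigenvalue sum reaches `threshold` (an assumed value for illustration).
    """
    eigvals, eigvecs = np.linalg.eigh(R)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cum, threshold) + 1), len(eigvals))
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    Z = (X - med) / mad                           # median/MAD scaling (assumption)
    return Z @ eigvecs[:, :k], k

def kmeans(S, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd-style K-means on the component scores S."""
    rng = np.random.default_rng(seed)
    centers = S[rng.choice(len(S), n_clusters, replace=False)]
    for _ in range(n_iter):
        d = ((S[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty.
        new = np.array([S[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(n_clusters)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 6))                 # synthetic data, not from the paper
    X[:5] += 25.0                                 # a few outlying rows
    R = biweight_corr(X)
    scores, k = pca_scores(X, R)
    labels, _ = kmeans(scores, n_clusters=3)
    print("components kept:", k, "cluster sizes:", np.bincount(labels, minlength=3))
```

Swapping `biweight_corr(X)` for `np.corrcoef(X, rowvar=False)` gives the Pearson-based variant, so the number of retained components and the cluster sizes can be compared on the same data.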

Index Terms

Computer Science
Information Sciences

Keywords

Tukey's biweight, K-means, Principal Component Analysis