Research Article

Parallel K-Means Clustering for Gene Expression Data on SNOW

by Briti Deb, Satish Narayana Srirama
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 71 - Number 24
Year of Publication: 2013
DOI: 10.5120/12691-9486

Briti Deb, Satish Narayana Srirama. Parallel K-Means Clustering for Gene Expression Data on SNOW. International Journal of Computer Applications. 71, 24 (June 2013), 26-30. DOI=10.5120/12691-9486

@article{ 10.5120/12691-9486,
author = { Briti Deb and Satish Narayana Srirama },
title = { Parallel K-Means Clustering for Gene Expression Data on SNOW },
journal = { International Journal of Computer Applications },
issue_date = { June 2013 },
volume = { 71 },
number = { 24 },
month = { June },
year = { 2013 },
issn = { 0975-8887 },
pages = { 26-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume71/number24/12691-9486/ },
doi = { 10.5120/12691-9486 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:36:32.968805+05:30
%A Briti Deb
%A Satish Narayana Srirama
%T Parallel K-Means Clustering for Gene Expression Data on SNOW
%J International Journal of Computer Applications
%@ 0975-8887
%V 71
%N 24
%P 26-30
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The exponential growth in the amount of data brings new challenges for data analysis. Gene expression datasets are one such type of data, requiring analytical methods to mine the patterns implicit in them. Although clustering has been a popular way to analyze such datasets, their increasing size calls for more efficient clustering methods. In this paper, we study the use of Principal Components (PCs) as a pre-processing step to provide a more efficient data structure to a parallel formulation of the sequential K-Means algorithm, which utilizes the multiple cores available in a desktop computer via the Simple Network of Workstations (SNOW) package. Initial results suggest that the SNOW package provides an intuitive way for biologists to parallelize algorithms and speed up job execution, particularly for jobs like K-Means clustering, which depend on random starting centroid locations.
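
The idea in the abstract, that K-Means runs started from different random centroids are independent and can therefore be spread over several cores, can be sketched as follows. This is only an illustrative sketch and not the authors' implementation: it uses Python's standard library rather than R and the SNOW package, it assumes the rows of `data` have already been projected onto the leading PCs, and the names `kmeans_once`, `best_of_restarts`, and `parallel_best_of_restarts` are hypothetical.

```python
import random
from functools import partial
from multiprocessing import Pool


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(seed, data, k, max_iter=100):
    """Run Lloyd's K-Means once from a seeded random start.

    Returns (within-cluster sum of squares, centroids, labels).
    """
    rng = random.Random(seed)
    # Random starting centroids: k distinct rows of the data.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(data, labels))
    return wss, centroids, labels


def best_of_restarts(data, k, seeds, map_fn=map):
    """Run one K-Means per seed via map_fn and keep the lowest-WSS result."""
    runs = map_fn(partial(kmeans_once, data=data, k=k), seeds)
    return min(runs, key=lambda run: run[0])


def parallel_best_of_restarts(data, k, seeds, workers=4):
    """Same selection, with the restarts spread over worker processes.

    The default of 4 workers is an assumption; match it to the core count.
    """
    with Pool(processes=workers) as pool:
        return best_of_restarts(data, k, seeds, map_fn=pool.map)
```

Because every restart is seeded explicitly, the sequential and parallel versions should select the same run. On platforms that start workers by spawning a fresh interpreter, `parallel_best_of_restarts` needs to be called from under an `if __name__ == "__main__":` guard.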

Index Terms

Computer Science
Information Sciences

Keywords

SNOW, Parallel K-Means Clustering, Scalability Testing