Overview of K-means and Expectation Maximization Algorithm for Document Clustering

Research Article

Published in October 2014 by Bhagyashree Umale, Nilav M.

International Conference on Quality Up-gradation in Engineering, Science and Technology

Foundation of Computer Science USA

ICQUEST - Number 1

October 2014

Authors: Bhagyashree Umale, Nilav M.

Bhagyashree Umale, Nilav M. Overview of K-means and Expectation Maximization Algorithm for Document Clustering. International Conference on Quality Up-gradation in Engineering, Science and Technology. ICQUEST, 1 (October 2014), 5-8.

@article{
author = { Bhagyashree Umale, Nilav M. },
title = { Overview of K-means and Expectation Maximization Algorithm for Document Clustering },
journal = { International Conference on Quality Up-gradation in Engineering, Science and Technology },
issue_date = { October 2014 },
volume = { ICQUEST },
number = { 1 },
month = { October },
year = { 2014 },
issn = { 0975-8887 },
pages = { 5-8 },
numpages = { 4 },
url = { /proceedings/icquest/number1/18683-1510/ },
publisher = { Foundation of Computer Science (FCS), NY, USA },
address = { New York, USA }
}

%0 Proceeding Article
%1 International Conference on Quality Up-gradation in Engineering, Science and Technology
%A Bhagyashree Umale
%A Nilav M.
%T Overview of K-means and Expectation Maximization Algorithm for Document Clustering
%J International Conference on Quality Up-gradation in Engineering, Science and Technology
%@ 0975-8887
%V ICQUEST
%N 1
%P 5-8
%D 2014
%I International Journal of Computer Applications

Abstract

Advances in data collection and storage over the past decades have led to information overload in most sciences. Computer forensics is a new and fast-growing field that involves carefully collecting and examining electronic evidence, not only to assess the damage done to a computer by an electronic attack but also to recover lost information from the system in order to prosecute a criminal. Today the digital content involved in a crime is far from simple to read and interpret. It is increasingly a labyrinth of data, files, and folders that must be analyzed to advance investigations and solve criminal cases worldwide. In light of this, computer-based document clustering is a very important tool for the forensic analysis of digital content. It reduces much of the manual effort and redundancy and speeds up the resolution of criminal cases. Clustering works by processing multiple text files simultaneously. These files may contain very large amounts of raw text, which must be converted into a structured form before further crime analysis can be carried out. Huge volumes of data need to be analyzed, and this process may be slow if commercial and open-source forensic tools are used. In the early days, forensics was largely performed by computer professionals who worked with law enforcement on an ad hoc, case-by-case basis. Various experts have proposed many algorithms for this kind of data analysis. This survey draws on a study of prior work on document clustering methods for forensic analysis. In this paper, we aim to explain two partitional algorithms: K-means and its variant, the Expectation Maximization (EM) algorithm.
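As a rough illustration of the two steps the abstract describes (converting raw text into a structured form, then partitioning it), the following is a minimal Python sketch and not the authors' implementation. It assumes whitespace tokenization, unit-length term-frequency vectors, squared Euclidean distance, and hand-picked initial centroids; the function names `tf_vectors` and `kmeans` and the four sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Turn raw text into unit-length term-frequency vectors (assumed representation)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [float(counts[term]) for term in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, max_iter=20):
    """Plain K-means: hard-assign each vector to its nearest centroid, then re-average."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every document vector.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample documents, two about recovered email and two about fraud.
docs = [
    "deleted email recovered from disk",
    "email recovered from deleted folder",
    "bank transfer account fraud",
    "fraud bank account statement",
]
labels = kmeans(tf_vectors(docs), init_indices=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

K-means makes a hard assignment in each iteration. An EM-based mixture model would instead compute, for every document, a probability of membership in each cluster and use those probabilities as weights when updating the cluster parameters.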

Index Terms

Computer Science

Information Sciences

Keywords

Computer Forensics Analysis, Expectation-Maximization, K-means