International Conference on Quality Up-gradation in Engineering, Science and Technology |
Foundation of Computer Science USA |
ICQUEST - Number 1 |
October 2014 |
Authors: Bhagyashree Umale, Nilav M. |
Bhagyashree Umale, Nilav M. Overview of K-means and Expectation Maximization Algorithm for Document Clustering. International Conference on Quality Up-gradation in Engineering, Science and Technology. ICQUEST, 1 (October 2014), 5-8.
Advances in data collection and storage capabilities over the past decades have led to information overload in most sciences. Computer forensics is a new and fast-growing field that involves carefully collecting and examining electronic evidence, not only to assess the damage done to a computer by an electronic attack but also to recover lost information from such a system in order to prosecute a criminal. The digital content involved in a crime is no longer simple to read and interpret. It is increasingly a labyrinth of data, files, and folders that must be analyzed to advance an investigation and solve criminal cases worldwide. In light of this, computer-based document clustering is a very important tool for the forensic analysis of digital content. It reduces much of the manual effort and redundancy and speeds up the resolution of criminal cases. Clustering processes multiple text files simultaneously. These files may contain very large amounts of raw text, which must be converted into a structured form before further analysis can be carried out. Huge volumes of data need to be analyzed, and this process may be slow when commercial and open-source forensic tools are used. In its early days, forensics was largely performed by computer professionals who worked with law enforcement on an ad hoc, case-by-case basis. Various experts have proposed many algorithms for this kind of data analysis. This survey draws on a study of prior work on different document clustering methods for forensic analysis. In this paper, we aim to explain two partitional algorithms, namely k-means and its variant, the Expectation Maximization (EM) algorithm.
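As a rough illustration of the pipeline the abstract describes, the following minimal Python sketch converts raw text into a structured form (term-frequency vectors) and then partitions the documents with plain k-means. It is a sketch under stated assumptions, not the method evaluated in the paper: the sample documents, the helper names `tokenize`, `vectorize`, and `kmeans`, the Euclidean distance, and the deterministic choice of starting centroids are all hypothetical choices made for illustration.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(docs):
    """Turn raw documents into term-frequency vectors over a shared vocabulary."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[t] for t in vocab] for c in counts], vocab


def kmeans(vectors, k, init_indices, max_iter=100):
    """Plain k-means (Lloyd's iteration) with Euclidean distance.

    init_indices selects the documents used as starting centroids; this
    deterministic seeding is an assumption made for reproducibility.
    """
    centroids = [list(vectors[i]) for i in init_indices[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy corpus: two financial documents, two malware reports.
docs = [
    "invoice payment bank transfer",
    "bank transfer payment receipt",
    "malware trojan exploit payload",
    "exploit payload malware dropper",
]
vectors, vocab = vectorize(docs)
labels, _ = kmeans(vectors, k=2, init_indices=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Here k-means makes a hard assignment of each document to exactly one cluster. An EM-based approach would instead assign each document a probability of belonging to every cluster and re-estimate the cluster parameters from those soft assignments.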