International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 117 - Number 23
Year of Publication: 2015
Authors: Apeksha Khabia, M. B. Chandak
DOI: 10.5120/20697-3599
Apeksha Khabia, M. B. Chandak. A Cluster based Approach with N-grams at Word Level for Document Classification. International Journal of Computer Applications. 117, 23 (May 2015), 38-42. DOI=10.5120/20697-3599
The rapid progress of computers and the web makes it easy to collect and store large amounts of information in text form, e.g., reviews, forum postings, blogs, web pages, news articles, and email messages. In text mining, the growing size of text datasets and the high dimensionality of natural language make it difficult to classify documents into categories and sub-categories. This paper focuses on a cluster-based document classification technique in which the documents inside each cluster share some common trait. The common approach to document clustering is the bag-of-words (BOW) model, where individual words are the features. Because only single words are considered, some semantic information is always lost. We therefore use a vector space model based on word-level N-grams, which helps reduce this loss. The problem of high dimensionality is addressed with feature selection, by applying a threshold to the feature values of the vector space model. The vector space is then mapped into a modified one with latent semantic analysis (LSA), and the documents are clustered with the k-means algorithm. Experiments are performed on a Stack Exchange data set covering several categories, with R as the text mining tool for the implementation. The experimental results show that tri-grams give better clustering results than single words and bi-grams.
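As a rough illustration of the pipeline described in the abstract, the following minimal sketch builds word-level N-gram features, applies a threshold for feature selection, and clusters the resulting vectors with k-means. It is written in Python rather than R, it is not the authors' implementation, and it skips the LSA step entirely. The function names (`word_ngrams`, `build_matrix`, `kmeans`), the thresholding rule (total corpus count of an N-gram), the seeding of centroids from the first k rows, and the four toy documents are all assumptions made for the example.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return the word-level n-grams of a text as a list of tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_matrix(docs, n, min_total=2):
    """Build a document-by-n-gram count matrix.

    Assumed feature selection rule: keep an n-gram only if its total
    count across the corpus is at least min_total.
    """
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    vocab = sorted(g for g, t in totals.items() if t >= min_total)
    matrix = [[c[g] for g in vocab] for c in counts]
    return vocab, matrix

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=20):
    """Plain k-means; the first k rows seed the centroids (an assumption)."""
    centroids = [[float(x) for x in r] for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest centroid for every row.
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                  for r in rows]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

# Toy corpus (hypothetical): two documents about software, two about cooking.
docs = [
    "how to install python packages on linux",
    "best way to cook pasta sauce at home",
    "install python packages on linux with pip",
    "cook pasta sauce at home with garlic",
]

vocab, matrix = build_matrix(docs, n=2, min_total=2)
labels = kmeans(matrix, k=2)
print(len(vocab), labels)
```

Changing `n` to 1 or 3 switches the features to single words or tri-grams, which is the comparison the abstract reports. In a fuller version the thresholded matrix would be reduced with LSA before the k-means call.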