A Cluster based Approach with N-grams at Word Level for Document Classification

Apeksha Khabia; M. B. Chandak

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 117 - Number 23

Year of Publication: 2015

Authors: Apeksha Khabia, M. B. Chandak

10.5120/20697-3599

Apeksha Khabia, M. B. Chandak. A Cluster based Approach with N-grams at Word Level for Document Classification. International Journal of Computer Applications. 117, 23 (May 2015), 38-42. DOI=10.5120/20697-3599

@article{ 10.5120/20697-3599,
author = { Apeksha Khabia and M. B. Chandak },
title = { A Cluster based Approach with N-grams at Word Level for Document Classification },
journal = { International Journal of Computer Applications },
issue_date = { May 2015 },
volume = { 117 },
number = { 23 },
month = { May },
year = { 2015 },
issn = { 0975-8887 },
pages = { 38-42 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume117/number23/20697-3599/ },
doi = { 10.5120/20697-3599 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T23:00:13.255808+05:30
%A Apeksha Khabia
%A M. B. Chandak
%T A Cluster based Approach with N-grams at Word Level for Document Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 117
%N 23
%P 38-42
%D 2015
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The rapid progress of computers and the web has made it easy to collect and store large amounts of information as text, e.g., reviews, forum postings, blogs, web pages, news articles, and email messages. In text mining, the growing size of text datasets and the high dimensionality of natural language make it difficult to classify documents into categories and sub-categories. This paper focuses on a cluster-based document classification technique in which the documents inside each cluster share some common trait. The common approach to document clustering is the bag-of-words (BOW) model, in which individual words are the features. Because only single words are considered, some semantic information is lost. We therefore use a vector space model based on word-level N-grams, which helps to reduce this loss. The problem of high dimensionality is addressed with feature selection, by applying a threshold to the feature values of the vector space model. The vector space is then mapped into a modified one with latent semantic analysis (LSA), and the documents are clustered using the k-means algorithm. Experiments are performed on a Stack Exchange data set covering several categories, with R as the text mining tool. The experimental results show that tri-grams give better clustering results than single words and bi-grams.
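The first stages of the pipeline described in the abstract (word-level N-grams, a weighted vector space, and threshold-based feature selection) can be illustrated with a minimal sketch. This is not the authors' R implementation: the function names, the TF-IDF weighting, the toy documents, and the rule of keeping an N-gram when its maximum weight across documents reaches the threshold are all assumptions made for illustration. The LSA and k-means steps are omitted.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams of text as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n):
    """Build one {ngram: weight} dict per document (assumed TF-IDF weighting)."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for short docs
        vectors.append(
            {g: (f / total) * math.log(num_docs / df[g]) for g, f in c.items()}
        )
    return vectors


def select_features(vectors, threshold):
    """Keep n-grams whose maximum weight over all documents is >= threshold.

    The selection rule is a hypothetical choice; the abstract only says that
    a threshold is applied to the feature values.
    """
    best = {}
    for v in vectors:
        for g, w in v.items():
            best[g] = max(best.get(g, 0.0), w)
    keep = {g for g, w in best.items() if w >= threshold}
    return [{g: w for g, w in v.items() if g in keep} for v in vectors]


# Toy documents, invented for this example.
docs = [
    "how to install python packages",
    "how to install r packages",
    "best way to cook rice",
]
bigram_vectors = tfidf_vectors(docs, n=2)
reduced = select_features(bigram_vectors, threshold=0.2)
```

In this toy run, bi-grams shared by two documents (such as "how to") receive a low weight and fall below the threshold, while bi-grams unique to one document survive. The reduced vectors would then be the input to LSA and k-means.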

Index Terms

Computer Science

Information Sciences

Keywords

Document clustering, N-grams at word level, dimensionality reduction, Latent Semantic Analysis