Research Article

Document Clustering: A Review

by Sunita Bisht, Amit Paul
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 73 - Number 11
Year of Publication: 2013
DOI: 10.5120/12787-0024

Sunita Bisht, Amit Paul. Document Clustering: A Review. International Journal of Computer Applications. 73, 11 (July 2013), 26-33. DOI=10.5120/12787-0024

@article{ 10.5120/12787-0024,
author = { Sunita Bisht and Amit Paul },
title = { Document Clustering: A Review },
journal = { International Journal of Computer Applications },
issue_date = { July 2013 },
volume = { 73 },
number = { 11 },
month = { July },
year = { 2013 },
issn = { 0975-8887 },
pages = { 26-33 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume73/number11/12787-0024/ },
doi = { 10.5120/12787-0024 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:39:50.313568+05:30
%A Sunita Bisht
%A Amit Paul
%T Document Clustering: A Review
%J International Journal of Computer Applications
%@ 0975-8887
%V 73
%N 11
%P 26-33
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

As the volume of text documents on the internet grows rapidly, the need to group similar documents together for a variety of applications has held the attention of researchers in this area. Document clustering can facilitate document organization, web browsing, presentation of search engine results, corpus summarization, document classification, and information retrieval and filtering. Although several attempts have been made to develop efficient document clustering algorithms, most clustering methods still struggle with high dimensionality, scalability, accuracy, and the generation of meaningful cluster labels. This paper provides a brief summary of the methods studied and the current state of document clustering research, covering basic traditional methods as well as advanced techniques based on fuzzy logic, genetic algorithms (GA), particle swarm optimization (PSO), and harmony search (HS). It also discusses the document representation model and its challenges, dimensionality reduction mechanisms, issues in document clustering, and cluster quality evaluation criteria.

Index Terms

Computer Science
Information Sciences

Keywords

document clustering, hierarchical clustering, partitioning clustering, frequent item set, vector space model