Research Article

Evaluation of Clustering around Weighted Prototype and Genetic Algorithm for Document Categorization

by Garima Jain, Shailendra Kumar Shrivastava
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 125 - Number 14
Year of Publication: 2015
Authors: Garima Jain, Shailendra Kumar Shrivastava
10.5120/ijca2015906260

Garima Jain, Shailendra Kumar Shrivastava. Evaluation of Clustering around Weighted Prototype and Genetic Algorithm for Document Categorization. International Journal of Computer Applications. 125, 14 (September 2015), 21-27. DOI=10.5120/ijca2015906260

@article{ 10.5120/ijca2015906260,
author = { Garima Jain and Shailendra Kumar Shrivastava },
title = { Evaluation of Clustering around Weighted Prototype and Genetic Algorithm for Document Categorization },
journal = { International Journal of Computer Applications },
issue_date = { September 2015 },
volume = { 125 },
number = { 14 },
month = { September },
year = { 2015 },
issn = { 0975-8887 },
pages = { 21-27 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume125/number14/22500-2015906260/ },
doi = { 10.5120/ijca2015906260 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:16:02.994452+05:30
%A Garima Jain
%A Shailendra Kumar Shrivastava
%T Evaluation of Clustering around Weighted Prototype and Genetic Algorithm for Document Categorization
%J International Journal of Computer Applications
%@ 0975-8887
%V 125
%N 14
%P 21-27
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Document clustering plays an important role in text categorization. The genetic algorithm is an optimization-based technique that can be applied to find the best cluster centres by computing fitness values of data points. Clustering around weighted prototype, in contrast, is especially helpful when proper pairwise similarities are available. However, this technique does not find the global solution of the objective function. Experimental results show that the F-measure and normalized mutual information of the genetic algorithm are better than those of clustering around weighted prototype on the 20 Newsgroups dataset, and that the F-measure and accuracy of the genetic algorithm are better than those of clustering around weighted prototype on the Reuters-21578 dataset.
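
The idea of searching for cluster centres by scoring candidates with a fitness value can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: each chromosome holds k indices into the data, fitness is the negative sum of squared distances from every point to its nearest candidate centre, and the names `fitness` and `evolve_centres`, the tournament size, and the default parameters are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fitness(centres, points):
    """Assumed fitness: negative total squared distance to the nearest centre.

    Higher is better, so tighter clusters score closer to zero.
    """
    return -sum(min(sq_dist(p, c) for c in centres) for p in points)


def evolve_centres(points, k, pop_size=20, generations=50,
                   mutation_rate=0.2, seed=0):
    """Search for k cluster centres among the data points with a simple GA.

    Returns the best centres found and the best fitness per generation.
    """
    rng = random.Random(seed)
    n = len(points)

    def score(chrom):
        return fitness([points[i] for i in chrom], points)

    # Each chromosome is a list of k indices into `points`.
    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    best = max(population, key=score)
    history = [score(best)]

    for _ in range(generations):
        next_pop = [best[:]]  # elitism: carry the best chromosome over unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: the fitter of three random chromosomes wins.
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            # One-point crossover (with k == 1 the child is a copy of p2).
            cut = rng.randrange(1, k) if k > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Mutation: occasionally replace one gene with a random point index.
            if rng.random() < mutation_rate:
                child[rng.randrange(k)] = rng.randrange(n)
            next_pop.append(child)
        population = next_pop
        best = max(population, key=score)
        history.append(score(best))

    return [points[i] for i in best], history


if __name__ == "__main__":
    # Toy 2-D vectors standing in for document feature vectors.
    docs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centres, history = evolve_centres(docs, k=2)
    print("centres:", centres)
    print("best fitness, first and last generation:", history[0], history[-1])
```

Because the best chromosome is copied into each new generation, the best fitness in this sketch never decreases. A real document-clustering run would use term-weighted document vectors and a similarity measure suited to text, such as cosine similarity, in place of the toy points and squared Euclidean distance.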

Index Terms

Computer Science
Information Sciences

Keywords

Clustering, Similarity Based, Genetic Algorithm, Document Categorization, Text mining