Research Article

Text Document Clustering based on Phrase Similarity using Affinity Propagation

by Shailendra Kumar Shrivastava, J. L. Rana, R. C. Jain
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 61 - Number 18
Year of Publication: 2013
Authors: Shailendra Kumar Shrivastava, J. L. Rana, R. C. Jain
10.5120/10032-5077

Shailendra Kumar Shrivastava, J. L. Rana, R. C. Jain. Text Document Clustering based on Phrase Similarity using Affinity Propagation. International Journal of Computer Applications. 61, 18 (January 2013), 38-44. DOI=10.5120/10032-5077

@article{ 10.5120/10032-5077,
author = { Shailendra Kumar Shrivastava and J. L. Rana and R. C. Jain },
title = { Text Document Clustering based on Phrase Similarity using Affinity Propagation },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 61 },
number = { 18 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 38-44 },
numpages = {7},
url = { https://ijcaonline.org/archives/volume61/number18/10032-5077/ },
doi = { 10.5120/10032-5077 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:09:49.640786+05:30
%A Shailendra Kumar Shrivastava
%A J. L. Rana
%A R. C. Jain
%T Text Document Clustering based on Phrase Similarity using Affinity Propagation
%J International Journal of Computer Applications
%@ 0975-8887
%V 61
%N 18
%P 38-44
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Affinity propagation (AP) was recently introduced as an unsupervised learning algorithm for exemplar-based clustering. In this paper, a novel text document clustering algorithm is developed, based on the vector space model, phrases, and the affinity propagation clustering algorithm. The proposed algorithm is called Phrase Affinity Clustering (PAC). PAC first finds the phrases using Ukkonen's suffix tree construction algorithm; second, it builds the vector space model using the tf-idf weighting scheme over phrases; third, it calculates the similarity matrix from the VSD using cosine similarity. Finally, the affinity propagation algorithm generates the clusters. The F-measure, purity, and entropy of the proposed algorithm are better than those of GAHC, ST-GAHC, and ST-KNN on the OHSUMED, RCV1, and Newsgroup data sets.
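
The four stages named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: word n-grams stand in for the phrases that a Ukkonen suffix tree would yield, the preference (diagonal of the similarity matrix) is set to the median off-diagonal similarity, the damping factor and iteration count are arbitrary, and all function names (`phrases`, `tfidf_vectors`, `similarity_matrix`, `affinity_propagation`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from statistics import median

def phrases(text, max_len=3):
    """Word n-grams of length 1..max_len (assumed stand-in for suffix-tree phrases)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def tfidf_vectors(docs, max_len=3):
    """One sparse tf-idf vector (dict: phrase -> weight) per document."""
    counts = [Counter(phrases(d, max_len)) for d in docs]
    df = Counter(p for c in counts for p in c)  # document frequency of each phrase
    n = len(docs)
    return [{p: tf * math.log(n / df[p]) for p, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(vectors):
    """Pairwise cosine similarities; diagonal holds the preference (assumed: median)."""
    n = len(vectors)
    S = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    pref = median(S[i][j] for i in range(n) for j in range(n) if i != j)
    for i in range(n):
        S[i][i] = pref
    return S

def affinity_propagation(S, damping=0.5, iters=200):
    """Damped responsibility/availability message passing; returns exemplar index per point."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iters):
        for i in range(n):
            for k in range(n):
                # strongest competing candidate exemplar for point i
                best = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best)
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # positive support for k from points other than k
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    scores = [R[k][k] + A[k][k] for k in range(n)]
    exemplars = [k for k in range(n) if scores[k] > 0]
    if not exemplars:  # fallback (assumption): keep the single best-scoring point
        exemplars = [max(range(n), key=scores.__getitem__)]
    labels = [max(exemplars, key=lambda k: S[i][k]) for i in range(n)]
    for k in exemplars:
        labels[k] = k  # an exemplar always represents itself
    return labels

# Toy corpus (hypothetical): documents 1 and 4 each overlap with both neighbours.
docs = [
    "suffix tree construction algorithm",
    "suffix tree phrase clustering",
    "phrase clustering groups documents",
    "stock market prices fell",
    "stock market interest rates",
    "interest rates rose sharply",
]
labels = affinity_propagation(similarity_matrix(tfidf_vectors(docs)))
print(labels)  # exemplar index chosen for each document
```

Each entry of `labels` is the index of the document chosen as exemplar for that document, so documents sharing an exemplar form one cluster. Exactly tied similarities can leave affinity propagation without a clear exemplar; this sketch does not add the small random perturbation that practical implementations commonly use to break such ties.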

References
  1. Rui Xu and Donald C. Wunsch, "Clustering", IEEE Press, 2009, pp. 1-282.
  2. Jain, A. and Dubes, R., "Algorithms for Clustering Data", Englewood Cliffs, NJ: Prentice Hall, 1988.
  3. A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999, pp. 264-322.
  4. Rui Xu and Donald Wunsch, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, 2005, pp. 645.
  5. Frey, B. J. and Dueck, D., "Clustering by Passing Messages Between Data Points", Science, 2007, pp. 972–976.
  6. Kaijun Wang, Junying Zhang, Dan Li, Xinna Zhang, and Tao Guo, "Adaptive Affinity Propagation Clustering", Acta Automatica Sinica, 2007, pp. 1242-1246.
  7. Yancheng He, Qingcai Chen, Xiaolong Wang, Ruifeng Xu, Xiaohua Bai, and Xianjun Meng, "An Adaptive Affinity Propagation Document Clustering", 7th International Conference on Informatics and Systems (INFOS), 2010, pp. 1-7.
  8. Salton G., Wong A., and Yang C. S., 1975, "A Vector Space Model for Automatic Indexing," Comm. ACM, vol. 18, no. 11, pp. 613-620.
  9. Renchu Guan, Xiaohu Shi, Maurizio Marchese, Chen Yang, and Yanchun Liang, "Text Clustering with Seeds Affinity Propagation", IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 4, 2011, pp. 627-637.
  10. O. M. Oren Zamir, O. Etzioni, and R. M. Karp, "Fast and Intuitive Clustering of Web Documents," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD), 1997.
  11. Oren Zamir and Oren Etzioni, "Grouper: A Dynamic Clustering Interface to Web Search Results", Computer Networks (Amsterdam, Netherlands: 1999).
  12. Sven Meyer D. S., Eissen Zu., and Potthast M., 2005, "The Suffix Tree Document Model Revisited," Proc. Fifth Int'l Conf. Knowledge Management (I-Know '05), pp. 596-603.
  13. D. D. Lewis, Y. Yang, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," J. Machine Learning Research, vol. 5, pp. 361-397, 2004.
  14. Hung Chim and Xiaotie Deng, "A New Suffix Tree Similarity Measure for Document", Proceedings of the 16th International Conference on World Wide Web, ACM, New York, NY, USA, 2007, pp. 121-130.
  15. Chim H. and Deng X., 2008, "Efficient Phrase Based Document Similarity for Clustering", IEEE Trans. Knowledge and Data Engineering, vol. 20, no. 9.
  16. P. Weiner, "Fast and effective text mining using linear-time document clustering", in Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory, The University of Iowa, 1973.
  17. Edward M. McCreight, "A space-economical suffix tree construction algorithm", Journal of the ACM, 1976, p. 154.
  18. Esko Ukkonen, "On-line construction of suffix trees", Algorithmica, 1995, p. 260.
  19. Sven Meyer D. S., Eissen Zu., and Potthast M., 2005, "The Suffix Tree Document Model Revisited," Proc. Fifth Int'l Conf. Knowledge Management (I-Know '05), pp. 596-603.
  20. Hersh W., Buckley C., and Hickam D., 1994, "Ohsumed: An Interactive Retrieval Evaluation and New Large Test Collection for Research," Proc. 17th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '94), pp. 192-201.
Index Terms

Computer Science
Information Sciences

Keywords

text clustering, affinity propagation, unsupervised learning, vector space model, suffix tree, tf-idf weighting scheme, Purity, Entropy-measure, GAHC, ST-GAHC, ST-KNN