Text Document Clustering based on Phrase Similarity using Affinity Propagation

Shailendra Kumar Shrivastava; J. L. Rana; R. C. Jain

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 61 - Number 18

Year of Publication: 2013

Authors: Shailendra Kumar Shrivastava, J. L. Rana, R. C. Jain

DOI: 10.5120/10032-5077

Shailendra Kumar Shrivastava, J. L. Rana, R. C. Jain. Text Document Clustering based on Phrase Similarity using Affinity Propagation. International Journal of Computer Applications. 61, 18 (January 2013), 38-44. DOI=10.5120/10032-5077

@article{10.5120/10032-5077,
  author = {Shailendra Kumar Shrivastava and J. L. Rana and R. C. Jain},
  title = {Text Document Clustering based on Phrase Similarity using Affinity Propagation},
  journal = {International Journal of Computer Applications},
  issue_date = {January 2013},
  volume = {61},
  number = {18},
  month = {January},
  year = {2013},
  issn = {0975-8887},
  pages = {38--44},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume61/number18/10032-5077/},
  doi = {10.5120/10032-5077},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T21:09:49.640786+05:30
%A Shailendra Kumar Shrivastava
%A J. L. Rana
%A R. C. Jain
%T Text Document Clustering based on Phrase Similarity using Affinity Propagation
%J International Journal of Computer Applications
%@ 0975-8887
%V 61
%N 18
%P 38-44
%D 2013
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Affinity propagation (AP) was recently introduced as an unsupervised learning algorithm for exemplar-based clustering. This paper develops a novel text document clustering algorithm based on the vector space model, phrases, and the affinity propagation clustering algorithm. The proposed algorithm is called Phrase Affinity Clustering (PAC). PAC works in four steps. First, it finds phrases using Ukkonen's suffix tree construction algorithm. Second, it builds the vector space model using the tf-idf weighting scheme for phrases. Third, it calculates the similarity matrix from the vector space document (VSD) model using cosine similarity. Finally, the affinity propagation algorithm generates the clusters. The F-measure, purity, and entropy of the proposed algorithm are better than those of GAHC, ST-GAHC, and ST-KNN on the OHSUMED, RCV1, and Newsgroup data sets.
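
The four steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes several assumptions: word n-grams stand in for the suffix-tree phrase extraction, the tf-idf variant is tf × log(N/df), the preference defaults to the median off-diagonal similarity, and the message updates follow the standard damped responsibility/availability rules of affinity propagation. All function names and parameters are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def word_ngrams(text, max_len=3):
    """Stand-in phrase extractor (assumption): all word n-grams up to max_len."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]


def tfidf_vectors(docs_phrases):
    """One sparse tf-idf vector (dict: phrase -> weight) per document."""
    n_docs = len(docs_phrases)
    # Document frequency: number of documents containing each phrase.
    df = Counter(p for phrases in docs_phrases for p in set(phrases))
    return [{p: tf * math.log(n_docs / df[p])
             for p, tf in Counter(phrases).items()}
            for phrases in docs_phrases]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    return [[cosine(u, v) for v in vectors] for u in vectors]


def affinity_propagation(S, preference=None, damping=0.5, iters=200):
    """Return one exemplar index per point for similarity matrix S (n >= 2)."""
    n = len(S)
    if preference is None:
        preference = median(S[i][k] for i in range(n)
                            for k in range(n) if i != k)
    # The diagonal holds the preference of each point to be an exemplar.
    S = [[preference if i == k else S[i][k] for k in range(n)]
         for i in range(n)]
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iters):
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=vals.__getitem__)
            second = max(v for k, v in enumerate(vals) if k != best)
            for k in range(n):
                new = S[i][k] - (second if k == best else vals[best])
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]  # positive support from points i != k
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        raise ValueError("no exemplars found; adjust preference or damping")
    # Exemplars label themselves; other points join the most similar exemplar.
    return [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
            for i in range(n)]
```

Under these assumptions, a document collection would be clustered by calling `word_ngrams` on each document, passing the phrase lists to `tfidf_vectors`, building the matrix with `similarity_matrix`, and handing it to `affinity_propagation`. A lower `preference` tends to produce fewer clusters. The fixed iteration count is a simplification; a fuller implementation would stop once the exemplar set stays unchanged.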

Index Terms

Computer Science

Information Sciences

Keywords

text clustering, affinity propagation, unsupervised learning, vector space model, suffix tree, tf-idf weighting scheme, Purity, Entropy-measure, GAHC, ST-GAHC, ST-KNN