Hybrid Approach for Punjabi Text Clustering

Saurabh Sharma; Vishal Gupta

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 52 - Number 1

Year of Publication: 2012

Authors: Saurabh Sharma, Vishal Gupta

DOI: 10.5120/8167-1407

Saurabh Sharma, Vishal Gupta. Hybrid Approach for Punjabi Text Clustering. International Journal of Computer Applications. 52, 1 (August 2012), 32-36. DOI=10.5120/8167-1407

@article{ 10.5120/8167-1407,
author = { Saurabh Sharma and Vishal Gupta },
title = { Hybrid Approach for Punjabi Text Clustering },
journal = { International Journal of Computer Applications },
issue_date = { August 2012 },
volume = { 52 },
number = { 1 },
month = { August },
year = { 2012 },
issn = { 0975-8887 },
pages = { 32-36 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume52/number1/8167-1407/ },
doi = { 10.5120/8167-1407 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T20:51:12.249825+05:30
%A Saurabh Sharma
%A Vishal Gupta
%T Hybrid Approach for Punjabi Text Clustering
%J International Journal of Computer Applications
%@ 0975-8887
%V 52
%N 1
%P 32-36
%D 2012
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Text clustering is a text mining technique that uses a similarity measure to group similar documents into a single cluster and to place dissimilar documents into different clusters. Most popular clustering algorithms treat a document as a conglomeration of words and do not consider the syntactic or semantic relations between words. To overcome this drawback, some algorithms have been proposed that try to find connections among the words in a sentence using different concepts, e.g. Frequent Itemsets, Frequent Word Sequences, Frequent Word Meaning Sequences, and Ontology-based clustering. In this paper, we propose a hybrid algorithm for clustering Punjabi text documents, which uses semantic relations among the words in a sentence to extract phrases. The extracted phrases form a feature vector for each document, which is used to compute the similarity among all documents. Experimental results indicate that the hybrid algorithm is more reasonable and performs better on real-time data sets.
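The idea of representing each document by a phrase-based feature vector and comparing documents by similarity can be illustrated with a minimal sketch. This is not the authors' algorithm: the phrase extractor below is a hypothetical stand-in that uses contiguous word pairs rather than semantic relations, cosine similarity is assumed as the similarity measure, and the English sample sentences are placeholders for Punjabi text.

```python
import math
from collections import Counter


def extract_phrases(text, n=2):
    """Hypothetical stand-in extractor: contiguous word n-grams.
    The paper extracts phrases from semantic relations instead."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def feature_vector(text):
    """Phrase-frequency vector for one document."""
    return Counter(extract_phrases(text))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse phrase-count vectors
    (assumed measure; the abstract does not name one)."""
    dot = sum(count * b[phrase] for phrase, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder documents; real input would be Punjabi text.
docs = [
    "india won the cricket match",
    "the cricket match was won by india",
    "heavy rain flooded the city",
]
vectors = [feature_vector(d) for d in docs]

# Pairwise similarities; documents sharing phrases score higher.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine_similarity(vectors[i], vectors[j]), 3))
```

A clustering step would then group documents whose pairwise similarity is high; the choice of grouping procedure is left open here because the abstract does not specify it.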

Index Terms

Computer Science

Information Sciences

Keywords

Punjabi Text Clustering, Vector Space Model, Frequent Itemsets, Frequent Word Sequences, Karaka Theory, Ontology