Research Article

Hybrid Approach for Punjabi Text Clustering

by Saurabh Sharma, Vishal Gupta
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 52 - Number 1
Year of Publication: 2012
DOI: 10.5120/8167-1407

Saurabh Sharma, Vishal Gupta. Hybrid Approach for Punjabi Text Clustering. International Journal of Computer Applications. 52, 1 (August 2012), 32-36. DOI=10.5120/8167-1407

@article{ 10.5120/8167-1407,
author = { Saurabh Sharma and Vishal Gupta },
title = { Hybrid Approach for Punjabi Text Clustering },
journal = { International Journal of Computer Applications },
issue_date = { August 2012 },
volume = { 52 },
number = { 1 },
month = { August },
year = { 2012 },
issn = { 0975-8887 },
pages = { 32-36 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume52/number1/8167-1407/ },
doi = { 10.5120/8167-1407 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:51:12.249825+05:30
%A Saurabh Sharma
%A Vishal Gupta
%T Hybrid Approach for Punjabi Text Clustering
%J International Journal of Computer Applications
%@ 0975-8887
%V 52
%N 1
%P 32-36
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Text clustering is a text mining technique that uses a similarity measure to group similar documents into the same cluster and to place dissimilar documents into different clusters. Most popular clustering algorithms treat a document as a conglomeration of words and do not consider the syntactic or semantic relations between those words. To overcome this drawback, several algorithms have been proposed that try to find connections among the words in a sentence by using concepts such as Frequent Itemsets, Frequent Word Sequences, Frequent Word Meaning Sequences, and ontology-based clustering. In this paper, we propose a hybrid algorithm for clustering Punjabi text documents that uses the semantic relations among the words in a sentence to extract phrases. The extracted phrases form a feature vector for each document, and these vectors are used to compute the similarity among all documents. Experimental results indicate that the hybrid algorithm produces more reasonable clusters and performs better on real-time data sets.
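
The general idea of turning extracted phrases into feature vectors and comparing documents by similarity can be illustrated with a minimal sketch. This is not the authors' hybrid algorithm: the phrase extraction step is assumed to have already happened, cosine similarity stands in for whatever similarity measure the paper uses, and the greedy threshold clustering, the function names, the sample phrases, and the threshold value are all hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def phrase_vector(phrases):
    """Map a list of extracted phrases to a sparse phrase-frequency vector."""
    return Counter(phrases)


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * vec_b[phrase] for phrase, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_threshold(doc_phrases, threshold=0.5):
    """Greedy single-pass clustering (an assumed stand-in, not the paper's method).

    Each document joins the first cluster whose seed document (its first
    member) is at least `threshold` similar; otherwise it starts a new cluster.
    """
    vectors = {doc_id: phrase_vector(p) for doc_id, p in doc_phrases.items()}
    clusters = []
    for doc_id, vec in vectors.items():
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters


# Hypothetical phrases, assumed to come from an earlier extraction step.
docs = {
    "d1": ["cricket match", "world cup", "cricket match"],
    "d2": ["world cup", "cricket match"],
    "d3": ["election result", "state assembly"],
}

print(cluster_by_threshold(docs))
```

In this toy input, d1 and d2 share both of their phrases and d3 shares none, so a phrase-level vector separates the two topics without any word-level overlap analysis. A real system for Punjabi text would need the language-specific phrase extraction that the abstract describes.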

Index Terms

Computer Science
Information Sciences

Keywords

Punjabi Text Clustering, Vector Space Model, Frequent Itemsets, Frequent Word Sequences, Karaka Theory, Ontology