Research Article

Concept Space Derivation and its Application in Query Categorization

by Yashodhara Haribhakta, Parag Kulkarni
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 70 - Number 6
Year of Publication: 2013
10.5120/11966-7807

Yashodhara Haribhakta, Parag Kulkarni. Concept Space Derivation and its Application in Query Categorization. International Journal of Computer Applications. 70, 6 (May 2013), 14-22. DOI=10.5120/11966-7807

@article{ 10.5120/11966-7807,
author = { Yashodhara Haribhakta and Parag Kulkarni },
title = { Concept Space Derivation and its Application in Query Categorization },
journal = { International Journal of Computer Applications },
issue_date = { May 2013 },
volume = { 70 },
number = { 6 },
month = { May },
year = { 2013 },
issn = { 0975-8887 },
pages = { 14-22 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume70/number6/11966-7807/ },
doi = { 10.5120/11966-7807 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:32:09.645357+05:30
%A Yashodhara Haribhakta
%A Parag Kulkarni
%T Concept Space Derivation and its Application in Query Categorization
%J International Journal of Computer Applications
%@ 0975-8887
%V 70
%N 6
%P 14-22
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Automatic text categorization (also known as text classification or topic spotting) is the task of labeling natural language texts with thematic categories from a predefined set. Classifying text documents requires a set of features and a weighting model that assigns a relevance value to each feature. Currently, bag of words (BOW) is the most widely accepted text representation method. This representation has two major drawbacks: first, the number of features is very large; second, it captures no relatedness between words. A Topic Detection (TD) technique helps BOW address both drawbacks by detecting, in the concept space, the features most relevant to the document. However, existing TD techniques were not designed for text categorization and often involve high computational complexity and cost. This paper proposes a topic detection technique for relevant feature extraction. The TD technique extracts topics along with relevant features for each text document. It then finds the relatedness between features for each topic. The features extracted for each topic are tightly related to the topic and, accordingly, to the category label. The term frequency measure selects the appropriate features by computing a frequency count for each extracted feature under each category label. Thus, the TD technique supplies the classifiers with relevant features for classification. To evaluate the TD technique, a query categorization system is designed and proposed. The experiments were performed on three datasets (Reuters 21578, Ohsumed and 2G Scam). The experimental results for the TD technique show that the topics, along with the set of keywords, detected for documents are indeed relevant. Also, the query categorization system showed satisfactory performance in categorizing queries using the TD technique.
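
The term frequency selection step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `select_features_by_category`, the `top_k` cutoff, the alphabetical tie-breaking rule, and the toy input are all assumptions made for illustration. It only shows the general idea of counting how often each already-extracted feature occurs under each category label and keeping the most frequent ones.

```python
from collections import Counter, defaultdict


def select_features_by_category(docs, top_k=2):
    """Keep the top_k most frequent features for each category label.

    docs is an iterable of (category_label, extracted_features) pairs,
    where extracted_features is a list of feature strings that some
    earlier topic detection step is assumed to have produced.
    """
    counts = defaultdict(Counter)
    for label, features in docs:
        # Accumulate the frequency count of each feature under its label.
        counts[label].update(features)

    selected = {}
    for label, counter in counts.items():
        # Sort by descending count; break ties alphabetically (an assumption)
        # so the result is deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[label] = [term for term, _ in ranked[:top_k]]
    return selected


# Hypothetical toy input: labels and features are made up for illustration.
toy_docs = [
    ("grain", ["wheat", "export", "wheat"]),
    ("grain", ["wheat", "corn"]),
    ("crude", ["oil", "barrel", "oil"]),
]
selected = select_features_by_category(toy_docs, top_k=2)
print(selected)
```

In this sketch the selected features per label would then be handed to a classifier; how the paper weights or thresholds the counts is not stated in the abstract, so the fixed `top_k` cutoff here is only a placeholder.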

Index Terms

Computer Science
Information Sciences

Keywords

Concept Derivation, Topic Detection, BOW Representation, Performance Evaluation