International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 70 - Number 6
Year of Publication: 2013
Authors: Yashodhara Haribhakta, Parag Kulkarni
DOI: 10.5120/11966-7807
Yashodhara Haribhakta, Parag Kulkarni. Concept Space Derivation and its Application in Query Categorization. International Journal of Computer Applications. 70, 6 (May 2013), 14-22. DOI=10.5120/11966-7807
Automatic text categorization (also known as text classification or topic spotting) is the activity of labeling natural language texts with thematic categories from a predefined set. Classifying text documents requires a set of features and a weighting model that assigns a relevance value to each feature. Currently, bag of words (BOW) is the most widely accepted text representation method. This representation has two major drawbacks. First, the number of features is very large; second, it captures no relatedness between words. Topic detection (TD) techniques help BOW address both drawbacks by detecting features that are highly relevant to the document in the concept space. However, existing TD techniques were not designed for text categorization and often involve high computational complexity and cost. This paper proposes a topic detection technique for relevant feature extraction. The TD technique extracts topics, along with relevant features, for each text document. It then finds the relatedness between the features of each topic. The features extracted for each topic are tightly related to the topic and, accordingly, to the category label. A term frequency measure then selects the appropriate features by computing the frequency count of each extracted feature for each category label. In this way, the TD technique supplies the classifiers with relevant features for classification. To evaluate the TD technique, a query categorization system is designed and proposed. The experiments were performed on three datasets (Reuters 21578, Ohsumed and 2G Scam). The experimental results for the TD technique show that the topics, along with the sets of keywords, detected for the documents are indeed relevant. The query categorization system also showed satisfactory performance in categorizing queries using the TD technique.
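The term frequency selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `select_features_by_tf` and `categorize_query`, the `top_k` cutoff, the overlap-based scoring, and the toy documents are all assumptions made for illustration. The sketch takes candidate features as already extracted (for example, by a topic detection step) and keeps only the most frequent ones per category label.

```python
from collections import Counter, defaultdict


def select_features_by_tf(docs, top_k=2):
    """Keep the top_k most frequent candidate features per category.

    docs: iterable of (category_label, list_of_candidate_features).
    The candidate features are assumed to come from an earlier
    extraction step; this function only counts and ranks them.
    """
    counts = defaultdict(Counter)
    for label, features in docs:
        counts[label].update(features)  # frequency count per category label
    return {
        label: [term for term, _ in counter.most_common(top_k)]
        for label, counter in counts.items()
    }


def categorize_query(query_terms, selected):
    """Assign the category whose selected features overlap the query most.

    Hypothetical scoring rule (simple set overlap); returns None when
    no selected feature appears in the query.
    """
    scores = {
        label: len(set(query_terms) & set(features))
        for label, features in selected.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Toy data, invented for illustration only.
docs = [
    ("grain", ["wheat", "export", "tonnes"]),
    ("grain", ["wheat", "harvest", "export"]),
    ("crude", ["oil", "barrel", "opec"]),
    ("crude", ["oil", "opec", "price"]),
]

selected = select_features_by_tf(docs, top_k=2)
label = categorize_query(["opec", "price", "cut"], selected)
```

With this toy data, `selected` keeps `wheat` and `export` for `grain` and `oil` and `opec` for `crude`, so the example query is assigned to `crude`. A real system would replace the overlap score with a trained classifier over the selected features.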