International Journal of Computer Applications |
Foundation of Computer Science (FCS), NY, USA |
Volume 1 - Number 11 |
Year of Publication: 2010 |
Authors: M.Janaki Meena, K.R.Chandran, J.Mary Brinda, P.R.Sindhu |
10.5120/250-407 |
M.Janaki Meena, K.R.Chandran, J.Mary Brinda, P.R.Sindhu. Enhancing Feature Selection using Statistical Data with Unigrams and Bigrams. International Journal of Computer Applications. 1, 11 (February 2010), 7-11. DOI=10.5120/250-407
Feature selection is an essential preprocessing step for classifiers trained on a high-dimensional corpus. Features for text categorization include words, phrases, sentences, or distributions of words. Classifying documents into closely related categories is more complex than classifying them into unrelated categories. A feature selection algorithm based on chi-square statistics is proposed for the Naïve Bayes classifier. The proposed method identifies the features related to a class and determines the type of dependency between each feature and the category. In this paper, the method selects related phrases and words as features. The proposed method is compared with the conventional chi-square method. Experiments were conducted with randomly chosen training documents from one unrelated and five closely related categories of the 20 Newsgroups benchmark. The proposed method is observed to achieve better precision and recall.
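As an illustration of the ideas the abstract names, the following minimal Python sketch computes the conventional chi-square statistic for a term and a category from a 2x2 contingency table, and builds unigram and bigram candidate features from a token list. It is not the authors' proposed algorithm, which the abstract does not specify. The function names, the argument layout (`a`, `b`, `c`, `d`), and the use of the sign of `a*d - b*c` to label the dependency as positive or negative are assumptions made for this example.

```python
def unigrams_and_bigrams(tokens):
    """Return candidate features: single words plus adjacent word pairs."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


def chi_square(a, b, c, d):
    """Conventional chi-square statistic for a term and a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A zero margin means the table carries no evidence of dependency.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def dependency(a, b, c, d):
    """Label the direction of association (an assumption of this sketch)."""
    diff = a * d - b * c
    if diff > 0:
        return "positive"   # term occurs more often inside the category
    if diff < 0:
        return "negative"   # term occurs more often outside the category
    return "independent"


if __name__ == "__main__":
    print(unigrams_and_bigrams(["naive", "bayes", "classifier"]))
    print(chi_square(40, 10, 10, 40), dependency(40, 10, 10, 40))
```

In this sketch, a term with a high chi-square score and a positive dependency would be a candidate feature for the category, while a high score with a negative dependency indicates that the term's absence is the informative signal.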