Research Article

Enhancing Feature Selection using Statistical Data with Unigrams and Bigrams

by M.Janaki Meena, K.R.Chandran, J.Mary Brinda, P.R.Sindhu
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 1 - Number 11
Year of Publication: 2010
10.5120/250-407

M.Janaki Meena, K.R.Chandran, J.Mary Brinda, P.R.Sindhu. Enhancing Feature Selection using Statistical Data with Unigrams and Bigrams. International Journal of Computer Applications. 1, 11 (February 2010), 7-11. DOI=10.5120/250-407

@article{ 10.5120/250-407,
author = { M.Janaki Meena and K.R.Chandran and J.Mary Brinda and P.R.Sindhu },
title = { Enhancing Feature Selection using Statistical Data with Unigrams and Bigrams },
journal = { International Journal of Computer Applications },
issue_date = { February 2010 },
volume = { 1 },
number = { 11 },
month = { February },
year = { 2010 },
issn = { 0975-8887 },
pages = { 7-11 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume1/number11/250-407/ },
doi = { 10.5120/250-407 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
Abstract

Feature selection is an essential preprocessing step for classifiers trained on a high-dimensional corpus. Features for text categorization include words, phrases, sentences, or distributions of words. Classifying documents into closely related categories is considerably harder than classifying them into unrelated categories. A feature selection algorithm based on chi-square statistics is proposed for the Naïve Bayes classifier. The proposed method identifies the features related to a class and determines the type of dependency between each feature and the category. In this paper, the method selects related phrases and words as features. The proposed method is compared with the conventional chi-square method. Experiments were conducted with randomly chosen training documents from one unrelated and five closely related categories of the 20 Newsgroups benchmark. The proposed method is observed to achieve better precision and recall.
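
The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of the conventional chi-square statistic applied to unigram and bigram features, not the authors' algorithm. The function names (`unigrams_and_bigrams`, `chi_square`, `score_features`), the use of document-level presence counts, and the "positive"/"negative" labels for the type of dependency are all assumptions made for illustration.

```python
from collections import Counter


def unigrams_and_bigrams(tokens):
    """Return the set of unigram and bigram features present in one document."""
    feats = set(tokens)
    feats.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return feats


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 feature/category contingency table.

    a: documents in the category that contain the feature
    b: documents outside the category that contain the feature
    c: documents in the category that lack the feature
    d: documents outside the category that lack the feature
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of dependence
    return n * (a * d - b * c) ** 2 / denom


def score_features(docs, labels, category):
    """Map each feature to (chi-square score, dependency type) for one category.

    The dependency labels are an illustrative assumption: 'positive' means the
    feature appears in the category more often than independence predicts,
    'negative' means less often, and 'none' means no deviation.
    """
    n_in = sum(1 for y in labels if y == category)
    n_out = len(labels) - n_in
    in_counts, out_counts = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        target = in_counts if y == category else out_counts
        target.update(unigrams_and_bigrams(tokens))

    scores = {}
    for feat in set(in_counts) | set(out_counts):
        a, b = in_counts[feat], out_counts[feat]
        c, d = n_in - a, n_out - b
        if a * d > b * c:
            dependency = "positive"
        elif a * d < b * c:
            dependency = "negative"
        else:
            dependency = "none"
        scores[feat] = (chi_square(a, b, c, d), dependency)
    return scores
```

Under these assumptions, a selector could keep only the highest-scoring features with a positive dependency for each category before training a Naïve Bayes classifier; whether the paper filters features this way is not stated in the abstract.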

Index Terms

Computer Science
Information Sciences

Keywords

Text Classification, Naïve Bayes Classifier, Supervised learning, Feature selection