Research Article

A Comprehensive Survey on various Feature Selection Methods to Categorize Text Documents

by B. S. Harish, M. B. Revanasiddappa
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 164 - Number 8
Year of Publication: 2017
Authors: B. S. Harish, M. B. Revanasiddappa
10.5120/ijca2017913711

B. S. Harish, M. B. Revanasiddappa. A Comprehensive Survey on various Feature Selection Methods to Categorize Text Documents. International Journal of Computer Applications. 164, 8 (Apr 2017), 1-7. DOI=10.5120/ijca2017913711

@article{ 10.5120/ijca2017913711,
author = { B. S. Harish and M. B. Revanasiddappa },
title = { A Comprehensive Survey on various Feature Selection Methods to Categorize Text Documents },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2017 },
volume = { 164 },
number = { 8 },
month = { Apr },
year = { 2017 },
issn = { 0975-8887 },
pages = { 1-7 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume164/number8/27500-2017913711/ },
doi = { 10.5120/ijca2017913711 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:10:43.816940+05:30
%A B. S. Harish
%A M. B. Revanasiddappa
%T A Comprehensive Survey on various Feature Selection Methods to Categorize Text Documents
%J International Journal of Computer Applications
%@ 0975-8887
%V 164
%N 8
%P 1-7
%D 2017
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Feature selection is a well-known solution to the high-dimensionality problem in text categorization, where the selection of good features (terms) plays a very important role. It is a strategy that can be used to improve categorization accuracy, effectiveness and computational efficiency. This paper presents an empirical study of the most widely used feature selection methods, viz. Term Frequency-Inverse Document Frequency (tf-idf), Information Gain (IG), Mutual Information (MI), Chi-Square (χ²), Ambiguity Measure (AM), Term Strength (TS), Term Frequency-Relevance Frequency (tf-rf) and Symbolic Feature Selection (SFS), with five different classifiers (Naive Bayes, K-Nearest Neighbor, Centroid Based Classifier, Support Vector Machine and Symbolic Classifier). Experiments are carried out on standard benchmark datasets: Reuters-21578, 20-Newsgroups and the 4 Universities dataset.
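As an illustration of one of the methods named above, the following is a minimal sketch of Chi-Square feature selection using the standard two-by-two contingency formulation. It is not taken from the paper: the function names `chi_square` and `select_features`, the toy corpus `docs`, and the choice to take the maximum score over categories are all assumptions made for the example.

```python
def chi_square(term, category, docs):
    """Chi-square score of `term` for `category`.

    `docs` is a list of (set_of_tokens, label) pairs (a hypothetical toy format).
    """
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # in category, contains term
        elif has_term:
            b += 1          # outside category, contains term
        elif in_cat:
            c += 1          # in category, lacks term
        else:
            d += 1          # outside category, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or category does not vary; score it 0.
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, k):
    """Keep the k terms with the highest chi-square score over any category."""
    vocab = sorted({t for tokens, _ in docs for t in tokens})
    labels = {label for _, label in docs}
    # Assumption: combine per-category scores with max; averaging is another option.
    scores = {t: max(chi_square(t, c, docs) for c in labels) for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, invented for illustration only.
docs = [
    ({"stock", "market", "the"}, "finance"),
    ({"stock", "price", "the"}, "finance"),
    ({"match", "goal", "the"}, "sport"),
    ({"goal", "team", "the"}, "sport"),
]

print(select_features(docs, 2))
```

In this toy corpus a term such as "the", which occurs in every document, scores zero and is discarded, while terms confined to a single category score highest. This is the sense in which feature selection reduces dimensionality before a classifier is trained.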

Index Terms

Computer Science
Information Sciences

Keywords

High Dimensionality, Feature Selection, Classifiers, Text Categorization