Research Article

Survey Paper on Feature Extraction Methods in Text Categorization

by Dixa Saxena, S. K. Saritha, K. N. S. S. V. Prasad
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 166 - Number 11
Year of Publication: 2017
Authors: Dixa Saxena, S. K. Saritha, K. N. S. S. V. Prasad
10.5120/ijca2017914145

Dixa Saxena, S. K. Saritha, K. N. S. S. V. Prasad. Survey Paper on Feature Extraction Methods in Text Categorization. International Journal of Computer Applications. 166, 11 (May 2017), 11-17. DOI=10.5120/ijca2017914145

@article{ 10.5120/ijca2017914145,
author = { Dixa Saxena and S. K. Saritha and K. N. S. S. V. Prasad },
title = { Survey Paper on Feature Extraction Methods in Text Categorization },
journal = { International Journal of Computer Applications },
issue_date = { May 2017 },
volume = { 166 },
number = { 11 },
month = { May },
year = { 2017 },
issn = { 0975-8887 },
pages = { 11-17 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume166/number11/27712-2017914145/ },
doi = { 10.5120/ijca2017914145 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:13:24.923169+05:30
%A Dixa Saxena
%A S. K. Saritha
%A K. N. S. S. V. Prasad
%T Survey Paper on Feature Extraction Methods in Text Categorization
%J International Journal of Computer Applications
%@ 0975-8887
%V 166
%N 11
%P 11-17
%D 2017
%I Foundation of Computer Science (FCS), NY, USA
Abstract

As the world moves towards globalization, the volume of digitized text has grown rapidly, and organizing, categorizing and classifying that text has become a necessity. Poorly organized or sparsely categorized text can slow the response time of information retrieval. Text data also suffers from the ‘curse of dimensionality’ (as termed by Bellman) [1], namely the inherent sparsity of high-dimensional spaces, which makes it difficult to search such a space for structure that is not known in advance. Feature reduction methods address this problem: they obtain the most relevant information from the original data and represent it in a lower-dimensional space. This paper discusses the feature extraction methods that have been applied to text categorization, from the traditional bag-of-words model to the less conventional neural network approaches.
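
To make the sparsity problem concrete, the following is a minimal sketch, not a method taken from the paper. It builds bag-of-words count vectors for three made-up documents and then shrinks the feature space by document-frequency thresholding, a simple feature selection technique chosen here only for illustration. The function names, the `min_df` parameter and the sample documents are all assumptions.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?()[]\"'") for word in text.lower().split())
    return [t for t in tokens if t]


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of raw documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for term, count in Counter(doc).items():
            vec[index[term]] = count
        vectors.append(vec)
    return vocab, vectors


def sparsity(vectors):
    """Fraction of zero entries in the document-term matrix."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total if total else 0.0


def select_by_document_frequency(vocab, vectors, min_df=2):
    """Keep only the terms that occur in at least min_df documents."""
    keep = [j for j in range(len(vocab))
            if sum(1 for v in vectors if v[j] > 0) >= min_df]
    reduced_vocab = [vocab[j] for j in keep]
    reduced = [[v[j] for j in keep] for v in vectors]
    return reduced_vocab, reduced


# Hypothetical sample documents, used only to exercise the functions.
docs = [
    "text categorization needs good features",
    "feature reduction lowers dimensionality of text",
    "neural networks learn features from text",
]
vocab, X = bag_of_words(docs)
small_vocab, X_small = select_by_document_frequency(vocab, X, min_df=2)
print(len(vocab), "terms reduced to", len(small_vocab))
print("sparsity before:", round(sparsity(X), 2))
print("sparsity after:", round(sparsity(X_small), 2))
```

Even on this tiny corpus most entries of the full document-term matrix are zero, and dropping terms that appear in only one document leaves a much denser, lower-dimensional representation. Real systems would typically use a richer weighting such as TF-IDF and a more principled reduction step.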

References
  1. R. E. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, USA, 1957.
  2. Isabelle Guyon, André Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research 3 (2003) 1157-1182.
  3. PPV CJRS, Canadian Journal of Remote Sensing - 36(6):pp. 645-649; Comparison of feature extraction methods in dimensionality reduction, Electronic.
  4. Soumya George K, Shibily Joseph, Text Classification by Augmenting Bag of Words (BOW) Representation with Co-occurrence Feature, IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 1, Ver. V (Jan. 2014), pp. 34-38.
  5. Raghavan, V. V. and Wong, S. K. M. A critical analysis of vector space model for information retrieval. Journal of the American Society for Information Science, Vol.37 (5), p. 279-87, 1986.
  6. Salton, Gerard. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
  7. van Rijsbergen, C. J. Information retrieval. Butterworths, 1979.
  8. Luhn, H. P. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development 2 (2), p. 159-165 and 317, April 1958.
  9. Salton, Gerard. Automatic Text Processing. Addison-Wesley Publishing Company, 1988.
  10. G. Salton, A. Wong, C. S. Yang, A vector space model for automatic indexing, Communications of the ACM 18 (1975) 613–620.
  11. W. Shang, H. Huang, H. Zhu, Y. Lin, Y. Qu, Z. Wang, A novel feature selection algorithm for text categorization, Expert Systems with Applications 33 (2007) 1–5.
  12. X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: Proceedings of Neural Information Processing Systems, 2005, pp. 505–512.
  13. Y. Li, C. Luo, S. Chung, Text clustering with feature selection by using statistical data, IEEE Transactions on Knowledge and Data Engineering 20 (2008) 641–652.
  14. X. Wang, K. Paliwal, Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition, Pattern Recognition 36 (2003) 2429–2439.
  15. L. Shi, J. Zhang, E. Liu, P. He, Text classification based on nonlinear dimensionality reduction techniques and support vector machines, in: Proceedings of the Third International Conference on Natural Computation, 2007, pp. 674–677.
  16. W. Shang, H. Huang, H. Zhu, Y. Lin, Y. Qu, Z. Wang, A novel feature selection algorithm for text categorization, Expert Systems with Applications 33 (2007) 1–5.
  17. Y. Yang, Noise reduction in a statistical approach to text categorization, in: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’95), 1995, pp. 256–263.
  18. I. Kononenko, Estimating attributes: analysis and extensions of relief, in: Proc. European Conference on Machine Learning, Springer-Verlag, 1994, pp. 171–182.
  19. K. Kira, L. Rendell, The feature selection problem: traditional methods and a new algorithm, in: Association for the Advancement of Artificial Intelligence, AAAI Press and MIT Press, 1992, pp. 129–134.
  20. L. Liu, J. Kang, J. Yu, Z. Wang, A comparative study on unsupervised feature selection methods for text clustering, in: IEEE International Conference on Natural Language Processing and Knowledge Engineering, 2005, pp. 597–601.
  21. H. Uguz, A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm, Knowledge-Based Systems 24 (2011) 1024–1032.
  22. H. Uguz, A hybrid system based on information gain and principal component analysis for the classification of transcranial Doppler signals, Computer Methods and Programs in Biomedicine 107 (2012) 598–609.
  23. J. Meng, H. Lin, Y. Yu, A two-stage feature selection method for text categorization, Computers & Mathematics with Applications 62 (2011) 2793–2800.
  24. H. Hsu, C. Hsieh, M. Lu, Hybrid feature selection by combining filters and wrappers, Expert Systems with Applications 38 (2011) 8144–8150.
  25. A. Akadi, A. Amine, A. Ouardighi, D. Aboutajdine, A two-stage gene selection scheme utilizing MRMR filter and GA wrapper, Knowledge and Information Systems 26 (2011) 487–500.
  26. Kusum Kumari Bharti, P. K. Singh, A three-stage unsupervised dimension reduction method for text clustering, Journal of Computational Science, 2013.
  27. K. Pearson, On lines and planes of closest fit to systems of points in space, Philosophical Magazine 1 (1901) 559–572.
  28. J. Shlens, A tutorial on principal component analysis, Systems Neurobiology Laboratory, University of California at San Diego, 2005.
  29. Jianchang Mao and Anil K. Jain, Artificial Neural Networks for Feature Extraction and Multivariate Data Projection, IEEE Transactions on Neural Networks, vol. 6, no. 2, March 1995.
  30. R. L. Hoffman and A. K. Jain, “Segmentation and Classification of Range Images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-9, no. 5, pp. 608-620, 1987.
  31. K. Hornik and C.-M. Kuan, “Convergence analysis of local feature extraction algorithm,” Neural Networks, vol. 5, pp. 229-240, 1992.
  32. W. Y. Huang and R. P. Lippmann, “Comparisons between neural net and traditional classifiers,” in IEEE 1st Int. Conf. Neural Networks, San Diego, CA, June 1987, pp. IV-485-IV-493.
  33. P. J. Huber, “Projection pursuit,” Ann. Statist., vol. 13, pp. 435-475, 1985.
  34. Rajat K. De, Jayanta Basak, Sankar K. Pal, Unsupervised feature extraction using neuro-fuzzy approach, Fuzzy Sets and Systems 126 (2002) 277–291.
  35. Xiang Zhang, Junbo Zhao, Yann LeCun, Character-level Convolutional Networks for Text Classification.
  36. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
  37. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. 2013.
  38. Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October 2014, Association for Computational Linguistics.
  39. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, Nov. 2011.
  40. A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
  41. K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.
  42. R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML 2013, volume 28 of JMLR Proceedings, pages 1310–1318. JMLR.org, 2013.
  43. Xiang Zhang, Yann LeCun. Text Understanding from Scratch. arXiv 1502.01710.
  44. M. Ramya, J. Alwin Pinakas, Different Type of Feature Selection for Text Classification, International Journal of Computer Trends and Technology (IJCTT) – volume 10 number 2 – Apr 2014.
  45. Asir Antony Gnana Singh Danasingh, Jebamalar Leavline Epiphany, Feature Extraction for Document Classification, www.researchgate.net/publication/276950476, 2015.
  46. Zena M. Hira and Duncan F. Gillies, A Review of Feature Selection and Feature Extraction Methods Applied on Microarray Data, Advances in Bioinformatics Volume 2015 (2015), Article ID 198363, 13 pages.
  47. Sandya H. B., Hemanth Kumar P., Himanshi Bhudiraja, Susham K. Rao, Fuzzy Rule Based Feature Extraction and Classification of Time Series Signal, International Journal of Soft Computing and Engineering (IJSCE) ISSN: 2231-2307, Volume-3, Issue-2, May 2013.
  48. Hoang Vu Nguyen, Vivekanand Gopalkrishnan, Feature Extraction for Outlier Detection in High-Dimensional Spaces, Journal of Machine Learning Research, Volume 10, Issue 2, 2010, pp. 252-262.
Index Terms

Computer Science
Information Sciences

Keywords

Bag of words algorithm