Research Article

A Comparative Study on using Principle Component Analysis with different Text Classifiers

by D. A. Eisa, Ahmed I. Taloba, Safaa S. I. Ismail
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 180 - Number 31
Year of Publication: 2018
Authors: D. A. Eisa, Ahmed I. Taloba, Safaa S. I. Ismail
DOI: 10.5120/ijca2018916800

D. A. Eisa, Ahmed I. Taloba, Safaa S. I. Ismail. A Comparative Study on using Principle Component Analysis with different Text Classifiers. International Journal of Computer Applications. 180, 31 (Apr 2018), 1-6. DOI=10.5120/ijca2018916800

@article{ 10.5120/ijca2018916800,
author = { D. A. Eisa and Ahmed I. Taloba and Safaa S. I. Ismail },
title = { A Comparative Study on using Principle Component Analysis with different Text Classifiers },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2018 },
volume = { 180 },
number = { 31 },
month = { Apr },
year = { 2018 },
issn = { 0975-8887 },
pages = { 1-6 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume180/number31/29239-2018916800/ },
doi = { 10.5120/ijca2018916800 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T01:02:19.383703+05:30
%A D. A. Eisa
%A Ahmed I. Taloba
%A Safaa S. I. Ismail
%T A Comparative Study on using Principle Component Analysis with different Text Classifiers
%J International Journal of Computer Applications
%@ 0975-8887
%V 180
%N 31
%P 1-6
%D 2018
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Text categorization (TC) is the task of automatically organizing a set of documents into a set of pre-defined categories. Over the last few years, increasing attention has been paid to documents in digital form, which has made text categorization a challenging problem. The most significant difficulty in text categorization is the huge number of features. Most of these features are redundant, noisy, or irrelevant, which causes overfitting in most classifiers. Hence, feature extraction is an important step for improving the overall accuracy and performance of text classifiers. In this paper, we provide an overview of using principal component analysis (PCA) as a feature extraction method with various classifiers. We observed that classifier performance improved after PCA was used to reduce the dimensionality of the data. Experiments are conducted on three UCI data sets: Classic03, CNAE-9, and DBWorld e-mails. We compare the classification performance obtained when PCA is combined with popular, well-known text classifiers. The results show that using PCA enhances classification performance for most of the classifiers.
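
The general idea of using PCA as a feature extraction step before a classifier can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy corpus, the helper names (bag_of_words, fit_pca, transform, predict_1nn), the use of NumPy, and the choice of a 1-nearest-neighbour classifier. It does not reproduce the paper's data sets, preprocessing, or experimental setup.

```python
import numpy as np

def bag_of_words(docs, vocab=None):
    """Return (term-count matrix, vocabulary) for whitespace-tokenized docs."""
    tokens = [d.lower().split() for d in docs]
    if vocab is None:
        vocab = sorted({t for doc in tokens for t in doc})
    index = {term: j for j, term in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokens):
        for t in doc:
            if t in index:  # terms unseen during training are ignored
                X[i, index[t]] += 1.0
    return X, vocab

def fit_pca(X, n_components):
    """Fit PCA via SVD of the mean-centered data; return (mean, components)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]  # rows are the principal directions

def transform(X, mean, components):
    """Project rows of X onto the principal directions."""
    return (X - mean) @ components.T

def predict_1nn(train_Z, train_y, test_Z):
    """Give each test row the label of its nearest training row (Euclidean)."""
    dists = np.linalg.norm(test_Z[:, None, :] - train_Z[None, :, :], axis=2)
    return [train_y[i] for i in dists.argmin(axis=1)]

# Hypothetical toy corpus with two categories.
train_docs = [
    "stock market prices fall",
    "market shares and stock trading",
    "investors watch stock prices",
    "team wins football match",
    "football players train for match",
    "coach praises team players",
]
train_y = ["finance", "finance", "finance", "sport", "sport", "sport"]
test_docs = ["stock trading prices", "football team match"]

X_train, vocab = bag_of_words(train_docs)
X_test, _ = bag_of_words(test_docs, vocab)

# Reduce the 18 term features to 2 principal components,
# fitting PCA on the training documents only.
mean, components = fit_pca(X_train, n_components=2)
Z_train = transform(X_train, mean, components)
Z_test = transform(X_test, mean, components)

predictions = predict_1nn(Z_train, train_y, Z_test)
print(X_train.shape, "->", Z_train.shape)
print(predictions)
```

PCA is fitted on the training documents only, and the same mean and components are then applied to the test documents, so no information from the test set leaks into the extracted features. Any other classifier could take the place of the nearest-neighbour step once the documents are in the reduced space.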

Index Terms

Computer Science
Information Sciences

Keywords

Text Categorization, Dimension Reduction, Feature Extraction, Principal Component Analysis, Classifiers