A Comparative Study on using Principle Component Analysis with different Text Classifiers

D. A. Eisa; Ahmed I. Taloba; Safaa S. I. Ismail

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 180 - Number 31

Year of Publication: 2018

10.5120/ijca2018916800

D. A. Eisa, Ahmed I. Taloba, Safaa S. I. Ismail. A Comparative Study on using Principle Component Analysis with different Text Classifiers. International Journal of Computer Applications. 180, 31 (Apr 2018), 1-6. DOI=10.5120/ijca2018916800

@article{ 10.5120/ijca2018916800,
author = { D. A. Eisa and Ahmed I. Taloba and Safaa S. I. Ismail },
title = { A Comparative Study on using Principle Component Analysis with different Text Classifiers },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2018 },
volume = { 180 },
number = { 31 },
month = { Apr },
year = { 2018 },
issn = { 0975-8887 },
pages = { 1-6 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume180/number31/29239-2018916800/ },
doi = { 10.5120/ijca2018916800 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-07T01:02:19.383703+05:30
%A D. A. Eisa
%A Ahmed I. Taloba
%A Safaa S. I. Ismail
%T A Comparative Study on using Principle Component Analysis with different Text Classifiers
%J International Journal of Computer Applications
%@ 0975-8887
%V 180
%N 31
%P 1-6
%D 2018
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Text categorization (TC) is the task of automatically assigning documents to a set of pre-defined categories. In recent years, the growing volume of documents in digital form has made text categorization a challenging problem. Its most significant difficulty is the very large number of features. Most of these features are redundant, noisy, or irrelevant, which causes overfitting in most classifiers. Feature extraction is therefore an important step for improving the accuracy and overall performance of text classifiers. In this paper, we provide an overview of using principal component analysis (PCA) as a feature extraction method with various classifiers. We observed that classifier performance improved after PCA was used to reduce the dimensionality of the data. Experiments are conducted on three UCI data sets: Classic03, CNAE-9, and DBWorld e-mails. We compare the classification performance obtained when PCA is combined with popular, well-known text classifiers. The results show that using PCA improves classification performance for most of the classifiers.
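
As a rough illustration of the pipeline the abstract describes (reduce term features with PCA, then classify in the reduced space), the following is a minimal sketch in plain Python. It is not the authors' implementation: the six-word vocabulary, the toy term counts, the labels "ml" and "sport", the helper names `fit_pca`, `transform`, and `predict_1nn`, and the choice of a 1-nearest-neighbour classifier are all assumptions made for illustration, since the abstract names neither the classifiers nor the PCA settings used. PCA is computed here by power iteration with deflation on the sample covariance matrix, which is adequate for a tiny example but not for real text collections.

```python
import math
import random


def _unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fit_pca(rows, k, iters=500, seed=0):
    """Return (column means, top-k directions, explained variance ratios)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centered term counts.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))
    rng = random.Random(seed)
    components, explained = [], []
    for _ in range(k):
        # Power iteration: repeated multiplication by the covariance matrix
        # pulls the vector toward the leading eigenvector.
        v = _unit([rng.random() + 0.1 for _ in range(d)])
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            if math.sqrt(sum(x * x for x in w)) < 1e-12:
                break
            v = _unit(w)
        # Rayleigh quotient approximates the eigenvalue for direction v.
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                  for i in range(d))
        components.append(v)
        explained.append(lam / total_var)
        # Deflation: remove the found direction before searching for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return means, components, explained


def transform(row, means, components):
    """Project one document vector onto the principal directions."""
    return [sum((row[j] - means[j]) * c[j] for j in range(len(row)))
            for c in components]


def predict_1nn(train_points, train_labels, point):
    """Label of the closest training point (Euclidean distance)."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(train_points[i], point))
    return train_labels[best]


# Hypothetical toy corpus: term counts over an invented vocabulary
# ["pca", "svm", "kernel", "goal", "match", "team"].
train_docs = [
    [3, 2, 1, 0, 0, 0], [2, 3, 2, 0, 0, 0], [1, 2, 3, 0, 1, 0],
    [0, 0, 0, 3, 2, 1], [0, 0, 0, 2, 3, 2], [0, 1, 0, 1, 2, 3],
]
train_labels = ["ml", "ml", "ml", "sport", "sport", "sport"]
test_docs = [[2, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 2]]

# Reduce six term features to two components, then classify in that space.
means, components, explained = fit_pca(train_docs, k=2)
train_2d = [transform(doc, means, components) for doc in train_docs]
predictions = [predict_1nn(train_2d, train_labels,
                           transform(doc, means, components))
               for doc in test_docs]
print(predictions)
print([round(ratio, 3) for ratio in explained])
```

The test documents are projected with the means and directions learned from the training documents only, so no information from the test set leaks into the reduced representation. For data sets of realistic size, a library PCA based on singular value decomposition would normally replace the power iteration used here.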

Index Terms

Computer Science

Information Sciences

Keywords

Text Categorization, Dimension Reduction, Feature Extraction, Principal Component Analysis, Classifiers