Research Article

On the Impact of Dataset Characteristics on Arabic Document Classification

by Diab Abuaiadah, Jihad El Sana, Walid Abusalah
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 101 - Number 7
Year of Publication: 2014
Authors: Diab Abuaiadah, Jihad El Sana, Walid Abusalah
DOI: 10.5120/17701-8680

Diab Abuaiadah, Jihad El Sana, Walid Abusalah. On the Impact of Dataset Characteristics on Arabic Document Classification. International Journal of Computer Applications. 101, 7 (September 2014), 31-38. DOI=10.5120/17701-8680

@article{ 10.5120/17701-8680,
author = { Diab Abuaiadah and Jihad El Sana and Walid Abusalah },
title = { On the Impact of Dataset Characteristics on Arabic Document Classification },
journal = { International Journal of Computer Applications },
issue_date = { September 2014 },
volume = { 101 },
number = { 7 },
month = { September },
year = { 2014 },
issn = { 0975-8887 },
pages = { 31-38 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume101/number7/17701-8680/ },
doi = { 10.5120/17701-8680 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:31:04.519543+05:30
%A Diab Abuaiadah
%A Jihad El Sana
%A Walid Abusalah
%T On the Impact of Dataset Characteristics on Arabic Document Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 101
%N 7
%P 31-38
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper describes the impact of dataset characteristics on the results of Arabic document classification algorithms that use TF-IDF representations. The experiments compared different stemmers, categories, and training set sizes, and found that different dataset characteristics produced widely differing results, in one case attaining a remarkable 99% recall (accuracy). Using a standard dataset would eliminate this variability and make published results comparable across studies.
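As an illustration of the two quantities the abstract refers to, the following is a minimal sketch of one common TF-IDF variant (raw term frequency multiplied by log(N/df)) and of per-class recall. It is not the authors' implementation: the exact weighting scheme, stemmers, and classifiers used in the paper are not stated here, and the function names `tfidf_vectors` and `recall_per_class` as well as the toy tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (one common variant, not necessarily the paper's):
    weight = raw term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def recall_per_class(y_true, y_pred):
    """Recall per class: correctly predicted members / actual members."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical, already-stemmed toy documents (placeholder tokens).
docs = [["sport", "match", "team"], ["sport", "economy", "bank"]]
vectors = tfidf_vectors(docs)

# Hypothetical gold labels and predictions for three documents.
recalls = recall_per_class(["sports", "sports", "economy"],
                           ["sports", "economy", "economy"])
```

In this sketch a term that occurs in every document (here "sport") receives weight zero, which is why the choice of stemmer and of categories can change the resulting vectors, and hence the measured recall, considerably.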

Index Terms

Computer Science
Information Sciences

Keywords

Dataset, TF-IDF representation, Arabic Stemmers, Arabic document classification