Research Article

Improving Arabic Text Categorization using Normalization and Stemming Techniques

by Rouhia M. Sallam, Hamdy M. Mousa, Mahmoud Hussein
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 135 - Number 2
Year of Publication: 2016
Authors: Rouhia M. Sallam, Hamdy M. Mousa, Mahmoud Hussein
10.5120/ijca2016908328

Rouhia M. Sallam, Hamdy M. Mousa, Mahmoud Hussein. Improving Arabic Text Categorization using Normalization and Stemming Techniques. International Journal of Computer Applications. 135, 2 (February 2016), 38-43. DOI=10.5120/ijca2016908328

@article{ 10.5120/ijca2016908328,
author = { Rouhia M. Sallam and Hamdy M. Mousa and Mahmoud Hussein },
title = { Improving Arabic Text Categorization using Normalization and Stemming Techniques },
journal = { International Journal of Computer Applications },
issue_date = { February 2016 },
volume = { 135 },
number = { 2 },
month = { February },
year = { 2016 },
issn = { 0975-8887 },
pages = { 38-43 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume135/number2/24025-2016908328/ },
doi = { 10.5120/ijca2016908328 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:35:47.124566+05:30
%A Rouhia M. Sallam
%A Hamdy M. Mousa
%A Mahmoud Hussein
%T Improving Arabic Text Categorization using Normalization and Stemming Techniques
%J International Journal of Computer Applications
%@ 0975-8887
%V 135
%N 2
%P 38-43
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Text categorization is a technique for assigning documents, based on their contents, to one or more predefined categories. Achieving the highest categorization accuracy remains one of the major challenges, and it is also time-consuming. We propose an approach to tackle these challenges. The proposed approach uses the Frequency Ratio Accumulation Method (FRAM) as a classifier. Features are represented using the bag-of-words technique, and an improved Term Frequency (TF) technique is used for feature selection. The proposed approach is tested on known datasets. The experiments are run without normalization or stemming, with one of them, and with both. The results obtained with the proposed approach are generally better than those of existing techniques. Four performance measures of the proposed Arabic text categorization approach were considered: accuracy, recall, precision, and F-measure (F1). Using normalization, the averages of the obtained results are 97.50%, 97.50%, 97.51%, and 97.49%, respectively.
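
To make the pipeline named in the abstract concrete, the following is a minimal sketch of one common formulation of FRAM over bag-of-words features, preceded by a simple normalization step. It is not the authors' implementation. The normalization rules (stripping diacritics and tatweel, unifying alef forms), the function names `normalize`, `train_fram`, and `classify`, and the toy English training data are all assumptions made for illustration. The improved TF feature selection and the stemming step are omitted.

```python
import re
from collections import Counter, defaultdict

# Assumed normalization rules (not taken from the paper): remove Arabic
# diacritics (U+064B..U+0652) and tatweel (U+0640), and map the alef
# variants U+0622, U+0623, U+0625 to bare alef U+0627.
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
_ALEF_FORMS = re.compile("[\u0622\u0623\u0625]")


def normalize(text):
    """Return text with diacritics removed and alef forms unified."""
    text = _DIACRITICS.sub("", text)
    return _ALEF_FORMS.sub("\u0627", text)


def train_fram(docs):
    """Compute frequency ratios from (tokens, category) pairs.

    The ratio for a term and a category is the term's frequency in that
    category divided by its total frequency across all categories.
    Returns {term: {category: ratio}}.
    """
    freq = defaultdict(Counter)  # term -> category -> raw count
    for tokens, category in docs:
        for tok in tokens:
            freq[tok][category] += 1
    ratios = {}
    for term, per_cat in freq.items():
        total = sum(per_cat.values())
        ratios[term] = {c: n / total for c, n in per_cat.items()}
    return ratios


def classify(ratios, tokens):
    """Accumulate ratios per category and return the highest-scoring one.

    Returns None when no token was seen during training.
    """
    scores = Counter()
    for tok in tokens:
        for category, ratio in ratios.get(tok, {}).items():
            scores[category] += ratio
    return scores.most_common(1)[0][0] if scores else None


# Hypothetical toy corpus; real experiments would use Arabic documents
# passed through normalize() and a tokenizer first.
train = [
    ("goal match team".split(), "sport"),
    ("team player goal".split(), "sport"),
    ("market bank price".split(), "economy"),
    ("bank price team".split(), "economy"),
]
model = train_fram(train)
print(classify(model, "goal team".split()))   # sport
print(classify(model, "bank price".split()))  # economy
```

Because training only counts term occurrences per category and classification only sums precomputed ratios, both steps are linear in the number of tokens, which is consistent with the abstract's concern about time cost.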

Index Terms

Computer Science
Information Sciences

Keywords

Arabic text categorization, Frequency ratio accumulation method (FRAM), Bag-of-Words (BOW), Feature selection, Term and document frequency