Research Article

A Comparative Study on different Machine Learning Approaches for Categorizing Bangla Documents

by Abu Jafar Md Jakaria, Rajarshi Roy Chowdhury, Jaima Jaman Konia, Debashish Roy, Nishat Tasnim Ahmed Meem
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 61
Year of Publication: 2025
10.5120/ijca2025924391

Abu Jafar Md Jakaria, Rajarshi Roy Chowdhury, Jaima Jaman Konia, Debashish Roy, Nishat Tasnim Ahmed Meem. A Comparative Study on different Machine Learning Approaches for Categorizing Bangla Documents. International Journal of Computer Applications. 186, 61 (Jan 2025), 32-39. DOI=10.5120/ijca2025924391

@article{ 10.5120/ijca2025924391,
author = { Abu Jafar Md Jakaria and Rajarshi Roy Chowdhury and Jaima Jaman Konia and Debashish Roy and Nishat Tasnim Ahmed Meem },
title = { A Comparative Study on different Machine Learning Approaches for Categorizing Bangla Documents },
journal = { International Journal of Computer Applications },
issue_date = { Jan 2025 },
volume = { 186 },
number = { 61 },
month = { Jan },
year = { 2025 },
issn = { 0975-8887 },
pages = { 32-39 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number61/a-comparative-study-on-different-machine-learning-approaches-for-categorizing-bangla-documents/ },
doi = { 10.5120/ijca2025924391 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2025-01-28T19:07:03.930021+05:30
%A Abu Jafar Md Jakaria
%A Rajarshi Roy Chowdhury
%A Jaima Jaman Konia
%A Debashish Roy
%A Nishat Tasnim Ahmed Meem
%T A Comparative Study on different Machine Learning Approaches for Categorizing Bangla Documents
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 61
%P 32-39
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Document categorization (DC) is the task of automatically determining the category of a document within a reasonable timeframe. It is essential for efficient information retrieval, organization, and analysis: it enables quick identification of relevant documents, supports effective search, and streamlines decision-making. This paper presents a comparative analysis of nine well-known supervised machine learning (ML) approaches for automatic Bengali document categorization: random forest (RF), k-nearest neighbors (KNN), support vector machine (SVM), decision tree (DT), Bernoulli naïve Bayes (BNB), complement naïve Bayes (CNB), multinomial naïve Bayes (MNB), bagging (BC), and logistic regression (LR). The analysis shows how each algorithm performs on several metrics and highlights significant differences in classification accuracy and computational efficiency. Alongside the choice of classifier, feature selection plays a crucial role in classification performance. Normalized term frequency-inverse document frequency (TF-IDF) features are used to evaluate the classification techniques systematically across eight distinct categories, highlighting the significant impact of the feature optimization approach. In experiments on the Bangla newspaper dataset, SVM, BC, and LR achieved significantly higher accuracy than the other methods, at 92.76%, 92.64%, and 92.26%, respectively. These findings underscore the effectiveness of SVM, BC, and LR for Bengali document categorization.
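
As a rough illustration of the normalized TF-IDF weighting mentioned in the abstract, the sketch below computes L2-normalized TF-IDF vectors for a toy corpus using only the Python standard library. The smoothed IDF variant, the L2 normalization, the function name `tfidf_vectors`, and the romanized toy documents are all assumptions made for illustration; the abstract does not state the exact formula or preprocessing the authors used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict per tokenized document.

    Assumption: smoothed IDF, idf(t) = ln((1 + N) / (1 + df(t))) + 1,
    which is one common variant and not necessarily the paper's.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights = {t: count * idf[t] for t, count in tf.items()}
        # L2 normalization so every non-empty document has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical whitespace-tokenized documents (romanized for readability).
docs = [
    "khela cricket dol khela".split(),
    "orthoniti bazar dol".split(),
    "cricket match khela".split(),
]
vectors = tfidf_vectors(docs)
```

In a pipeline like the one the abstract describes, vectors of this kind would serve as the input features for each of the compared classifiers. Terms that occur in fewer documents (here `orthoniti`) receive a larger weight than terms shared across documents (here `dol`).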

Index Terms

Computer Science
Information Sciences

Keywords

Document Categorization, Machine Learning, Term Frequency-Inverse Document Frequency, Bengali Document, Natural Language Processing