International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 61
Year of Publication: 2025
Authors: Abu Jafar Md Jakaria, Rajarshi Roy Chowdhury, Jaima Jaman Konia, Debashish Roy, Nishat Tasnim Ahmed Meem
Abu Jafar Md Jakaria, Rajarshi Roy Chowdhury, Jaima Jaman Konia, Debashish Roy, Nishat Tasnim Ahmed Meem. A Comparative Study on different Machine Learning Approaches for Categorizing Bangla Documents. International Journal of Computer Applications. 186, 61 (Jan 2025), 32-39. DOI=10.5120/ijca2025924391
Document categorization (DC) is a technique for efficiently determining the category of a document within a reasonable timeframe. It is essential for information retrieval, organization, and analysis: it enables quick identification of relevant documents, supports effective search, and streamlines decision-making. This paper presents a comparative analysis of nine well-known supervised machine learning (ML) approaches for automatic Bengali document categorization: random forest (RF), k-nearest neighbors (KNN), support vector machine (SVM), decision tree (DT), Bernoulli naïve Bayes (BNB), complement naïve Bayes (CNB), multinomial naïve Bayes (MNB), bagging (BC), and logistic regression (LR). The comparison reports how each algorithm performs on several metrics and highlights significant differences in classification accuracy and computational efficiency. Alongside the choice of classifier, feature selection plays a crucial role in classification performance. Normalized term frequency-inverse document frequency (TF-IDF) features are used to evaluate the classifiers systematically across eight distinct categories, highlighting the significant impact of the feature optimization approach. In experiments on the Bangla newspaper dataset, SVM, BC, and LR achieved significantly higher accuracy than the other methods: 92.76% for SVM, 92.64% for BC, and 92.26% for LR. These findings underscore the effectiveness of SVM, BC, and LR for Bengali document categorization and their ability to deliver consistently high accuracy.
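To make the feature representation concrete, the following is a minimal sketch of normalized TF-IDF weighting over tokenized documents. The abstract does not state which IDF variant or normalization the authors used, so the smoothed IDF formula and the L2 normalization here are assumptions. The function name `tfidf_l2` and the transliterated toy tokens are hypothetical and are not taken from the paper's dataset.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """Return one {term: weight} dict per tokenized document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        # Assumed smoothed IDF variant: ln((1 + N) / (1 + df)) + 1.
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        }
        # Scale the vector to unit Euclidean length (assumed normalization).
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors


# Hypothetical toy corpus of transliterated tokens, for illustration only.
docs = [
    ["khela", "dol", "khela"],
    ["dol", "bajar"],
    ["bajar", "dam", "dam"],
]
vectors = tfidf_l2(docs)
for vec in vectors:
    print({term: round(weight, 3) for term, weight in sorted(vec.items())})
```

In this sketch, a term that is frequent within one document but rare across the corpus receives a larger weight than a term shared by many documents. Vectors of this kind would then be passed to each of the nine classifiers being compared.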