Research Article

A Literature Review of Bangla Document Clustering

by Arefin Niam, Avijit Das, Mahruba Sharmin Chowdhury, Mohammad Abdullah Al Mumin
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 175 - Number 19
Year of Publication: 2020
Authors: Arefin Niam, Avijit Das, Mahruba Sharmin Chowdhury, Mohammad Abdullah Al Mumin
10.5120/ijca2020920716

Arefin Niam, Avijit Das, Mahruba Sharmin Chowdhury, Mohammad Abdullah Al Mumin. A Literature Review of Bangla Document Clustering. International Journal of Computer Applications. 175, 19 (Sep 2020), 28-35. DOI=10.5120/ijca2020920716

@article{ 10.5120/ijca2020920716,
author = { Arefin Niam and Avijit Das and Mahruba Sharmin Chowdhury and Mohammad Abdullah Al Mumin },
title = { A Literature Review of Bangla Document Clustering },
journal = { International Journal of Computer Applications },
issue_date = { Sep 2020 },
volume = { 175 },
number = { 19 },
month = { Sep },
year = { 2020 },
issn = { 0975-8887 },
pages = { 28-35 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume175/number19/31561-2020920716/ },
doi = { 10.5120/ijca2020920716 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:25:29.905615+05:30
%A Arefin Niam
%A Avijit Das
%A Mahruba Sharmin Chowdhury
%A Mohammad Abdullah Al Mumin
%T A Literature Review of Bangla Document Clustering
%J International Journal of Computer Applications
%@ 0975-8887
%V 175
%N 19
%P 28-35
%D 2020
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Document clustering is a machine learning approach that groups documents into related categories without any labels being assigned to the documents beforehand. It organizes very large collections of documents into groups of similar items, which makes finding a particular document easier and also aids data retrieval. There have been numerous works on document clustering in other languages, but the amount of work in Bangla is still insufficient. This paper aims to evaluate the techniques that have been adopted for clustering Bangla documents. These techniques and their effectiveness are also compared with the contemporary methods adopted by researchers around the world for other languages, and a view of the current state of development in Bangla document clustering is proposed.

Index Terms

Computer Science
Information Sciences

Keywords

Data Mining, Document Clustering, Information Retrieval, Text Mining