International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 183 - Number 21
Year of Publication: 2021
Authors: Maruf Ahmed Mridul, Arnab Sen Sharma |
DOI: 10.5120/ijca2021921582
Maruf Ahmed Mridul, Arnab Sen Sharma. A Simple and Efficient Framework for Sentence Similarity Measurement in Bengali Language. International Journal of Computer Applications. 183, 21 (Aug 2021), 1-7. DOI=10.5120/ijca2021921582
Sentence similarity measurement is crucial to the performance of several Natural Language Processing applications, and it has received much attention, mainly for the English language. However, for low-resource languages like Bengali, very little work has been done in this field. This article proposes a simple approach to measuring sentence similarity scores for low-resource languages. Rather than relying on complex approaches that try to extract lexical information from text, it extracts semantic information using language-agnostic language models based on BERT. Variable-length sentence pairs are embedded into fixed-length feature vectors using different language-agnostic BERT sentence encoders; the differences between the paired embeddings are then measured using standard loss functions; finally, the concatenated loss vectors are used to train a simple feed-forward neural network that predicts the similarity score of each sentence pair. The experiments show that this relatively simple approach gives satisfactory results when trained with Bengali sentence pairs. The approach requires almost no intricate pre-processing steps, which means a similar architecture should work well for other low-resource languages for which well-performing stemmers, lemmatizers, etc. are scarce.
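The data flow described in the abstract (encode each sentence, measure the difference per encoder, concatenate, score with a feed-forward network) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: `toy_encoder` is a hypothetical hashed-trigram stand-in for a language-agnostic BERT sentence encoder, the two "loss vectors" are assumed to be element-wise absolute and squared errors, and the network is a single hidden layer with random, untrained weights.

```python
import numpy as np

DIM = 16  # hypothetical embedding size; real BERT sentence encoders use far larger vectors


def toy_encoder(sentence: str, seed: int = 0, dim: int = DIM) -> np.ndarray:
    """Hypothetical stand-in for a sentence encoder.

    Maps a variable-length sentence to a fixed-length, L2-normalised vector
    by counting hashed character trigrams. It carries no semantics; it only
    reproduces the "variable length in, fixed length out" interface.
    """
    vec = np.zeros(dim)
    padded = f" {sentence} "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # Deterministic hash (the built-in hash() is randomised per process).
        h = sum((j + 1 + seed) * ord(ch) for j, ch in enumerate(trigram))
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Two "different" encoders, mirroring the use of several sentence encoders.
ENCODERS = [lambda s: toy_encoder(s, seed=0), lambda s: toy_encoder(s, seed=1)]


def pair_features(s1: str, s2: str, encoders=ENCODERS) -> np.ndarray:
    """Concatenate element-wise loss vectors for one sentence pair."""
    parts = []
    for encode in encoders:
        u, v = encode(s1), encode(s2)
        parts.append(np.abs(u - v))   # assumed loss 1: element-wise absolute error
        parts.append((u - v) ** 2)    # assumed loss 2: element-wise squared error
    return np.concatenate(parts)


def init_mlp(in_dim: int, hidden: int, rng: np.random.Generator) -> dict:
    """Random weights for a one-hidden-layer network (training is omitted here)."""
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, 1)),
        "b2": np.zeros(1),
    }


def predict(params: dict, x: np.ndarray) -> float:
    """Forward pass: ReLU hidden layer, then a sigmoid score in (0, 1)."""
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])
    logit = hidden @ params["W2"] + params["b2"]
    return float(1.0 / (1.0 + np.exp(-logit[0])))


feats = pair_features("আমি ভাত খাই", "আমি ভাত খাচ্ছি")
params = init_mlp(feats.shape[0], hidden=8, rng=np.random.default_rng(0))
print(feats.shape, predict(params, feats))
```

In a real setup the stand-in encoders would be replaced by pretrained multilingual sentence encoders, and the network weights would be fitted to labelled Bengali sentence pairs. The sketch only shows why no stemming or lemmatization step appears anywhere in the pipeline: the raw sentences go straight into the encoders.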