International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 183 - Number 9
Year of Publication: 2021
Authors: Rudradityo Saha, G. Bharadwaja Kumar
DOI: 10.5120/ijca2021921389
Rudradityo Saha, G. Bharadwaja Kumar. A Novel Approach for Developing Paraphrase Detection System using Machine Learning. International Journal of Computer Applications. 183, 9 (Jun 2021), 29-36. DOI=10.5120/ijca2021921389
Plagiarism detection is difficult because a sentence can be altered at several levels (lexical, semantic, and syntactic) to produce a paraphrased or plagiarized sentence that poses as original. To identify cases of plagiarism, and thereby discourage it, this paper presents a novel supervised machine learning based paraphrase detection system, developed through experiments on the Microsoft Research Paraphrase (MSRP) Corpus and evaluated on the same corpus. The proposed system achieves performance comparable to that of existing paraphrase detection systems. The major contributions of this paper are the use of a unique combination of lexical, semantic, and syntactic features; the use of Shapley Additive Explanations (SHAP) feature importance plots in XGBoost; and the application of a soft voting classifier comprising the top 3 performing standalone machine learning classifiers on the training dataset of the MSRP Corpus. A further contribution is the finding that applying data augmentation techniques degrades the performance of machine learning classifiers.