International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 42
Year of Publication: 2024
Authors: B. Ravikiran, Srinivasu Badugu
DOI: 10.5120/ijca2024924040
B. Ravikiran, Srinivasu Badugu. Sarcasm Detection in Telugu Language Text using Distinct Machine Learning Classification Algorithms. International Journal of Computer Applications. 186, 42 (Sep 2024), 28-35. DOI=10.5120/ijca2024924040
Sarcasm detection is a growing field in Natural Language Processing (NLP). Sarcasm typically uses positive, or intensified positive, words to convey a negative meaning, often to insult or mock others. In sentiment analysis, detecting sarcasm in text has become critical. A review of numerous relevant research articles shows that, because of the Telugu language's limited resources, detecting sarcasm in Telugu text remains challenging. As a result, sentiment detection models struggle to identify the true sentiment of a sarcastic statement, which calls for an automated sarcasm detection system. Many researchers have trained and tested machine learning classification algorithms to identify sarcasm, but these algorithms require a dataset as input, and such datasets often contain noise, so the data is first passed through several preprocessing techniques to remove it. This work uses two datasets: a Telugu conversational dataset gathered from the Kaggle repository and a newly developed dataset called the Telugu News Headline dataset. Annotators labeled the statements as sarcastic or non-sarcastic, and the labeled statements were then input to the proposed model. The proposed model was built using SVM (Support Vector Machine), NB (Naive Bayes), and LR (Logistic Regression), with One Hot Encoding (OHE) used to transform the dataset into vectors that are fed to the sarcasm detection model to determine its accuracy. The model was trained and tested on sentences containing positive or intensified positive words, using 60:40, 70:30, 80:20, and 90:10 train-test split ratios to evaluate and improve model performance. Taking the 70:30 split as the baseline, Logistic Regression was the best of the three algorithms on the Telugu conversational dataset, with accuracy rates of 65.89% on the imbalanced version and 67.01% on the balanced version.
On the Telugu news headline dataset, Logistic Regression achieved an accuracy of 90.07% on the imbalanced version, and SVM achieved an accuracy of 98.35% on the balanced version. Overall, Logistic Regression was more accurate on both the imbalanced and balanced Telugu conversational datasets and on the imbalanced Telugu news headline dataset, whereas SVM was more accurate on the balanced Telugu news headline dataset. In future work, deep learning algorithms can be applied to sarcasm detection in pursuit of better accuracy.
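The pipeline summarized in the abstract (one-hot encoding of sentences, a classifier, and several train-test split ratios) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the whitespace tokenizer, the reading of one-hot encoding as a binary word-presence vector per sentence, the toy English sentences, and the small hand-written logistic regression, which stands in for whatever library implementation the authors used. It is not the paper's code and is not expected to reproduce the reported accuracies.

```python
import math
import random


def build_vocabulary(sentences):
    """Map each distinct whitespace-separated token to a column index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def one_hot_encode(sentence, vocab):
    """Binary vector: 1 where a vocabulary token occurs in the sentence, else 0."""
    vector = [0] * len(vocab)
    for token in sentence.split():
        if token in vocab:  # tokens unseen during training are ignored
            vector[vocab[token]] = 1
    return vector


def split_dataset(rows, labels, train_fraction, seed=0):
    """Shuffle, then split into train and test parts (0.7 gives a 70:30 split)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(round(len(indices) * train_fraction))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def train_logistic_regression(X, y, epochs=200, lr=0.5):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(X, y):
            z = bias + sum(w * f for w, f in zip(weights, features))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            error = 1.0 / (1.0 + math.exp(-z)) - target
            bias -= lr * error
            weights = [w - lr * error * f for w, f in zip(weights, features)]
    return weights, bias


def predict(features, weights, bias):
    """Return 1 (sarcastic) when the linear score is non-negative, else 0."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if z >= 0 else 0


def accuracy(X, y, weights, bias):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(f, weights, bias) == t for f, t in zip(X, y))
    return correct / len(y)


if __name__ == "__main__":
    # Hypothetical toy data (1 = sarcastic, 0 = non-sarcastic); not from the paper.
    sentences = [
        "oh great another traffic jam", "wonderful the power is out again",
        "great job breaking the build", "oh wonderful more homework",
        "lovely weather for a flood", "the match starts at six",
        "the power is back now", "traffic is light today",
        "homework is due on friday", "the build passed this morning",
    ]
    labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    for fraction in (0.6, 0.7, 0.8, 0.9):
        train_s, train_y, test_s, test_y = split_dataset(sentences, labels, fraction)
        vocab = build_vocabulary(train_s)  # vocabulary comes from training data only
        X_train = [one_hot_encode(s, vocab) for s in train_s]
        X_test = [one_hot_encode(s, vocab) for s in test_s]
        weights, bias = train_logistic_regression(X_train, train_y)
        print(f"{round(fraction * 100)}:{round((1 - fraction) * 100)} split, "
              f"toy accuracy = {accuracy(X_test, test_y, weights, bias):.2f}")
```

With so few toy sentences the printed accuracies carry no meaning. The sketch only shows where the split ratio, the encoding step, and the classifier sit relative to one another. Building the vocabulary from the training portion alone is a design choice of this sketch, made so that no information from the test sentences reaches the model.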