Research Article

Speech Emotion Recognition Combining Acoustic Features and Linguistic Information using Network Architecture

by Saumyadeep Singh, Syed Wajahat Abbas Rizvi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 5
Year of Publication: 2025
Authors: Saumyadeep Singh, Syed Wajahat Abbas Rizvi
10.5120/ijca2025924860

Saumyadeep Singh, Syed Wajahat Abbas Rizvi. Speech Emotion Recognition Combining Acoustic Features and Linguistic Information using Network Architecture. International Journal of Computer Applications. 187, 5 (May 2025), 30-34. DOI=10.5120/ijca2025924860

@article{ 10.5120/ijca2025924860,
author = { Saumyadeep Singh, Syed Wajahat Abbas Rizvi },
title = { Speech Emotion Recognition Combining Acoustic Features and Linguistic Information using Network Architecture },
journal = { International Journal of Computer Applications },
issue_date = { May 2025 },
volume = { 187 },
number = { 5 },
month = { May },
year = { 2025 },
issn = { 0975-8887 },
pages = { 30-34 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume187/number5/speech-emotion-recognition-combining-acoustic-features-and-linguistic-information-using-network-architecture/ },
doi = { 10.5120/ijca2025924860 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Saumyadeep Singh
%A Syed Wajahat Abbas Rizvi
%T Speech Emotion Recognition Combining Acoustic Features and Linguistic Information using Network Architecture
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 5
%P 30-34
%D 2025
%I Foundation of Computer Science (FCS), NY, USA
Abstract

To recognize speaker emotion more accurately in emotion-driven human-robot interaction, we propose a method that combines acoustic features and linguistic information to improve automatic speech recognition (ASR) performance. The study builds a model with two primary components and classifies emotional states into seven distinct categories. The first component identifies emotion from the audio signal, using pitch contour and spectral energy characteristics as the principal analysis features. The second component exploits linguistic information, detecting emotional keywords and phrases in the spoken content. To assess the effectiveness of the approach, we investigate several classification techniques, including neural networks, support vector machines, linear classifiers, and Gaussian mixture models, evaluating each by the accuracy with which it categorizes emotional states. Finally, a neural network combines the soft decisions of the linguistic and acoustic models, yielding a more comprehensive and reliable emotion recognition system. Two corpora of emotional speech are used for training and validation. The results show that combining linguistic and acoustic information improves emotion recognition accuracy substantially compared with models that rely on either modality alone. Such improvements in speaker emotion recognition are essential for enhancing ASR reliability and human-robot interaction. We also compare our strategy with other approaches, highlighting the measurable benefits of the integration scheme. The results demonstrate that the model identifies emotions reliably across a variety of speech situations, paving the way for more sophisticated and responsive speech recognition systems. By improving emotion recognition algorithms, this work supports more responsive and intuitive human-robot communication, with applications in assistive technology, customer service, and healthcare.
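
The following minimal Python sketch (not part of the published article) illustrates the late-fusion idea described in the abstract, assuming scikit-learn is available: an acoustic classifier and a keyword-based linguistic classifier each produce soft decisions over seven emotion classes, and a small neural network fuses those soft decisions. The feature dimensions, emotion labels, and data are illustrative placeholders, not the authors' actual configuration.

# Hedged late-fusion sketch: acoustic SVM + linguistic classifier, fused by an MLP.
# All features and labels below are synthetic placeholders for illustration only.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["anger", "boredom", "disgust", "fear", "joy", "neutral", "sadness"]
N_CLASSES = len(EMOTIONS)

rng = np.random.default_rng(0)
n = 200
# Placeholder acoustic features (e.g., pitch-contour and spectral-energy statistics).
X_acoustic = rng.normal(size=(n, 32))
# Placeholder linguistic features (e.g., counts of emotional keywords per utterance).
X_linguistic = rng.poisson(1.0, size=(n, 10)).astype(float)
y = rng.integers(0, N_CLASSES, size=n)

# Component 1: acoustic model (SVM with probability outputs as soft decisions).
acoustic_clf = SVC(probability=True).fit(X_acoustic, y)
# Component 2: linguistic model over keyword features.
linguistic_clf = LogisticRegression(max_iter=1000).fit(X_linguistic, y)

# Fusion: concatenate the two soft-decision vectors and train a small neural network.
soft = np.hstack([
    acoustic_clf.predict_proba(X_acoustic),
    linguistic_clf.predict_proba(X_linguistic),
])
fusion_net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(soft, y)

pred = fusion_net.predict(soft)
print("fused prediction for first utterance:", EMOTIONS[pred[0]])

In practice the fusion network would be trained and evaluated on held-out utterances from the emotional speech corpora rather than on the training data itself; the sketch only shows how the soft decisions of the two unimodal models can be combined.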

Index Terms

Computer Science
Information Sciences

Keywords

Automatic Speech Recognition, Emotion Recognition, Human-Robot Interaction, Acoustic Features, Linguistic Information, Neural Network