Enhancing Audio Classification with a CNN-Attention Model: Robust Performance and Resilience Against Backdoor Attacks

Syed Murtoza Mushrul Pasha; Shahidur Rahoman Sohag; Muhammad Mahin Ali

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 186 - Number 49

Year of Publication: 2024

Authors: Syed Murtoza Mushrul Pasha, Shahidur Rahoman Sohag, Muhammad Mahin Ali

DOI: 10.5120/ijca2024924154

Syed Murtoza Mushrul Pasha, Shahidur Rahoman Sohag, Muhammad Mahin Ali. Enhancing Audio Classification with a CNN-Attention Model: Robust Performance and Resilience Against Backdoor Attacks. International Journal of Computer Applications. 186, 49 (Nov 2024), 26-33. DOI=10.5120/ijca2024924154

@article{10.5120/ijca2024924154,
  author = {Syed Murtoza Mushrul Pasha and Shahidur Rahoman Sohag and Muhammad Mahin Ali},
  title = {Enhancing Audio Classification with a CNN-Attention Model: Robust Performance and Resilience Against Backdoor Attacks},
  journal = {International Journal of Computer Applications},
  issue_date = {Nov 2024},
  volume = {186},
  number = {49},
  month = {Nov},
  year = {2024},
  issn = {0975-8887},
  pages = {26-33},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume186/number49/enhancing-audio-classification-with-a-cnn-attention-model-robust-performance-and-resilience-against-backdoor-attacks/},
  doi = {10.5120/ijca2024924154},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-11-27T00:39:32.272794+05:30
%A Syed Murtoza Mushrul Pasha
%A Shahidur Rahoman Sohag
%A Muhammad Mahin Ali
%T Enhancing Audio Classification with a CNN-Attention Model: Robust Performance and Resilience Against Backdoor Attacks
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 49
%P 26-33
%D 2024
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Audio classification plays a vital role in fields such as communication, medical diagnostics, and forensic analysis, where accurate and reliable processing of audio signals is critical. This study presents a Convolutional Neural Network (CNN)-Attention framework designed to improve both accuracy and robustness in audio classification, including robustness to adversarial threats such as backdoor attacks, which compromise model reliability. Evaluated on the UrbanSound8K, FSDKaggle2018, and ESC-50 benchmark datasets, the framework achieves up to 43.16% higher accuracy than traditional CNN models and a peak accuracy of 98.41% on UrbanSound8K. The system also exhibits strong resilience against adversarial attacks, maintaining the integrity and reliability of its predictions under challenging conditions. By integrating attention mechanisms with data augmentation techniques such as time-stretching and pitch-shifting, the framework improves testing accuracy by 9.74%, 33.53%, and 43.16% across the three datasets, respectively. These results highlight its potential to process and analyze audio data effectively across varied environments. They also establish a reference point for audio classification in applications that demand high reliability and precision, including environmental monitoring, assistive technologies, and intelligent surveillance systems.
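
The abstract does not describe how the attention mechanism is wired into the CNN, so the following is only a minimal sketch of one common option: softmax attention pooling over per-frame feature vectors, written in plain Python. The function names, the `query` vector, and the toy numbers are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse per-frame feature vectors into one clip-level vector.

    frames: list of T feature vectors (for example, CNN outputs per time step).
    query:  hypothetical learned vector of the same dimension; in a real
            model it would be trained jointly with the CNN.
    """
    dim = len(query)
    # Scaled dot-product score: how relevant each frame is to the query.
    scores = [
        sum(f[i] * query[i] for i in range(dim)) / math.sqrt(dim)
        for f in frames
    ]
    weights = softmax(scores)
    # Weighted average of the frames, using the attention weights.
    pooled = [
        sum(w * f[i] for w, f in zip(weights, frames))
        for i in range(dim)
    ]
    return pooled, weights


# Toy input: three frames with two features each (made-up numbers).
frames = [[0.1, 0.0], [2.0, 1.5], [0.2, 0.1]]
query = [1.0, 1.0]
pooled, weights = attention_pool(frames, query)
```

In this toy input the second frame scores highest against the query, so it receives the largest weight and dominates the pooled vector. That is the intuition usually given for attention in audio models: frames carrying the salient sound event contribute more than silence or background.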

Index Terms

Computer Science

Information Sciences

Machine Learning Security

Keywords

Audio Classification, CNN, Attention Mechanism, Data Augmentation, Backdoor Attacks