International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187, Number 14
Year of Publication: 2025
Authors: Sunakshi Mehra
Sunakshi Mehra. SPEAKNet: Spectrogram-Phoneme Embedding Architecture for Knowledge-enhanced Speech Command Recognition. International Journal of Computer Applications 187, 14 (Jun 2025), 27-37. DOI=10.5120/ijca2025925146
This research aims to enhance automatic speech recognition (ASR) by integrating multimodal data, specifically text transcripts and Mel spectrograms generated from raw audio signals. The study explores the often-overlooked role of phonological features and spectrogram-based representations in improving the accuracy of spoken word recognition. A dual-path approach is adopted: EfficientNetV2 extracts features from spectrogram images, while a Speech2Text transformer model generates text transcripts. For evaluation, the study uses ten word categories from version 2 of the Google Speech Commands dataset. A Kalman filter is applied to reduce noise in the audio samples, ensuring cleaner signal processing. The resulting Mel spectrograms are resized to 256×256 pixels to produce two-dimensional visual representations of the audio, which are then classified using EfficientNetV2 pre-trained on the ImageNet dataset. In parallel, a grapheme-to-phoneme (G2P) model converts the Speech2Text outputs into phonemes. These are further processed through a technique called phoneme slicing, which extracts core phonological units, such as fricatives, nasals, liquids, glides, plosives, approximants, taps/flaps, trills, and vowels, based on articulatory features such as manner and place of articulation. The proposed system employs a late fusion strategy that combines phoneme embeddings with image-based embeddings to achieve high classification accuracy. This fusion not only boosts ASR performance but also underscores the value of incorporating linguistic and phonological knowledge into spoken language understanding. A comprehensive ablation analysis demonstrates that integrating spectrograms with phonological analysis sets a new benchmark, outperforming existing models in both accuracy and interpretability.
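The phoneme-slicing and late-fusion steps described above could be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the `MANNER` lookup table covers only a handful of ARPAbet symbols, and the function names and toy embedding vectors are assumptions made for demonstration.

```python
# Illustrative sketch (not the paper's code): group phonemes by manner of
# articulation ("phoneme slicing") and concatenate modality embeddings
# ("late fusion"). The MANNER table is a partial, assumed ARPAbet mapping.

MANNER = {
    "F": "fricative", "S": "fricative", "SH": "fricative", "Z": "fricative",
    "M": "nasal", "N": "nasal", "NG": "nasal",
    "L": "liquid", "R": "liquid",
    "W": "glide", "Y": "glide",
    "P": "plosive", "B": "plosive", "T": "plosive", "D": "plosive",
    "K": "plosive", "G": "plosive",
    "AA": "vowel", "AE": "vowel", "EH": "vowel", "IY": "vowel", "UW": "vowel",
}

def phoneme_slice(phonemes):
    """Group phonemes by manner of articulation, ignoring unknown symbols."""
    slices = {}
    for p in phonemes:
        manner = MANNER.get(p.rstrip("012"))  # drop ARPAbet stress digits
        if manner is not None:
            slices.setdefault(manner, []).append(p)
    return slices

def late_fusion(phoneme_embedding, image_embedding):
    """Concatenate the two modality embeddings into one feature vector."""
    return list(phoneme_embedding) + list(image_embedding)

# Example: ARPAbet phonemes for the command word "stop" (S T AA1 P),
# fused with a toy 3-dimensional image embedding.
slices = phoneme_slice(["S", "T", "AA1", "P"])
fused = late_fusion([0.1, 0.2], [0.3, 0.4, 0.5])
```

In a real system the fused vector would feed a downstream classifier over the ten command categories; concatenation is shown here only because it is the simplest form of late fusion.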