International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 177 - Number 10
Year of Publication: 2019
Authors: Aamer Zahoor, Nasir Ahmad
10.5120/ijca2019919522 |
Aamer Zahoor, Nasir Ahmad. A Large Multilingual Corpus of Pashto, Urdu, English for Automatic Spoken Language Identification. International Journal of Computer Applications. 177, 10 (Oct 2019), 42-45. DOI=10.5120/ijca2019919522
The availability of a standard, phonetically rich speech corpus provides a common platform for comparing the performance of different speech recognition approaches and is therefore a first step for research in a language. This work presents the development of a large multilingual speech corpus of Pashto, Urdu and English. Recordings were made from a total of 194 speakers in the three languages, covering diverse dialects, age groups, genders and professions. For Pashto and Urdu, both native and non-native speakers were included, while for English all the speakers were non-native. The corpus comprises three categories of phonetically rich spoken data in each language: short questions about the speaker’s personal information, read speech, and spontaneous speech from the domain of tourism. Although the corpus was developed primarily for research on Automatic Spoken Language Identification, it can also be used for research on other topics such as Automatic Speech Recognition, Accent Recognition, Automatic Speaker Identification and the study of the effects of non-nativeness on Language and Speaker Identification.