‘vVISWa’ – A Multilingual Multi-Pose Audio Visual Database for Robust Human Computer Interaction

Prashant Borde; Ramesh Manza; Bharti Gawali; Pravin Yannawar

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 137 - Number 4

Year of Publication: 2016

Authors: Prashant Borde, Ramesh Manza, Bharti Gawali, Pravin Yannawar

DOI: 10.5120/ijca2016908696

Prashant Borde, Ramesh Manza, Bharti Gawali, Pravin Yannawar. ‘vVISWa’ – A Multilingual Multi-Pose Audio Visual Database for Robust Human Computer Interaction. International Journal of Computer Applications. 137, 4 (March 2016), 25-31. DOI=10.5120/ijca2016908696

@article{10.5120/ijca2016908696,
  author = {Prashant Borde and Ramesh Manza and Bharti Gawali and Pravin Yannawar},
  title = {‘vVISWa’ – A Multilingual Multi-Pose Audio Visual Database for Robust Human Computer Interaction},
  journal = {International Journal of Computer Applications},
  issue_date = {March 2016},
  volume = {137},
  number = {4},
  month = {March},
  year = {2016},
  issn = {0975-8887},
  pages = {25-31},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume137/number4/24265-2016908696/},
  doi = {10.5120/ijca2016908696},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T23:37:28.734454+05:30
%A Prashant Borde
%A Ramesh Manza
%A Bharti Gawali
%A Pravin Yannawar
%T ‘vVISWa’ – A Multilingual Multi-Pose Audio Visual Database for Robust Human Computer Interaction
%J International Journal of Computer Applications
%@ 0975-8887
%V 137
%N 4
%P 25-31
%D 2016
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Automatic Speech Recognition (ASR) is an active research topic in signal processing and pattern recognition and has attracted contributions from many researchers. In recent years, there have been many advances in automatic speech reading systems that combine audio and visual speech features to recognize words under noisy conditions. The objective of an audio-visual speech recognition (AVSR) system is to improve recognition accuracy. Developing robust AVSR systems for Human Computer Interaction requires appropriate, simultaneously recorded speech and video data. This paper describes the ‘vVISWa’ (Visual Vocabulary of Independent Standard Words) database, which consists of audio-visual data from 48 native speakers and 10 non-native speakers. These speakers contributed to the corpus in three poses: full frontal, 45° profile, and side pose. The database was primarily designed to support multi-pose audio-visual speech recognition for three languages: ‘Marathi’ (the native language of Maharashtra), ‘Hindi’ (national language of India), and ‘English’ (universal language). It is a multi-pose, multilingual database formed in the Indian context. The database is available on request from http://visbamu.in/viswaDataset.html.
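
As a rough illustration of the corpus dimensions given in the abstract, the following Python sketch builds a hypothetical index over speakers, languages, and poses. The speaker counts, languages, and poses come from the abstract; the `RecordingKey` class, the pose labels, the speaker numbering, and the assumption that every speaker appears in every language and pose are illustrative only and do not describe the database's actual layout.

```python
from dataclasses import dataclass
from itertools import product

# Values taken from the abstract.
LANGUAGES = ("Marathi", "Hindi", "English")
NATIVE_SPEAKERS = 48
NON_NATIVE_SPEAKERS = 10

# Pose labels are hypothetical names for the three poses in the abstract.
POSES = ("frontal", "45_degree", "side")


@dataclass(frozen=True)
class RecordingKey:
    """Hypothetical index key; not the database's actual schema."""
    speaker_id: int
    native: bool
    language: str
    pose: str


def speaker_ids():
    """Yield (speaker_id, native) pairs; the numbering is an assumption."""
    total = NATIVE_SPEAKERS + NON_NATIVE_SPEAKERS
    for sid in range(1, total + 1):
        yield sid, sid <= NATIVE_SPEAKERS


def enumerate_keys():
    """Enumerate every speaker/language/pose combination.

    Assumes, for illustration only, that each speaker appears in
    every language and pose; the abstract does not state this.
    """
    return [
        RecordingKey(sid, native, lang, pose)
        for (sid, native), lang, pose in product(speaker_ids(), LANGUAGES, POSES)
    ]


keys = enumerate_keys()
print(len({k.speaker_id for k in keys}))  # number of distinct speakers
print(len(keys))  # upper bound on speaker/language/pose combinations
```

Under these assumptions the count is only an upper bound on the number of speaker, language, and pose combinations; the actual corpus may cover fewer.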

Index Terms

Computer Science

Information Sciences

Keywords

Automatic Speech Recognition (ASR), Visual Speech Reading (VSR), Multi-pose Audio Visual Speech Recognition (AVSR), ‘vVISWa’