Research Article

Likelihood Ratio Based Score Fusion for Audio-Visual Speaker Identification in Challenging Environment

by Md. Rabiul Islam, Md. Fayzur Rahman
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 6 - Number 7
Year of Publication: 2010
Authors: Md. Rabiul Islam, Md. Fayzur Rahman
DOI: 10.5120/1091-1425

Md. Rabiul Islam, Md. Fayzur Rahman. Likelihood Ratio Based Score Fusion for Audio-Visual Speaker Identification in Challenging Environment. International Journal of Computer Applications 6, 7 (September 2010), 6-11. DOI=10.5120/1091-1425

@article{10.5120/1091-1425,
author = {Md. Rabiul Islam and Md. Fayzur Rahman},
title = {Likelihood Ratio Based Score Fusion for Audio-Visual Speaker Identification in Challenging Environment},
journal = {International Journal of Computer Applications},
issue_date = {September 2010},
volume = {6},
number = {7},
month = {September},
year = {2010},
issn = {0975-8887},
pages = {6-11},
numpages = {6},
url = {https://ijcaonline.org/archives/volume6/number7/1091-1425/},
doi = {10.5120/1091-1425},
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Md. Rabiul Islam
%A Md. Fayzur Rahman
%T Likelihood Ratio Based Score Fusion for Audio-Visual Speaker Identification in Challenging Environment
%J International Journal of Computer Applications
%@ 0975-8887
%V 6
%N 7
%P 6-11
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

It is well known that visual speech information can enhance the performance of noise-robust speaker identification when combined with audio utterances. This paper evaluates a noise-robust audio-visual speaker identification system that uses likelihood ratio based score fusion in challenging environments. While the traditional HMM-based audio-visual speaker identification system is highly sensitive to variation in the speech parameters, the proposed likelihood ratio based score fusion method is found to be stable and improves the robustness and naturalness of human-computer interaction. We investigate the proposed system under typical office environment conditions. To do this, we examine two approaches that combine speech utterances with visual features to improve speaker identification in acoustically and visually challenging environments: the first removes noise from the acoustic and visual features using speech and facial-image pre-processing techniques; the second combines the speech and facial features through multiple Discrete Hidden Markov Model classifiers with likelihood ratio based score fusion. The results show that the proposed system yields a significant performance improvement for audio-visual speaker identification under challenging office environment conditions.
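
To make the fusion step described above concrete, the following is a minimal Python sketch of one common form of likelihood ratio based score fusion, not the paper's actual implementation: it assumes each enrolled speaker's Discrete HMM yields a log-likelihood per modality, a background (universal) model provides the normalising likelihood, and the two modality ratios are combined with a fixed weight. The score values, the weight of 0.7, and the helper name fuse_llr_scores are illustrative placeholders.

import numpy as np

# Hypothetical per-speaker DHMM log-likelihoods for one test utterance
# (three enrolled speakers), one array per modality. Values are placeholders.
audio_loglik = np.array([-412.7, -398.2, -405.9])
video_loglik = np.array([-220.4, -231.8, -218.1])

# Log-likelihoods of the same utterance under a background (universal) model.
audio_bg_loglik = -410.0
video_bg_loglik = -226.0

def fuse_llr_scores(a_ll, v_ll, a_bg, v_bg, w_audio=0.7):
    # Normalise each modality into a log-likelihood ratio against the
    # background model, then combine the two ratios with a convex weight.
    llr_audio = a_ll - a_bg   # log p(O_audio|speaker) - log p(O_audio|background)
    llr_video = v_ll - v_bg
    return w_audio * llr_audio + (1.0 - w_audio) * llr_video

fused = fuse_llr_scores(audio_loglik, video_loglik,
                        audio_bg_loglik, video_bg_loglik)
print("identified speaker index:", int(np.argmax(fused)))  # highest fused score wins

In practice the fusion weight would be tuned on development data, and the background scores would come from a world model trained across all speakers; both are assumptions here rather than details taken from the paper.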

References
  1. D. G. Stork and M. E. Hennecke, Eds., Speechreading by Humans and Machines. Berlin, Germany: Springer, 1996.
  2. R. Campbell, B. Dodd, and D. Burnham, Eds., Hearing by Eye II. Hove, United Kingdom: Psychology Press Ltd. Publishers, 1998.
  3. S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Trans. Multimedia, 2(3):141–151, 2000.
  4. G. Potamianos, J. Luettin, and C. Neti, “Hierarchical discriminant features for audio-visual LVCSR,” Proc. Int. Conf. Acoust., Speech, Signal Process., pp. 165–168, 2001.
  5. G. Potamianos and C. Neti, “Automatic speechreading of impaired speech,” Proc. Conf. Audio-Visual Speech Process., pp. 177–182, 2001.
  6. F.J. Huang and T. Chen, “Consideration of Lombard effect for speechreading,” Proc. Works. Multimedia Signal Process., pp. 613–618, 2001.
  7. G. Potamianos, C. Neti, G. Gravier, A. Garg, and A.W. Senior, “Recent advances in the automatic recognition of audio-visual speech,” To Appear: Proc. IEEE, 2003.
  8. Reynolds, D.A., “Experimental evaluation of features for robust speaker identification,” IEEE Transactions on Speech and Audio Processing, Vol. 2, 1994, 639-643.
  9. Sharma, S., Ellis, D., Kajarekar, S., Jain, P. & Hermansky, H., “Feature extraction using non-linear transformation for robust speech recognition on the Aurora database,” Proc. ICASSP2000, 2000.
  10. Wu, D., Morris, A.C. & Koreman, J., “MLP Internal Representation as Discriminant Features for Improved Speaker Recognition,” Proc. NOLISP2005, Barcelona, Spain, 2005, 25-33.
  11. Konig, Y., Heck, L., Weintraub, M. & Sonmez, K., “Nonlinear discriminant feature extraction for robust text-independent speaker recognition,” Proc. RLA2C, ESCA workshop on Speaker Recognition and its Commercial and Forensic Applications, 1998, 72-75.
  12. C. C. Chibelushi, F. Deravi, and J. S. D. Mason, “A review of speech-based bimodal recognition,” IEEE Trans. Multimedia, vol. 4, pp. 23–37, Mar. 2002.
  13. X. Zhang, C. C. Broun, R. M. Mersereau, and M. Clements, “Automatic speechreading with applications to human-computer interfaces,” EURASIP J. Appl. Signal Processing, vol. 2002, pp. 1228–1247, Nov. 2002.
  14. D. N. Zotkin, R. Duraiswami, and L. S. Davis, “Joint audio-visual tracking using particle filters,” EURASIP J. Appl. Signal Processing, vol. 2002, pp. 1154–1164, Nov. 2002.
  15. P. De Cuetos, C. Neti, and A. Senior, “Audio-visual intent to speak detection for human computer interaction,” in Proc. Int. Conf. Acoust., Speech, Signal Processing, Istanbul, Turkey, June 5–9, 2000, pp. 1325–1328.
  16. D. Sodoyer, J.-L. Schwartz, L. Girin, J. Klinkisch, and C. Jutten, “Separation of audio-visual speech sources: A new approach exploiting the audio-visual coherence of speech stimuli,” EURASIP J. Appl. Signal Processing, vol. 2002, pp. 1165–1173, Nov. 2002.
  17. E. Foucher, L. Girin, and G. Feng, “Audiovisual speech coder: Using vector quantization to exploit the audio/video correlation,” in Proc. Conf. Audio-Visual Speech Processing, Terrigal, Australia, Dec. 4–6, 1998, pp. 67–71.
  18. J. Huang, Z. Liu, Y. Wang, Y. Chen, and E. Wong, “Integration of multimodal features for video scene classification based on HMM,” in Proc. Works. Multimedia Signal Processing, Copenhagen, Denmark, Sept. 13–15, 1999, pp. 53–58.
  19. M. M. Cohen and D. W. Massaro, “What can visual speech synthesis tell visual speech recognition?,” in Proc. Asilomar Conf. Signals, Systems, Computers, Pacific Grove, CA, 1994.
  20. E. Cosatto and H. P. Graf, “Photo-realistic talking-heads from image samples,” IEEE Trans. Multimedia, vol. 2, pp. 152–163, Sept. 2000.
  21. Gerasimos Potamianos, Chalapathy Neti, and Sabine Deligne, “Joint Audio-Visual Speech Processing for Recognition and Enhancement,” Auditory-Visual Speech Processing Tutorial and Research Workshop (AVSP), pp. 95-104, St. Jorioz, France, September 2003.
  22. Simon Doclo and Marc Moonen, “On the Output SNR of the Speech-Distortion Weighted Multichannel Wiener Filter,” IEEE Signal Processing Letters, Vol. 12, No. 12, December 2005.
  23. Wiener, N., Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications. Wiley, New York, 1949.
  24. Wiener, N., Paley, R. E. A. C., “Fourier Transforms in the Complex Domain,” American Mathematical Society, Providence, RI, 1934.
  25. Koji Kitayama, Masataka Goto, Katunobu Itou and Tetsunori Kobayashi, “Speech Starter: Noise-Robust Endpoint Detection by Using Filled Pauses,” Eurospeech 2003, Geneva, pp. 1237-1240.
  26. S. E. Bou-Ghazale and K. Assaleh, “A robust endpoint detection of speech for noisy environments with application to automatic speech recognition,” in Proc. ICASSP2002, vol. 4, 2002, pp. 3808–3811.
  27. A. Martin, D. Charlet, and L. Mauuary, “Robust speech / non-speech detection using LDA applied to MFCC,” in Proc. ICASSP2001, vol. 1, 2001, pp. 237–240.
  28. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, Wiley-Interscience, John Wiley & Sons, Inc., Second Edition, 2001.
  29. Sarma, V., Venugopal, D., “Studies on pattern recognition approach to voiced-unvoiced-silence classification,” Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '78), Vol. 3, Apr. 1978, pp. 1-4.
  30. Qi Li, Jinsong Zheng, Augustine Tsai, Qiru Zhou, “Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition,” IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 3, March 2002.
  31. Harrington, J., and Cassidy, S., Techniques in Speech Acoustics. Kluwer Academic Publishers, Dordrecht, 1999.
  32. Makhoul, J., “Linear prediction: a tutorial review,” Proceedings of the IEEE 63, 4 (1975), 561–580.
  33. Picone, J., “Signal modeling techniques in speech recognition,” Proceedings of the IEEE 81, 9 (1993), 1215–1247.
  34. Claudio Becchetti and Lucio Prina Ricotti, Speech Recognition Theory and C++ Implementation, John Wiley & Sons Ltd., 1999, pp. 124-136.
  35. L.P. Cordella, P. Foggia, C. Sansone, M. Vento, "A Real-Time Text-Independent Speaker Identification System", Proceedings of 12th International Conference on Image Analysis and Processing, IEEE Computer Society Press, Mantova, Italy, pp. 632-637, September 2003.
  36. J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. Macmillan, 1993.
  37. F. J. Owens, Signal Processing of Speech. Macmillan New Electronics. Macmillan, 1993.
  38. F. Harris, “On the use of windows for harmonic analysis with the discrete Fourier transform,” Proceedings of the IEEE, Vol. 66, No. 1 (1978), pp. 51-84.
  39. J. Proakis and D. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications. Second edition, Macmillan Publishing Company, New York, 1992.
  40. D. Kewley-Port and Y. Zheng, “Auditory models of formant frequency discrimination for isolated vowels,” Journal of the Acoustical Society of America, 103(3):1654-1666, 1998.
  41. D. O’Shaughnessy, Speech Communication - Human and Machine, Addison Wesley, 1987.
  42. E. Zwicker, “Subdivision of the audible frequency band into critical bands (Frequenzgruppen),” Journal of the Acoustical Society of America, 33:248–260, 1961.
  43. S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences”, IEEE Transactions on Acoustics Speech and Signal Processing, 28:357–366, Aug 1980.
  44. M. Hwang, X. Huang, "Shared-Distribution Hidden Markov Models for Speech Recognition", IEEE Trans. on Speech and Audio Processing, Vol. 1, No. 4, pp. 414-420, April 1993.
  45. L.E. Baum, T. Petrie, G. Soules, and N. Weiss, “A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains", The Annals of Mathematical Statistics, 41, 1970, pp. 164-171.
  46. R.J.Elliott, L. Aggoun, and J.B. Moore, “Hidden Markov Models: Estimation and Control”, Applications of Mathematics: Stochastic Modeling and Applied Probability, Vol. 29, Springer, Berlin, 1997.
  47. Stephen Milborrow and Fred Nicolls, “Locating Facial Features with an Extended Active Shape Model,” available at http://www.milbo.org/stasm-files/locating-facial-features-with-an-extended-asm.pdf.
  48. R. Herpers, G. Verghese, K. Derpains and R. McCready, “Detection and tracking of face in real environments,” IEEE Int. Workshop on Recognition, Analysis and Tracking of Face and Gesture in Real-Time Systems, Corfu, Greece, pp. 96-104, 1999.
  49. J. Daugman, “Face detection: a survey,” Computer Vision and Image Understanding, Vol. 83, No. 3, pp. 236-274, 2001.
  50. Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing. Addison-Wesley, 2002.
  51. A. Rogozan, P.S. Sathidevi, “Static and dynamic features for improved HMM based visual speech recognition,” 1st International Conference on Intelligent Human Computer Interaction (Allahabad, India, 2009), pp. 184-194.
  52. J. S. Lee, C. H. Park, “Adaptive Decision Fusion for Audio-Visual Speech Recognition,” Speech Recognition, Technologies and Applications, ed. F. Mihelic, J. Zibert, (Vienna, Austria, 2008), pp. 550, 2008.
  53. A. Adjoudani, C. Benoît, “On the integration of auditory and visual parameters in an HMM-based ASR,” Speechreading by Humans and Machines: Models, Systems, and Applications, ed. D. G. Stork and M. E. Hennecke, (Springer, Berlin, Germany, 1996), pp. 461-472.
  54. N. A. Fox, B. A. O'Mullane and R. B. Reilly, “The Realistic Multi-modal VALID database and Visual Speaker Identification Comparison Experiments,” Proc. of the 5th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA-2005), New York, 2005.
Index Terms

Computer Science
Information Sciences

Keywords

Audio-Visual Speaker Identification, Cepstral-Based Features, Feature Fusion, Decision Fusion, Likelihood Ratio Based Score Fusion, Discrete Hidden Markov Model