Research Article

A Statistical Approach for Estimating Language Model Reliability with Effective Smoothing Technique

by Gend Lal Prajapati, Rekha Saha
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 123 - Number 16
Year of Publication: 2015
Authors: Gend Lal Prajapati, Rekha Saha
10.5120/ijca2015905763

Gend Lal Prajapati, Rekha Saha. A Statistical Approach for Estimating Language Model Reliability with Effective Smoothing Technique. International Journal of Computer Applications. 123, 16 (August 2015), 31-35. DOI=10.5120/ijca2015905763

@article{ 10.5120/ijca2015905763,
author = { Gend Lal Prajapati, Rekha Saha },
title = { A Statistical Approach for Estimating Language Model Reliability with Effective Smoothing Technique },
journal = { International Journal of Computer Applications },
issue_date = { August 2015 },
volume = { 123 },
number = { 16 },
month = { August },
year = { 2015 },
issn = { 0975-8887 },
pages = { 31-35 },
numpages = { 5 },
url = { https://ijcaonline.org/archives/volume123/number16/22046-2015905763/ },
doi = { 10.5120/ijca2015905763 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Gend Lal Prajapati
%A Rekha Saha
%T A Statistical Approach for Estimating Language Model Reliability with Effective Smoothing Technique
%J International Journal of Computer Applications
%@ 0975-8887
%V 123
%N 16
%P 31-35
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Language model smoothing is an essential technique for handling unseen test data: it re-estimates zero-probability n-grams and assigns them small non-zero probabilities. A variety of smoothing techniques discount a small amount of probability mass from observed n-grams and redistribute it to the unseen n-grams within a language model. Kneser–Ney smoothing and Latent Dirichlet Allocation are two such techniques used for effective smoothing. In this paper, a scheme is proposed for effective smoothing that combines the Kneser–Ney and Latent Dirichlet Allocation approaches. A second scheme is proposed to measure the reliability of a language model and to determine the association between entropy and perplexity. Both schemes are demonstrated with appropriate examples.
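The association between entropy and perplexity referred to in the abstract is the standard one: perplexity is 2 raised to the per-word cross-entropy measured in bits. A minimal Python sketch of that relation (the word probabilities below are made-up illustration values, not data from the paper):

```python
import math

def entropy_and_perplexity(probs):
    """Per-word cross-entropy (in bits) and perplexity of a test
    sequence, given the model's probability for each test word.

    H = -(1/N) * sum(log2 p(w_i));  PP = 2 ** H
    """
    n = len(probs)
    h = -sum(math.log2(p) for p in probs) / n
    return h, 2 ** h

# Toy example: a model that assigns probability 1/4 to each of
# four test words has 2 bits of entropy and perplexity 4 -- it is
# as "confused" as a fair choice among 4 alternatives.
h, pp = entropy_and_perplexity([0.25, 0.25, 0.25, 0.25])
print(h, pp)  # 2.0 4.0
```

A lower perplexity on held-out text indicates a more reliable model; smoothing matters here because a single zero-probability n-gram would make the entropy (and hence perplexity) infinite.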

References
  1. Siivola, V., Hirsimäki, T. and Virpioja, S. 2007. On Growing and Pruning Kneser–Ney Smoothed N-gram Models. IEEE Transactions on Audio, Speech, and Language Processing. 1617-1624.
  2. Sethy, A., Georgiou, P., Ramabhadran, B. and Narayanan, S. 2007. An Iterative Relative Entropy Minimization Based Data Selection Approach for N-gram Model Adaptation. IEEE Transactions on Audio, Speech, and Language Processing. 13-23.
  3. Blei, D. M., Ng, A. Y. and Jordan, M. I. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research. 993–1022.
  4. Witten, I.H. and Bell, T.C. 1991. The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression. IEEE Transactions on Information Theory. 1085–1094.
  5. Gao, J. and Lee, K.F. 2000. Distribution-based pruning of backoff language models. Association for Computational Linguistics. 579-588.
  6. Yuret, D. 2008. Smoothing a tera-word language model. Association for Computational Linguistics. 141-144.
  7. Chen, S.F. and Goodman, J.T. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language. 359–394.
  8. Hazem, A. and Morin E. 2013. A Comparison of Smoothing Techniques for Bilingual Lexicon Extraction from Comparable Corpora. Association for Computational Linguistics. 24-33.
  9. Shen, Z.Y., Sun, J. and Shen, Y.D. 2008. Collective Latent Dirichlet Allocation. IEEE International Conference on Data Mining (ICDM). 1019-1024.
  10. Chen, S., Beeferman, D. and Rosenfeld, R. 2002. Evaluation Metrics for Language Models. Association for Computational Linguistics. 176-182.
  11. Kim, W., Khudanpur, S. and Wu, J. 2001. Smoothing Issues in the Structured Language Model. EuroSpeech. 717-720.
  12. Gao, J. and Zhang, M. 2002. Improving language model size reduction using better pruning criteria. Association for Computational Linguistics. 176-182.
  13. Taraba, B. 2007. Kneser–Ney Smoothing With a Correcting Transformation for Small Data Sets. IEEE Transactions on Audio, Speech, and Language Processing. 1912-1921.
  14. Zhai, C. and Lafferty, J. 2001. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. SIGIR Conference on Research and Development in Information Retrieval. 334-342.
  15. Wei, X. and Croft, W. B. 2006. LDA-Based Document Models for Ad-hoc Retrieval. SIGIR Conference on Research and Development in Information Retrieval. 178-185.
  16. Chung, Y.M. and Lee, J.E. 2001. A Corpus-Based Approach to Comparative Evaluation of Statistical Term Association Measures. Journal of the American Society for Information Science and Technology. 283–296.
  17. Huang, F.L., Yu, M.S. and Hwang, C.Y. 2013. An Empirical Study of Good-Turing Smoothing for Language Models on Different Size Corpora of Chinese. Journal of Computer and Communications. 14-19.
  18. Ding, G. and Wang, B. 2005. GJM-2: A Special Case of General Jelinek-Mercer Smoothing Method. In G.G. Lee et al. (Eds.): AIRS, Vol. 3689. Springer-Verlag Berlin Heidelberg. 491–496.
  19. Sundermeyer, M., Schlüter, R. and Ney, H. 2011. On the Estimation of Discount Parameters for Language Model Smoothing. Interspeech, Florence, Italy. 1433-1436.
Index Terms

Computer Science
Information Sciences

Keywords

Smoothing, Pruning, Entropy, Perplexity, Data Sparsity, Statistical Control, Information Retrieval