Research Article

A Comparative Analysis of Feature Extraction Techniques and Classifiers Inaccuracies for Bilingual Printed Documents (Gujarati-English)

Published in July 2016 by Shailesh A. Chaudhari, Ravi M. Gulati
International Conference on Communication Computing and Virtualization
Foundation of Computer Science USA
ICCCV2016 - Number 1
July 2016
Authors: Shailesh A. Chaudhari, Ravi M. Gulati

Shailesh A. Chaudhari, Ravi M. Gulati. A Comparative Analysis of Feature Extraction Techniques and Classifiers Inaccuracies for Bilingual Printed Documents (Gujarati-English). International Conference on Communication Computing and Virtualization. ICCCV2016, 1 (July 2016), 16-20.

@article{
author = { Shailesh A. Chaudhari and Ravi M. Gulati },
title = { A Comparative Analysis of Feature Extraction Techniques and Classifiers Inaccuracies for Bilingual Printed Documents (Gujarati-English) },
journal = { International Conference on Communication Computing and Virtualization },
issue_date = { July 2016 },
volume = { ICCCV2016 },
number = { 1 },
month = { July },
year = { 2016 },
issn = { 0975-8887 },
pages = { 16-20 },
numpages = 5,
url = { /proceedings/icccv2016/number1/914-1654/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 International Conference on Communication Computing and Virtualization
%A Shailesh A. Chaudhari
%A Ravi M. Gulati
%T A Comparative Analysis of Feature Extraction Techniques and Classifiers Inaccuracies for Bilingual Printed Documents (Gujarati-English)
%J International Conference on Communication Computing and Virtualization
%@ 0975-8887
%V ICCCV2016
%N 1
%P 16-20
%D 2016
%I International Journal of Computer Applications
Abstract

In a bilingual or multilingual optical character recognition (OCR) system, script identification is a challenging task. A considerable body of research on script identification has been reported for both Indian and non-Indian scripts. Many commercial and official regional documents in the states of India are bilingual: they contain the regional language of the respective state with English words interspersed. Script identification is therefore one of the primary tasks in multi-script document recognition. This paper presents word-level script identification for Gujarati and English. Two approaches are used for feature extraction: the first extracts statistical features of a word, and the second extracts Gabor features of a word using Gabor filters with suitable frequencies and orientations. The proposed system uses two classifiers, k-NN and SVM with different kernel functions, to classify the extracted features into one of the two scripts. The experiments indicate that SVM outperforms k-NN.
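
The Gabor-feature approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The frequencies, the number of orientations, the kernel size, the choice of mean and standard deviation of each filter response as features, and the helper names (gabor_kernel, filter_response, gabor_features, knn_predict, stripes) are all assumptions made for illustration. Toy stripe images stand in for word images, and only the k-NN classifier is shown; an SVM would consume the same feature vectors.

```python
import math

def gabor_kernel(frequency, theta, sigma=2.0, half_size=4):
    """Real part of a Gabor filter: Gaussian envelope times a cosine carrier.
    sigma and half_size are illustrative defaults, not values from the paper."""
    kernel = []
    for y in range(-half_size, half_size + 1):
        row = []
        for x in range(-half_size, half_size + 1):
            # Rotate coordinates so the carrier varies along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * frequency * xr))
        kernel.append(row)
    return kernel

def filter_response(image, kernel):
    """Valid-mode 2-D correlation of an image with a kernel, flattened to a list."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        for j in range(len(image[0]) - kw + 1):
            out.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
    return out

def gabor_features(image, frequencies=(0.125, 0.25), n_orientations=4):
    """Mean and standard deviation of each filter response (assumed feature set)."""
    feats = []
    for f in frequencies:
        for k in range(n_orientations):
            resp = filter_response(image, gabor_kernel(f, k * math.pi / n_orientations))
            mean = sum(resp) / len(resp)
            var = sum((r - mean) ** 2 for r in resp) / len(resp)
            feats.extend([mean, math.sqrt(var)])
    return feats

def knn_predict(train_feats, train_labels, query, k=3):
    """Majority vote among the k training vectors closest in Euclidean distance."""
    nearest = sorted(zip(train_feats, train_labels),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def stripes(size, period, vertical, phase=0):
    """Toy binary stripe image; a hypothetical stand-in for a word image."""
    def pixel(i, j):
        pos = j if vertical else i
        return 1.0 if (pos + phase) % period < period // 2 else 0.0
    return [[pixel(i, j) for j in range(size)] for i in range(size)]

# Toy training set: two texture classes play the role of the two scripts.
train_images = [stripes(16, 4, True, 0), stripes(16, 4, True, 1),
                stripes(16, 4, False, 0), stripes(16, 4, False, 1)]
train_labels = ["vertical", "vertical", "horizontal", "horizontal"]
train_feats = [gabor_features(img) for img in train_images]

query = gabor_features(stripes(16, 4, True, 2))
print(knn_predict(train_feats, train_labels, query))
```

With real data, each training image would be a segmented Gujarati or English word, and the labels would be the two script names. The sketch uses only the standard library to stay self-contained; a practical system would more likely rely on a numerical library for the filtering step.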

Index Terms

Computer Science
Information Sciences

Keywords

Gabor Filter, Support Vector Machine, Feature Extraction