Research Article

Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition

by Nitin Mishra, C. Patvardhan, C. Vasantha Lakshmi, Sarika Singh
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 39 - Number 6
Year of Publication: 2012
DOI: 10.5120/4824-7076

Nitin Mishra, C. Patvardhan, C. Vasantha Lakshmi, Sarika Singh. Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition. International Journal of Computer Applications. 39, 6 (February 2012), 19-23. DOI=10.5120/4824-7076

@article{ 10.5120/4824-7076,
author = { Nitin Mishra and C. Patvardhan and C. Vasantha Lakshmi and Sarika Singh },
title = { Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition },
journal = { International Journal of Computer Applications },
issue_date = { February 2012 },
volume = { 39 },
number = { 6 },
month = { February },
year = { 2012 },
issn = { 0975-8887 },
pages = { 19-23 },
numpages = {5},
url = { https://ijcaonline.org/archives/volume39/number6/4824-7076/ },
doi = { 10.5120/4824-7076 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:25:44.483412+05:30
%A Nitin Mishra
%A C. Patvardhan
%A C. Vasantha Lakshmi
%A Sarika Singh
%T Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition
%J International Journal of Computer Applications
%@ 0975-8887
%V 39
%N 6
%P 19-23
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The Tesseract OCR engine is one of the most efficient open-source OCR engines currently available. As of version 3.01, Tesseract can recognize Hindi, but its performance still needs improvement. Recognition accuracy for Hindi is quite low even for printed text, because the conjunct character combinations of Hindi are not easily separable due to partial overlapping. The proposed approach addresses this problem so that Devanagari conjunct characters can be segmented and recognized more easily with Tesseract. This paper presents a complete methodology for improving Hindi recognition accuracy. It also compares the result with other available Devanagari OCR engines on the basis of recognition accuracy, processing time, font variations, and database size.
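
The abstract does not spell out the chopping procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a binarized word image (1 = ink, 0 = background) and locates the shirorekha, the headline that joins Devanagari characters, as the densest rows of a horizontal projection profile. Once those rows are blanked, a vertical projection can separate glyphs that the headline had connected. The function names and the `ratio` threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def find_shirorekha_rows(word: Image, ratio: float = 0.8) -> List[int]:
    """Return rows whose ink count is at least `ratio` of the densest row.

    `ratio` is an assumed tuning parameter, not a value from the paper.
    """
    row_ink = [sum(row) for row in word]
    peak = max(row_ink, default=0)
    if peak == 0:
        return []
    return [i for i, ink in enumerate(row_ink) if ink >= ratio * peak]


def chop_shirorekha(word: Image, ratio: float = 0.8) -> Image:
    """Return a copy of the word image with the headline rows blanked out."""
    header = set(find_shirorekha_rows(word, ratio))
    return [[0] * len(row) if i in header else list(row)
            for i, row in enumerate(word)]


def split_columns(word: Image) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, split at empty columns."""
    if not word:
        return []
    width = len(word[0])
    col_ink = [sum(row[c] for row in word) for c in range(width)]
    spans: List[Tuple[int, int]] = []
    start = None
    for c, ink in enumerate(col_ink):
        if ink and start is None:
            start = c                      # a glyph begins at this column
        elif not ink and start is not None:
            spans.append((start, c))       # an empty column closes the glyph
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Toy word: row 0 is the headline joining two glyph-like blobs.
word = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]
print(split_columns(word))                   # [(0, 7)]: one connected span
print(split_columns(chop_shirorekha(word)))  # [(0, 3), (4, 6)]: two spans
```

A real pipeline would likely restrict the headline search to the upper part of the word and handle partially overlapping conjuncts, which simple column gaps cannot separate. Those steps are outside the scope of this sketch.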

Index Terms

Computer Science
Information Sciences

Keywords

Tesseract, Hindi OCR, Shirorekha Chopping, Character Segmentation