Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition

Nitin Mishra; C. Patvardhan; C. Vasantha Lakshmi; Sarika Singh

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 39 - Number 6

Year of Publication: 2012

Authors: Nitin Mishra, C. Patvardhan, C. Vasantha Lakshmi, Sarika Singh

DOI: 10.5120/4824-7076

Nitin Mishra, C. Patvardhan, C. Vasantha Lakshmi, Sarika Singh. Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition. International Journal of Computer Applications. 39, 6 (February 2012), 19-23. DOI=10.5120/4824-7076

@article{ 10.5120/4824-7076,

author = { Nitin Mishra and C. Patvardhan and C. Vasantha Lakshmi and Sarika Singh },

title = { Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition },

journal = { International Journal of Computer Applications },

issue_date = { February 2012 },

volume = { 39 },

number = { 6 },

month = { February },

year = { 2012 },

issn = { 0975-8887 },

pages = { 19-23 },

numpages = {5},

url = { https://ijcaonline.org/archives/volume39/number6/4824-7076/ },

doi = { 10.5120/4824-7076 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-06T20:25:44.483412+05:30

%A Nitin Mishra

%A C. Patvardhan

%A C. Vasantha Lakshmi

%A Sarika Singh

%T Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition

%J International Journal of Computer Applications

%@ 0975-8887

%V 39

%N 6

%P 19-23

%D 2012

%I Foundation of Computer Science (FCS), NY, USA

Abstract

The Tesseract OCR Engine is one of the most efficient open-source OCR engines currently available. As of version 3.01, Tesseract can recognize Hindi, but its performance still needs improvement. Recognition accuracy for Hindi is quite low even for printed text, because conjunct character combinations in Hindi partially overlap and are not easily separable. The proposed approach addresses this problem so that Devanagari conjunct characters can be easily segmented and recognized using the Tesseract OCR Engine. This paper presents a complete methodology for improving Hindi recognition accuracy. It also compares the approach with other available Devanagari OCR engines on the basis of recognition accuracy, processing time, font variations, and database size.
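The abstract does not spell out how the shirorekha (the horizontal headline joining the characters of a Devanagari word) is chopped. As a rough illustration of the general idea only, and not the authors' method, the sketch below assumes a binarized word image stored as a list of rows (1 for ink, 0 for background), treats any row whose ink coverage exceeds a threshold as headline, blanks it, and then splits the word at empty columns. The function names and the `min_fill` threshold are hypothetical choices made for this example.

```python
def find_shirorekha_rows(image, min_fill=0.8):
    """Return indices of rows whose ink coverage is at least min_fill.

    image is a list of equal-length rows, 1 for ink and 0 for background.
    min_fill is an assumed threshold, not a value taken from the paper.
    """
    width = len(image[0])
    return [y for y, row in enumerate(image) if sum(row) / width >= min_fill]


def chop_shirorekha(image, min_fill=0.8):
    """Return a copy of image with the detected headline rows blanked out."""
    headline = set(find_shirorekha_rows(image, min_fill))
    return [[0] * len(row) if y in headline else list(row)
            for y, row in enumerate(image)]


def split_columns(image):
    """Return (start, end) column spans of ink separated by empty columns."""
    width = len(image[0])
    column_ink = [sum(row[x] for row in image) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(column_ink):
        if ink and start is None:
            start = x                      # a new glyph begins here
        elif not ink and start is not None:
            spans.append((start, x))       # glyph ended at the previous column
            start = None
    if start is not None:
        spans.append((start, width))       # glyph runs to the right edge
    return spans


# Toy word: two glyphs that touch only through the headline on row 0.
word = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
]

before = split_columns(word)                   # headline keeps the word in one piece
after = split_columns(chop_shirorekha(word))   # glyphs separate once it is removed
print(before)
print(after)
```

On this toy input the word forms a single span before chopping and two spans afterwards. Real printed text would need binarization, skew handling, and a more careful treatment of overlapping conjuncts than a fixed row threshold can provide.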

Index Terms

Computer Science

Information Sciences

Keywords

Tesseract, Hindi OCR, Shirorekha Chopping, Character Segmentation