International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 93 - Number 17
Year of Publication: 2014
Authors: Shailesh A. Chaudhari, Ravi M. Gulati
DOI: 10.5120/16431-6212
Shailesh A. Chaudhari, Ravi M. Gulati. Script Identification from Bilingual Gujarati-English Documents. International Journal of Computer Applications. 93, 17 (May 2014), 35-40. DOI=10.5120/16431-6212
In a multilingual country like India, English words are frequently interspersed with regional-language text in official papers, school textbooks, and magazines. A bilingual Optical Character Recognition (OCR) system is therefore needed to recognize such documents and store them for future use. In this paper, the authors present an OCR system for script identification in printed documents containing an Indian-language script, Gujarati, and the Roman script. The authors propose line-wise script identification: the spatial spread of pixels in the upper and lower parts associated with Gujarati and English text is used to identify the script. Horizontal projection is used to distinguish lines belonging to different scripts. A K-nearest neighbour algorithm is then used to classify 2000 text lines into two scripts, English and Gujarati, based on four spatial-spread features extracted using connected components and horizontal projection. The proposed algorithm achieves an average classification accuracy as high as 99.70% for bi-script separation.
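The pipeline outlined in the abstract (horizontal projection to find text lines, spatial-spread features per line, then a K-nearest neighbour vote) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the four features, so `spread_features` uses two hypothetical stand-ins (the share of ink in the upper and lower halves of a line), the connected-component step is omitted, and the function names, the toy page, and the training values are all invented for the example.

```python
import math
from collections import Counter


def horizontal_projection(image):
    """Count ink pixels (value 1) in each row of a binary image (list of rows)."""
    return [sum(row) for row in image]


def segment_lines(image):
    """Return (start, end) row ranges, end exclusive, for runs of non-empty rows."""
    lines, start = [], None
    for y, count in enumerate(horizontal_projection(image)):
        if count > 0 and start is None:
            start = y                      # a text line begins
        elif count == 0 and start is not None:
            lines.append((start, y))       # a blank row ends the line
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(image)))
    return lines


def spread_features(image, start, end):
    """Hypothetical features: fraction of ink in the upper and lower halves."""
    profile = horizontal_projection(image[start:end])
    total = sum(profile)                   # > 0, since every row has ink
    mid = len(profile) // 2
    return [sum(profile[:mid]) / total, sum(profile[mid:]) / total]


def knn_classify(train, query, k=3):
    """Majority label among the k training vectors closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy binary page with two text lines separated by a blank row.
    page = [
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
    ]
    # Invented training vectors, only to exercise the code; they say
    # nothing about how real Gujarati or English lines are distributed.
    train = [
        ([0.9, 0.1], "English"), ([0.8, 0.2], "English"),
        ([0.1, 0.9], "Gujarati"), ([0.2, 0.8], "Gujarati"),
    ]
    for start, end in segment_lines(page):
        features = spread_features(page, start, end)
        print((start, end), features, knn_classify(train, features))
```

In a real system the feature vector would hold the four spatial-spread measurements the paper describes, and `k` would be tuned on held-out text lines.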