Research Article

Script Identification from Bilingual Gujarati-English Documents

by Shailesh A. Chaudhari, Ravi M. Gulati
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 93 - Number 17
Year of Publication: 2014
Authors: Shailesh A. Chaudhari, Ravi M. Gulati
DOI: 10.5120/16431-6212

Shailesh A. Chaudhari, Ravi M. Gulati. Script Identification from Bilingual Gujarati-English Documents. International Journal of Computer Applications. 93, 17 (May 2014), 35-40. DOI=10.5120/16431-6212

@article{ 10.5120/16431-6212,
author = { Shailesh A. Chaudhari and Ravi M. Gulati },
title = { Script Identification from Bilingual Gujarati-English Documents },
journal = { International Journal of Computer Applications },
issue_date = { May 2014 },
volume = { 93 },
number = { 17 },
month = { May },
year = { 2014 },
issn = { 0975-8887 },
pages = { 35-40 },
numpages = {6},
url = { https://ijcaonline.org/archives/volume93/number17/16431-6212/ },
doi = { 10.5120/16431-6212 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:16:21.563704+05:30
%A Shailesh A. Chaudhari
%A Ravi M. Gulati
%T Script Identification from Bilingual Gujarati-English Documents
%J International Journal of Computer Applications
%@ 0975-8887
%V 93
%N 17
%P 35-40
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

In a multilingual country like India, English words are often interspersed with regional-language text in official papers, school textbooks, and magazines. A bilingual Optical Character Recognition (OCR) system is therefore needed that can recognize such documents and store them for future use. In this paper, the authors present an OCR system for script identification in printed documents containing Gujarati and Roman (English) scripts. Script identification is performed line by line. The spatial spread of pixels in the upper and lower parts of Gujarati and English text lines is used to identify the script. Horizontal projection is used to separate the text lines belonging to different scripts. A K-nearest neighbour (kNN) classifier then assigns 2000 text lines to one of two scripts, English or Gujarati, based on four spatial spread features extracted using connected components and horizontal projection. The proposed algorithm achieves an average classification accuracy as high as 99.70% for bi-script separation.
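
The pipeline outlined in the abstract (line separation by horizontal projection, spatial spread features, kNN classification) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `segment_lines`, `zone_spread`, and `knn_predict` are hypothetical, the two-value upper/lower feature only stands in for the four spatial spread features that the abstract does not define, and the input is assumed to be an already binarized page (1 for ink, 0 for background).

```python
import math
from collections import Counter


def horizontal_projection(binary):
    """Count ink pixels (value 1) in each row of a binarized image."""
    return [sum(row) for row in binary]


def segment_lines(binary):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    A text line is a maximal run of rows whose projection is non-zero.
    """
    profile = horizontal_projection(binary)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                      # first inked row of a line
        elif count == 0 and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


def zone_spread(binary, top, bottom):
    """Illustrative feature (an assumption, not the paper's feature set):
    share of ink in the upper half and in the lower half of a text line."""
    rows = horizontal_projection(binary[top:bottom])
    total = sum(rows) or 1                 # avoid division by zero
    mid = len(rows) // 2
    return [sum(rows[:mid]) / total, sum(rows[mid:]) / total]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to `query`.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In use, each range returned by `segment_lines` would be passed to a feature extractor, and the resulting vector to `knn_predict` together with labelled training lines. The choice of `k`, the distance measure, and the feature definition are all assumptions here; the abstract specifies none of them.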

Index Terms

Computer Science
Information Sciences

Keywords

Pre-processing, Segmentation, Vector, kNN Classifier, etc.