Research Article

Multilingual OCR (MOCR): An Approach to Classify Words to Languages

by Mohammad Abu Obaida, Md. Jakir Hossain, Momotaz Begum, Md. Shahin Alam
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 32 - Number 1
Year of Publication: 2011
Authors: Mohammad Abu Obaida, Md. Jakir Hossain, Momotaz Begum, Md. Shahin Alam
DOI: 10.5120/3872-5414

Mohammad Abu Obaida, Md. Jakir Hossain, Momotaz Begum, Md. Shahin Alam. Multilingual OCR (MOCR): An Approach to Classify Words to Languages. International Journal of Computer Applications. 32, 1 (October 2011), 46-53. DOI=10.5120/3872-5414

@article{ 10.5120/3872-5414,
author = { Mohammad Abu Obaida and Md. Jakir Hossain and Momotaz Begum and Md. Shahin Alam },
title = { Multilingual OCR (MOCR): An Approach to Classify Words to Languages },
journal = { International Journal of Computer Applications },
issue_date = { October 2011 },
volume = { 32 },
number = { 1 },
month = { October },
year = { 2011 },
issn = { 0975-8887 },
pages = { 46-53 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume32/number1/3872-5414/ },
doi = { 10.5120/3872-5414 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:18:04.166988+05:30
%A Mohammad Abu Obaida
%A Md. Jakir Hossain
%A Momotaz Begum
%A Md. Shahin Alam
%T Multilingual OCR (MOCR): An Approach to Classify Words to Languages
%J International Journal of Computer Applications
%@ 0975-8887
%V 32
%N 1
%P 46-53
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Considerable effort has gone into designing complete OCR systems for most of the world’s leading languages; multilingual documents, whether handwritten or printed, remain difficult to handle, however. As a unified attempt, Unicode-based OCRs have been studied most, with some positive outcomes, although a large character set slows recognition significantly. In this paper, we present a method that assigns each word to a language once word segmentation is complete. For this purpose, we identified the characteristics of the writing of several languages and used a projection method combined with other feature extraction methods. In addition, this paper proposes a modified statistical approach to correct skew before a segmented document is processed. The proposed procedure, evaluated on a collection of both handwritten and printed documents, produced excellent results in assigning words to languages.
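
The abstract does not spell out how the projection step works, so the following is only a minimal Python sketch of what projection profiles of a segmented word might look like. It assumes each word arrives as a binary image (1 for ink, 0 for background). The `has_headline` feature and its `fill_ratio` threshold are hypothetical illustrations, not the authors' method: they rely on the fact that some scripts, such as Bangla or Devanagari, join letters with a dense horizontal stroke near the top of the word, while Latin script does not.

```python
from typing import List

Image = List[List[int]]  # binary word image: 1 = ink pixel, 0 = background


def horizontal_projection(img: Image) -> List[int]:
    """Count ink pixels in each row of the word image."""
    return [sum(row) for row in img]


def vertical_projection(img: Image) -> List[int]:
    """Count ink pixels in each column of the word image."""
    return [sum(col) for col in zip(*img)]


def has_headline(img: Image, fill_ratio: float = 0.8) -> bool:
    """Hypothetical feature: is some row in the upper half almost fully inked?

    fill_ratio is an assumed threshold, not a value taken from the paper.
    """
    if not img or not img[0]:
        return False
    width = len(img[0])
    profile = horizontal_projection(img)
    upper_half = profile[: max(1, len(profile) // 2)]
    return max(upper_half) >= fill_ratio * width


# Toy word with a solid top stroke (headline-like).
headline_word = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
]

# Toy word without any dense row near the top.
plain_word = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 1],
]

print(horizontal_projection(headline_word))
print(has_headline(headline_word), has_headline(plain_word))
```

In a fuller pipeline, a feature like this would be one input among several to the word-level language classifier, and it would only be meaningful after skew correction, since a tilted headline spreads across several rows of the profile.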

Index Terms

Computer Science
Information Sciences

Keywords

OCR, Multilingual OCR, MOCR, Classification