International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 117 - Number 7
Year of Publication: 2015
Authors: Chandrahas Gaikwad, Satish Akolkar, Reshma Khodade, Deepali Dalal, Smita S. Pawar |
DOI: 10.5120/20568-2963
Chandrahas Gaikwad, Satish Akolkar, Reshma Khodade, Deepali Dalal, Smita S. Pawar. Machine Learning based Multilingual OCR. International Journal of Computer Applications. 117, 7 (May 2015), 27-31. DOI=10.5120/20568-2963
Paperless business has driven rapid progress in technology, making the storage, processing and retrieval of data effortless. To avoid unintended alterations during these phases, documents are stored as images or in Portable Document Format (PDF). When real-time modifications are needed, however, platform and script dependency create barriers and complications. This project presents a generic way to overcome the problem through machine learning. The input consists of a training character set and a PDF in the same script. The machine learns the unique features of each character in the character set through various classifiers; the PDF is then searched for matching characters, and corresponding profiles are generated. These classifiers distinguish the characters based on the number of ripples in their patterns, the number of regions and other parameters. The two sets of profiles are compared, and an exact match is declared as the result. Through its 'classifier reuse' strategy, this project removes the need to 'start from scratch' when processing a newly encountered script, as conventional software must. It also addresses a social need: situations where the user has the data, but in a format that is tiresome to manipulate. In such cases, the user can simply supply the PDF and its character set as input and obtain a corresponding editable version as output.
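The profile-matching idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define "ripples" or "regions", so this sketch assumes that regions are the 4-connected background areas of a binarised glyph (the outer background plus any enclosed holes) and that ripples can be approximated by the number of background-to-ink transitions along each row. All function names (`count_regions`, `count_transitions`, `profile`, `learn`, `recognise`) and the toy glyphs are hypothetical.

```python
from collections import deque

def count_regions(glyph):
    """Count 4-connected background regions (assumed meaning of 'regions')."""
    rows, cols = len(glyph), len(glyph[0])
    # Pad with a background border so the outer background forms one region.
    padded = [[0] * (cols + 2)]
    padded += [[0] + list(row) + [0] for row in glyph]
    padded += [[0] * (cols + 2)]
    seen = [[False] * (cols + 2) for _ in range(rows + 2)]
    regions = 0
    for r in range(rows + 2):
        for c in range(cols + 2):
            if padded[r][c] == 0 and not seen[r][c]:
                regions += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of this region
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows + 2 and 0 <= nx < cols + 2
                                and padded[ny][nx] == 0 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return regions

def count_transitions(glyph):
    """Count background-to-ink transitions per row (assumed stand-in for 'ripples')."""
    total = 0
    for row in glyph:
        prev = 0
        for pixel in row:
            if pixel == 1 and prev == 0:
                total += 1
            prev = pixel
    return total

def profile(glyph):
    """Feature profile of a glyph: (regions, transitions)."""
    return (count_regions(glyph), count_transitions(glyph))

def learn(character_set):
    """Map each learnt profile to its character label."""
    return {profile(glyph): label for label, glyph in character_set.items()}

def recognise(glyph, learned):
    """Return the label whose profile matches exactly, or None."""
    return learned.get(profile(glyph))

# Toy 1-bit glyphs (1 = ink, 0 = background); purely illustrative.
O = [[1, 1, 1],
     [1, 0, 1],
     [1, 1, 1]]
I = [[0, 1, 0],
     [0, 1, 0],
     [0, 1, 0]]
EIGHT = [[1, 1, 1],
         [1, 0, 1],
         [1, 1, 1],
         [1, 0, 1],
         [1, 1, 1]]

learned = learn({"O": O, "I": I, "8": EIGHT})
print(recognise(EIGHT, learned))
```

Because the features are computed from pixel structure rather than from any script-specific rule, the same two functions could in principle be reused for a new script by supplying a different character set, which is one way to read the 'classifier reuse' strategy. With only two features, distinct characters can share a profile, so a practical system would need the additional parameters the abstract alludes to.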