Research Article

Generic PDF To Text Conversion using Machine Learning

by Chandrahas Gaikwad, Satish Akolkar, Reshma Khodade, Deepali Dalal, Swarupa Kamble
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 106 - Number 12
Year of Publication: 2014
Authors: Chandrahas Gaikwad, Satish Akolkar, Reshma Khodade, Deepali Dalal, Swarupa Kamble
10.5120/18572-9405

Chandrahas Gaikwad, Satish Akolkar, Reshma Khodade, Deepali Dalal, Swarupa Kamble. Generic PDF To Text Conversion using Machine Learning. International Journal of Computer Applications. 106, 12 (November 2014), 17-21. DOI=10.5120/18572-9405

@article{ 10.5120/18572-9405,
author = { Chandrahas Gaikwad and Satish Akolkar and Reshma Khodade and Deepali Dalal and Swarupa Kamble },
title = { Generic PDF To Text Conversion using Machine Learning },
journal = { International Journal of Computer Applications },
issue_date = { November 2014 },
volume = { 106 },
number = { 12 },
month = { November },
year = { 2014 },
issn = { 0975-8887 },
pages = { 17-21 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume106/number12/18572-9405/ },
doi = { 10.5120/18572-9405 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:39:13.262276+05:30
%A Chandrahas Gaikwad
%A Satish Akolkar
%A Reshma Khodade
%A Deepali Dalal
%A Swarupa Kamble
%T Generic PDF To Text Conversion using Machine Learning
%J International Journal of Computer Applications
%@ 0975-8887
%V 106
%N 12
%P 17-21
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The world is moving toward a paperless era. Stockpiling paper logs, charters, records, and other documents has become tedious. Storing them as soft copies is more convenient and reliable, and makes searching and sorting easier. Such documents are generally stored as PDF (Portable Document Format) so that they are easy to view and render consistently across software platforms. However, editing documents written in local scripts remains inconvenient, because conventional PDF-to-text conversion software cannot handle some less widely supported scripts. This paper presents a generic, script-independent way of making PDF documents editable using machine learning. Characters are first sliced out of the PDF, and a set of classifiers is applied to identify each one. A decision model, implemented as part of the machine learning component, organizes the classifier functions, and the resulting classifier set resolves the identity of the character. This approach removes the restriction to international scripts and facilitates the use of regional scripts in the technological world.
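The pipeline outlined in the abstract (slice out characters, apply a set of classifiers, let a decision model resolve the result) can be illustrated with a minimal sketch. The abstract does not specify the segmentation method, the classifiers, or the form of the decision model, so everything here is an assumption: characters are separated at ink-free pixel columns, the three toy classifiers (`by_width`, `by_ink`, `by_bottom_row`) are hypothetical stand-ins, and weighted majority voting takes the place of the paper's decision model.

```python
from collections import Counter
from typing import Callable, List, Sequence

Bitmap = List[List[int]]              # rows of pixels, 1 = ink, 0 = background
Classifier = Callable[[Bitmap], str]  # maps one glyph bitmap to a label


def slice_characters(line: Bitmap) -> List[Bitmap]:
    """Split a rendered text line into glyphs at ink-free columns (assumed method)."""
    if not line:
        return []
    width = len(line[0])
    has_ink = [any(row[x] for row in line) for x in range(width)]
    glyphs: List[Bitmap] = []
    start = None
    # The trailing False acts as a sentinel that closes the last glyph.
    for x, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            glyphs.append([row[start:x] for row in line])
            start = None
    return glyphs


def decide(glyph: Bitmap,
           classifiers: Sequence[Classifier],
           weights: Sequence[float]) -> str:
    """Weighted majority vote, standing in for the paper's decision model."""
    votes: Counter = Counter()
    for classify, weight in zip(classifiers, weights):
        votes[classify(glyph)] += weight
    return votes.most_common(1)[0][0]


# Hypothetical toy classifiers that only distinguish "I" from "L".
def by_width(glyph: Bitmap) -> str:
    return "I" if len(glyph[0]) == 1 else "L"


def by_ink(glyph: Bitmap) -> str:
    return "I" if sum(map(sum, glyph)) <= 3 else "L"


def by_bottom_row(glyph: Bitmap) -> str:
    return "L" if len(glyph[0]) > 1 and all(glyph[-1]) else "I"


def recognize(line: Bitmap) -> str:
    """Slice a line into glyphs and label each one by classifier vote."""
    classifiers = [by_width, by_ink, by_bottom_row]
    weights = [1.0, 1.0, 1.0]  # equal trust in every classifier (assumption)
    return "".join(decide(g, classifiers, weights)
                   for g in slice_characters(line))


# A 3x5 line image containing an "I" and an "L" separated by one blank column.
LINE = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
]

print(recognize(LINE))
```

In a script-independent setting the labels would come from training data for the target script rather than from hand-written rules, and the weights could be learned from each classifier's accuracy on held-out samples.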

Index Terms

Computer Science
Information Sciences

Keywords

Generic, Machine learning, Script-independent