International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 25
Year of Publication: 2024
Authors: Alaa Najmi, Mohamed A. El-Dosuky |
DOI: 10.5120/ijca2024923718
Alaa Najmi, Mohamed A. El-Dosuky. Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents. International Journal of Computer Applications. 186, 25 (Jun 2024), 20-26. DOI=10.5120/ijca2024923718
Optical character recognition (OCR) is a key technique for extracting textual data from various sources, reducing human labor and enhancing accessibility. Named Entity Recognition (NER) organizes and categorizes the extracted data, while regular expression (Regex) pattern matching facilitates the extraction of data from OCR-read text. Together, these techniques reduce the manual effort needed to extract large amounts of confidential and sensitive data, and they improve the accessibility and preservation of that data. This study uses the Tesseract OCR tool and the Marefa-NER model, combining Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Natural Language Processing (NLP) techniques. The technologies have been successfully integrated into websites, where they proved effective at accurately identifying textual content and categorizing it using OCR, NER, and Regex patterns. The combination of OCR, NER, and Regex pattern matching is therefore a successful and efficient method for extracting textual information from various sources, particularly where confidentiality and sensitivity are a concern.
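The Regex stage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pattern names (`email`, `date`, `id_number`), the patterns themselves, the helper `extract_fields`, and the sample text are all hypothetical, chosen only to show how fields might be pulled from OCR-read text. In a full pipeline, the input string would come from an OCR engine such as Tesseract rather than being hard-coded.

```python
import re

# Hypothetical patterns for illustration only; the paper does not list
# the expressions it actually uses.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),   # ISO-style dates
    "id_number": re.compile(r"\b\d{10}\b"),         # assumed 10-digit ID
}


def extract_fields(ocr_text: str) -> dict[str, list[str]]:
    """Return every match of each pattern found in the OCR-read text."""
    return {name: pattern.findall(ocr_text) for name, pattern in PATTERNS.items()}


# Stand-in for text returned by an OCR engine (assumed, not real data).
sample_text = (
    "Employee ID: 1234567890\n"
    "Contact: a.user@example.com\n"
    "Issued on 2024-06-15"
)

fields = extract_fields(sample_text)
for name, values in fields.items():
    print(name, values)
```

Keeping the patterns in a single dictionary means that new field types can be added without touching the extraction function. Entities that do not follow a fixed format, such as personal or organization names, are the part of the task the abstract assigns to the NER model.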