Research Article

Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents

by Alaa Najmi, Mohamed A. El-Dosuky
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 25
Year of Publication: 2024
DOI: 10.5120/ijca2024923718

Alaa Najmi, Mohamed A. El-Dosuky. Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents. International Journal of Computer Applications. 186, 25 (Jun 2024), 20-26. DOI=10.5120/ijca2024923718

@article{ 10.5120/ijca2024923718,
author = { Alaa Najmi and Mohamed A. El-Dosuky },
title = { Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents },
journal = { International Journal of Computer Applications },
issue_date = { Jun 2024 },
volume = { 186 },
number = { 25 },
month = { Jun },
year = { 2024 },
issn = { 0975-8887 },
pages = { 20-26 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number25/optical-character-recognition-and-named-entity-recognition-for-highly-confidential-documents/ },
doi = { 10.5120/ijca2024923718 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-06-27T00:56:46.510688+05:30
%A Alaa Najmi
%A Mohamed A. El-Dosuky
%T Optical Character Recognition and Named Entity Recognition for Highly Confidential Documents
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 25
%P 20-26
%D 2024
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Optical character recognition (OCR) extracts textual data from various sources, reducing human labor and enhancing accessibility. Named entity recognition (NER) organizes and categorizes the extracted data, while regular expression (regex) patterns pull specific fields from the OCR-read text. Together, these techniques reduce the human effort needed to extract large amounts of confidential and sensitive data, and they improve the accessibility and preservation of that data. The study uses the Tesseract OCR tool and the Marefa-NER model, combining artificial neural network (ANN), support vector machine (SVM), and natural language processing (NLP) techniques. The technologies were integrated into websites, where they accurately identified textual content and categorized it using OCR, NER, and regex patterns. The authors conclude that combining OCR, NER, and regex pattern matching is a successful and efficient method for extracting textual information from various sources, particularly in cases of confidentiality and sensitivity.
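The regex stage described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the text has already been produced by an OCR step (for example Tesseract), and the pattern names, the patterns themselves, and the helper functions `extract_fields` and `redact` are hypothetical. The abstract does not list the patterns used in the study.

```python
import re

# Hypothetical patterns for illustration only; the paper does not publish its own.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),   # ISO-style dates
    "national_id": re.compile(r"\b\d{10}\b"),        # assumed 10-digit identifier
}


def extract_fields(ocr_text):
    """Return every match of each pattern found in the OCR-read text."""
    return {name: pattern.findall(ocr_text) for name, pattern in PATTERNS.items()}


def redact(ocr_text):
    """Replace each sensitive match with a bracketed label such as [EMAIL]."""
    for name, pattern in PATTERNS.items():
        ocr_text = pattern.sub(f"[{name.upper()}]", ocr_text)
    return ocr_text


# Stand-in for text returned by an OCR engine.
sample = "Name: Alaa\nEmail: alaa@example.com\nIssued: 2024-06-27\nID: 1234567890"

fields = extract_fields(sample)
masked = redact(sample)
```

In a fuller pipeline, an NER model such as Marefa-NER would handle entities that fixed patterns cannot capture (person, organization, and location names), while regex covers fields with a predictable shape.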

References
  1. Satti, Danish Altaf. "Offline Urdu Nastaliq OCR for printed text using analytical approach." MS thesis report (2013): 141.
  2. Al-Badr, Badr, and Sabri A. Mahmoud. "Survey and bibliography of Arabic optical text recognition." Signal processing 41, no. 1 (1995): 49-77.
  3. Grishman, Ralph, and Beth M. Sundheim. "Message understanding conference-6: A brief history." In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics. 1996.
  4. Lee, S., and G. Lee. "Heuristic methods for reducing errors of geographic named entities learned by bootstrapping." In Proceedings of the International Joint Conference on Natural Language Processing, Jeju Island, Korea, October 11-13, 2005, pp. 658-669. Springer Verlag, 2005.
  5. Liu, Xing, Huiqin Chen, and Wangui Xia. "Overview of named entity recognition." Journal of Contemporary Educational Research 6, no. 5 (2022): 65-68.
  6. Kukreja, Harsh, N. Bharath, C. S. Siddesh, and S. Kuldeep. "An introduction to artificial neural network." Int J Adv Res Innov Ideas Educ 1 (2016): 27-30.
  7. Benítez-Peña, Sandra, Rafael Blanquero, Emilio Carrizosa, and Pepa Ramírez-Cobo. "Cost-sensitive probabilistic predictions for support vector machines." European Journal of Operational Research (2023).
  8. Hannan, Shaikh Abdul, Jameel Ahmed, Naveed Ahmed, and Rizwan Alam Thakur. "Data Mining and Natural Language Processing Methods for Extracting Opinions from Customer Reviews." International Journal of Computational Intelligence and Information Security: 52-58.
  9. Sætre, Rune. "GeneTUC: Natural Language Understanding in Medical Text." (2006).
  10. Zollmann, Andreas, Ashish Venugopal, and Stephan Vogel. "Bridging the inflection morphology gap for Arabic statistical machine translation." In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pp. 201-204. 2006.
  11. Saif, Abdulgabbar Mohammed, and Mohd Juzaiddin Ab Aziz. "An automatic noun compound extraction from Arabic corpus." In 2011 International Conference on Semantic Technology and Information Retrieval, pp. 224-230. IEEE, 2011.
  12. Zouaghi, Anis, Laroussi Merhbene, and Mounir Zrigui. "Combination of information retrieval methods with LESK algorithm for Arabic word sense disambiguation." Artificial Intelligence Review 38, no. 4 (2012): 257-269.
  13. Saif, Abdulgabbar, Mohd Juzaiddin Ab Aziz, and Nazlia Omar. "Evaluating knowledge-based semantic measures on Arabic." International Journal on Communications Antenna and Propagation 4 (2014): 180-194.
  14. Saif, Abdulgabbar, Mohd Juzaiddin Ab Aziz, and Nazlia Omar. "Mapping Arabic WordNet synsets to Wikipedia articles using monolingual and bilingual features." Natural Language Engineering 23, no. 1 (2017): 53-91.
  15. Alshaikhdeeb, Basel, and Kamsuriah Ahmad. "Biomedical named entity recognition: a review." International Journal on Advanced Science, Engineering and Information Technology 6, no. 6 (2016): 889-895.
  16. Awel, Muna Ahmed, and Ali Imam Abidi. "Review on optical character recognition." International Research Journal of Engineering and Technology (IRJET) 6, no. 6 (2019): 3666-3669.
  17. Islam, Noman, Zeeshan Islam, and Nazia Noor. "A survey on optical character recognition system." arXiv preprint arXiv:1710.05703 (2017).
  18. Salah, Ramzi Esmail, and L. Qadri binti Zakaria. "A comparative review of machine learning for Arabic named entity recognition." International Journal on Advanced Science, Engineering and Information Technology 7, no. 2 (2017): 511-518.
  19. Marefa Arabic Named Entity Recognition Model (huggingface.co/marefa-nlp/marefa-ner), Last access 2023/02/08.
Index Terms

Computer Science
Information Sciences
OCR
NER
Regex
Confidential documents
Sensitive data

Keywords

OCR, NER, Regex, Confidential documents, Sensitive data