International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 184 - Number 31
Year of Publication: 2022
Authors: Mohamed Taybe Elhadi
10.5120/ijca2022922371
Mohamed Taybe Elhadi. Arabic Minimal Pairs Word Detection and Disambiguation. International Journal of Computer Applications. 184, 31 (Oct 2022), 11-20. DOI=10.5120/ijca2022922371
This work attempts to solve a common writing pitfall in the Arabic language. The problem involves words that contain letters such as (among many others) ظ THA and ض DHA. Such words are formally minimal pairs (more precisely, near minimal pairs) and near homographs (homophones), so the right term must be determined and the resulting ambiguities resolved. Confusing them is not just embarrassing to authors; in many situations it results in wrong word usage and can consequently lead to ambiguous sentences. Such words and sentences become difficult to interpret, especially for computers in applications such as information retrieval, language translation and summarization. A multi-stage determination process is proposed, comprising feature selection, classifier selection and classification. The process was evaluated on a sample set of terms. Classifier accuracies varied, but overall the accuracies for all terms were reasonably high and close in value. Matthews correlation coefficient (MCC) values also varied, with some reasonably good ranges. Notably, some classifiers did not converge, and in those cases the MCC was set to zero. Considering the results from the classifiers with the highest training rates and those with the highest MCC, the Random Forest algorithm performed best: it achieved high accuracy for most terms, and where it was not the best it was often very close to the highest rate. It also scored the highest mean value across all terms. The results suggest that combining features extracted from a corpus with machine learning classification techniques can solve the problem with high accuracy.
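For readers unfamiliar with the MCC, the following is a minimal sketch of how it is usually computed from the confusion-matrix counts of a binary classifier. It is not taken from the paper: the function name `mcc` and the argument names are illustrative, and returning zero when the denominator vanishes is a common convention that is only assumed here to correspond to the zero values mentioned in the abstract.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives and false negatives (illustrative names).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: the score is undefined when a row or column
        # of the confusion matrix is empty, so report 0.0 instead.
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # A classifier that labels every sample as negative.
    print(mcc(tp=0, tn=90, fp=0, fn=10))
    # A classifier that labels every sample correctly.
    print(mcc(tp=50, tn=50, fp=0, fn=0))
```

Unlike plain accuracy, the MCC stays near zero for a classifier that always predicts the majority class, which is one reason it is often reported alongside accuracy.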