International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 184 - Number 49
Year of Publication: 2023
Authors: Dewi Soyusiawaty, Bella Okta Sari Miranda
DOI: 10.5120/ijca2023922603
Dewi Soyusiawaty, Bella Okta Sari Miranda. Statistical Machine Translation from Indonesian to Regional Languages in Indonesia. International Journal of Computer Applications. 184, 49 (Mar 2023), 18-23. DOI=10.5120/ijca2023922603
Indonesia currently has 617 regional languages. Fifteen regional languages have been declared extinct and 139 others are endangered. Computer-based tools, including digital dictionaries and machine translation systems, can help preserve regional languages digitally in line with current technological developments. A digital dictionary can translate regional languages into Indonesian word for word, but this approach is not effective when done manually. An alternative is to build a machine translation application. Machine translation can be dictionary-based or based on parallel corpus data. Statistical Machine Translation (SMT) is a machine translation approach in which translations are generated by a statistical model whose parameters are estimated from the analysis of a parallel corpus. Several factors influence the quality of SMT output; the most fundamental are the size of the available parallel corpus and the quality of the corpus used to build the translation model and the language model. This study aims to determine the role of the parallel corpus in improving SMT accuracy, particularly for regional languages in Indonesia. The research data is a parallel corpus of 3,000 sentence pairs. The results show that optimizing the parallel corpus increases translation accuracy. In addition, testing with single sentences yields higher accuracy than testing with compound sentences. With 3,000 sentence pairs, testing on the optimized parallel corpus increased accuracy by 11.4% compared with testing on 3,000 random parallel corpus pairs.
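To illustrate what it means for SMT parameters to be "taken from a parallel corpus", the sketch below estimates word-translation probabilities with the classic IBM Model 1 expectation-maximization procedure. It is a minimal illustration and not the system used in the study: the three Indonesian-Javanese sentence pairs are toy examples chosen for this sketch (they are not drawn from the study's corpus), and the function names `train_ibm_model1` and `best_translation` are hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) from sentence pairs using EM."""
    tgt_vocab = {tw for _, tgt in pairs for tw in tgt}
    # Start from a uniform value for every word pair that co-occurs in a sentence pair.
    t = {
        (tw, sw): 1.0 / len(tgt_vocab)
        for src, tgt in pairs
        for tw in tgt
        for sw in src
    }
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) alignments
        total = defaultdict(float)  # expected count of each source word
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


# Toy parallel corpus (illustrative assumption): Indonesian -> Javanese.
toy_corpus = [
    (["saya", "makan"], ["aku", "mangan"]),
    (["saya", "tidur"], ["aku", "turu"]),
    (["dia", "makan"], ["dheweke", "mangan"]),
]

table = train_ibm_model1(toy_corpus)
for word in ["saya", "makan", "tidur", "dia"]:
    print(word, "->", best_translation(table, word))
```

Because every probability here comes from co-occurrence counts, a small or noisy corpus yields unreliable estimates, which is consistent with the abstract's point that corpus size and quality are the most fundamental factors for SMT accuracy.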