International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 58 - Number 17
Year of Publication: 2012
Authors: Ohnmar Htun, Andrew Finch, Eiichiro Sumita, Yoshiki Mikami
DOI: 10.5120/9373-3821
Ohnmar Htun, Andrew Finch, Eiichiro Sumita, Yoshiki Mikami. Improving Transliteration Mining by Integrating Expert Knowledge with Statistical Approaches. International Journal of Computer Applications. 58, 17 (November 2012), 12-22. DOI=10.5120/9373-3821
This paper contributes a study of methods for integrating human expert knowledge with machine learning approaches for determining the phonetic similarity of word pairs. A method is proposed in which a human provides a structure for the edit costs, based on a phonetically motivated model of phoneme sound groups, and the machine determines precise values for these costs within two different frameworks based on stochastic edit distance: a one-to-one expectation maximization (EM) alignment method and a Bayesian many-to-many alignment approach. A preliminary study was conducted within the context of cross-language word similarity in transliteration mining. The experiments were performed on a Myanmar-English mining task; the principal approach is expected to be most useful for low-resource language pairs, where human expert knowledge can compensate for a lack of data resources. The results show that the approach outperforms baseline systems based only on human knowledge and only on machine learning. The experiments also showed that the choice of edit costs is a strong factor in determining the performance of the edit-distance-based techniques used. The learned edit costs consistently outperformed a simple set of plausible costs selected by a human expert. Furthermore, providing a structure for the weights in the machine learning process reduced the number of parameters to be learned, simplifying and speeding up the learning task. This method is expected to mitigate issues with data sparseness when learning models for low-resource languages. The reduction in the number of model parameters led to improvements in recall in these experiments, even though the model was considerably smaller, validating the choice of structure.
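The core idea of structuring edit costs around sound groups can be illustrated with a minimal sketch. The sound groups, cost values, and function names below are illustrative assumptions, not the paper's actual model: instead of learning one substitution cost per character pair, characters are mapped to phonetically motivated groups and a single cost is shared per group pair, which is the parameter reduction the abstract describes.

```python
# Hedged sketch: weighted edit distance with group-structured substitution
# costs. Groups and cost values are illustrative placeholders only; in the
# paper these costs would be learned (e.g., via EM) within the structure
# supplied by a human expert.

# Illustrative sound groups (not the paper's actual groupings).
SOUND_GROUP = {
    "p": "stop", "b": "stop", "t": "stop", "d": "stop",
    "k": "stop", "g": "stop",
    "a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel", "u": "vowel",
    "m": "nasal", "n": "nasal",
}

# One cost per (group, group) pair rather than per character pair,
# shrinking the parameter space. Values here are plausible placeholders.
GROUP_SUB_COST = {
    ("stop", "stop"): 0.3,
    ("vowel", "vowel"): 0.2,
    ("nasal", "nasal"): 0.2,
}
INS_DEL_COST = 1.0  # flat insertion/deletion cost for this sketch


def sub_cost(a, b):
    """Substitution cost looked up via sound-group membership."""
    if a == b:
        return 0.0
    pair = (SOUND_GROUP.get(a), SOUND_GROUP.get(b))
    return GROUP_SUB_COST.get(pair, 1.0)


def edit_distance(s, t):
    """Standard dynamic-programming edit distance with weighted costs."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + INS_DEL_COST
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + INS_DEL_COST
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + INS_DEL_COST,                      # deletion
                d[i][j - 1] + INS_DEL_COST,                      # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution
            )
    return d[m][n]


# Substitutions within a sound group are cheap, so phonetically close
# strings score as similar: "pat" vs "bad" differs only by stop-for-stop
# substitutions (p->b, t->d), costing 0.3 + 0.3 = 0.6.
print(edit_distance("pat", "bad"))
```

A learned version would replace the hand-set `GROUP_SUB_COST` values with costs estimated from aligned transliteration pairs, keeping the same group structure so far fewer parameters need to be estimated from scarce data.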