Prior to the advent of the current state of the art, Neural Machine Translation (NMT), Statistical Machine Translation (SMT), and in particular phrase-based SMT, was the most widely used approach to machine translation. Phrase-based SMT has continued to be shown to outperform NMT in most cases when translating low-resource languages such as the Nguni languages isiXhosa and isiZulu. In this project, phrase-based SMT systems trained on data augmented using different techniques were compared to a baseline phrase-based SMT system, with the goal of determining whether SMT translation quality in the low-resource setting can be improved using augmented data.
Specifically, models trained on data augmented with multilingual data and with back-translated data were compared to a baseline SMT system.
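To illustrate the back-translation technique: monolingual target-side text is translated back into the source language with a reverse-direction model, and the synthetic source sentences are paired with the original target sentences to form additional training pairs. Below is a minimal sketch assuming an English→isiZulu direction, a previously trained reverse (isiZulu→English) Moses model, and the Moses decoder on the PATH; all file paths are hypothetical:

```python
import subprocess

# Hypothetical inputs: monolingual isiZulu text and a reverse (zu->en) Moses model.
MONO_ZU = "mono.zu"
REVERSE_INI = "zu-en-model/moses.ini"

# Decode the monolingual target-side text into English with the
# reverse-direction Moses decoder, producing synthetic source sentences.
with open(MONO_ZU) as src, open("synthetic.en", "w") as out:
    subprocess.run(["moses", "-f", REVERSE_INI], stdin=src, stdout=out, check=True)

# Pair each synthetic English sentence with its original isiZulu sentence
# and append the pairs to the augmented parallel corpus.
with open("synthetic.en") as en, open(MONO_ZU) as zu, \
        open("augmented.en", "a") as out_en, open("augmented.zu", "a") as out_zu:
    for e, z in zip(en, zu):
        out_en.write(e)
        out_zu.write(z)
```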
Data was normalised using a Python script that lowercased all characters in the corpora. Following normalisation, all data was tokenised using Byte-Pair Encoding (BPE) to account for the agglutinative nature of the Nguni languages. The corpora were then cleaned to limit maximum sentence length to 80 characters. Finally, 1000 sentences were extracted from the training data for each language for use as tuning data.
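A minimal sketch of this preprocessing pipeline, using the subword-nmt package for BPE; the file names, the merge count of 10000, and the choice of which 1000 sentences to hold out are assumptions for illustration, not values reported here:

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Normalisation: lowercase every character in the raw corpus.
with open("train.raw.zu") as fin, open("train.lc.zu", "w") as fout:
    for line in fin:
        fout.write(line.lower())

# Learn a BPE model on the lowercased text (the merge count is assumed).
with open("train.lc.zu") as fin, open("bpe.codes.zu", "w") as fout:
    learn_bpe(fin, fout, num_symbols=10000)

# Apply the learned merges, splitting words into subword units to cope
# with the agglutinative morphology of the Nguni languages.
with open("bpe.codes.zu") as codes:
    bpe = BPE(codes)
with open("train.lc.zu") as fin, open("train.bpe.zu", "w") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))

# Hold out 1000 sentences as tuning data (taking the first 1000 here is
# an arbitrary choice; these would be removed from the training set).
with open("train.bpe.zu") as fin, open("tune.zu", "w") as ftune:
    for i, line in enumerate(fin):
        if i >= 1000:
            break
        ftune.write(line)
```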
SMT models were built using the Moses SMT toolkit and the KenLM language-modelling toolkit. A 6-gram language model with modified Kneser-Ney smoothing was used for all systems. GIZA++ was used for word alignment, and tuning was conducted for all three types of models using MERT, a line-search-based method.
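This pipeline can be driven from Python by shelling out to the Moses and KenLM tools. The sketch below assumes a standard Moses build at a placeholder location, with GIZA++ binaries in an external tools directory; exact paths and file names are assumptions:

```python
import os
import subprocess

# Assumed location of the Moses installation (with KenLM's lmplz built in bin/).
MOSES = "/opt/mosesdecoder"

# Build a 6-gram KenLM language model; lmplz applies modified
# Kneser-Ney smoothing by default.
with open("train.bpe.zu") as fin, open("lm.zu.arpa", "w") as fout:
    subprocess.run([f"{MOSES}/bin/lmplz", "-o", "6"],
                   stdin=fin, stdout=fout, check=True)

# Train the phrase-based translation model; train-model.perl runs
# GIZA++ word alignment via the external tools directory.
subprocess.run([
    f"{MOSES}/scripts/training/train-model.perl",
    "-root-dir", "work",
    "-corpus", "train.bpe", "-f", "en", "-e", "zu",
    "-lm", f"0:6:{os.path.abspath('lm.zu.arpa')}",
    "-external-bin-dir", f"{MOSES}/tools",
], check=True)

# Tune the feature weights with MERT on the held-out tuning set.
subprocess.run([
    f"{MOSES}/scripts/training/mert-moses.pl",
    "tune.en", "tune.zu",
    f"{MOSES}/bin/moses", "work/model/moses.ini",
    "--mertdir", f"{MOSES}/bin",
], check=True)
```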
The SMT systems trained on back-translated data yielded the best performance in both the isiZulu and isiXhosa contexts. However, in the isiZulu context the baseline system outperformed the multilingual system, whereas the opposite was true in the isiXhosa context. Shown in the figure below are BLEU scores for each system in each target language. Under the BLEU metric, the closer the score is to 100, the higher the quality of the translation.
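As an illustration of the metric (not necessarily the scorer used in this project), corpus-level BLEU can be computed with the sacrebleu package; the file names are placeholders:

```python
import sacrebleu

# System output and the corresponding reference translations, one sentence per line.
with open("output.zu") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.zu") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypothesis strings and a list of
# reference streams (one stream per reference set).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")  # 0-100; higher means closer to the references
```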
Notably, there were twice as many parallel sentences for isiXhosa as for isiZulu, which likely explains the overall higher BLEU scores observed in the isiXhosa context compared to the isiZulu context.
Using the back-translation technique for data augmentation, in particular, yields translations of higher quality than those of a baseline system trained on non-augmented data. Multilingual data augmentation, however, shows inconsistent results, which may indicate a need to find an optimal ratio of original parallel corpora to alternate-language parallel corpora in order to yield improvements over the baseline.