Phrase-Based Statistical Machine Translation (SMT)

by Fezeka Nzama

Prior to the advent of the current state of the art, Neural Machine Translation (NMT), Statistical Machine Translation (SMT), and phrase-based SMT in particular, was the most widely used approach to machine translation. Phrase-based SMT has repeatedly been shown to outperform NMT when translating low-resource languages such as the Nguni languages isiXhosa and isiZulu. In this project, phrase-based SMT systems trained on data augmented using different techniques were compared to a baseline phrase-based SMT system, with the goal of determining whether SMT translation quality in the low-resource setting can be improved using augmented data.

Models


Models trained on data augmented with multilingual data and back-translated data were compared to a baseline SMT system.

  • Baseline: SMT systems were trained using parallel data sourced from various online databases. A baseline system was constructed for both English-to-isiZulu and English-to-isiXhosa translation.
  • Back-translation: isiZulu-to-English and isiXhosa-to-English SMT systems were constructed first. Using these reverse systems, existing isiZulu and isiXhosa monolingual data was translated into English, creating synthetic parallel datasets that map the Nguni monolingual data to its English translations. This synthetic parallel data was then combined with the existing parallel corpora, and new English-to-isiZulu and English-to-isiXhosa SMT systems were trained on the combined data sets; a sketch of this data assembly follows this list.
  • Multilingual: The isiZulu and isiXhosa parallel data were combined into a single multilingual training data set. A multilingual SMT system was constructed for both English-to-isiZulu and English-to-isiXhosa translation.
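To make the back-translation assembly concrete, the minimal sketch below shows how the synthetic and original corpora might be combined for the English-to-isiZulu direction. All file names are hypothetical; it assumes a reverse isiZulu-to-English system has already translated the monolingual file mono.zu into synthetic.en, line by line.

    # Assemble the combined training set from the original parallel data
    # plus the synthetic back-translated pairs. File names are hypothetical.

    def concat(paths, out_path):
        """Concatenate line-aligned corpora into a single file."""
        with open(out_path, "w", encoding="utf-8") as out:
            for path in paths:
                with open(path, encoding="utf-8") as f:
                    for line in f:
                        out.write(line)

    # English side: original English plus machine-translated (synthetic) English.
    concat(["parallel.en", "synthetic.en"], "combined.en")
    # isiZulu side: original isiZulu plus the monolingual isiZulu source sentences.
    concat(["parallel.zu", "mono.zu"], "combined.zu")

Because the two concatenations preserve order, line i of combined.en remains aligned with line i of combined.zu.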

Data Pre-processing


Data was normalised using a Python script, lowercasing all characters in the corpora. Following normalisation, all data was tokenised using Byte-Pair Encoding (BPE) to account for the agglutinative nature of the Nguni languages. The corpora were then cleaned to limit the maximum sentence length in the models to 80 characters. Finally, 1,000 sentences from the training data for each language were extracted for use as tuning data.
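A minimal sketch of this pre-processing pipeline is given below. The report does not name the BPE implementation, so the subword-nmt package, the 10,000 merge operations, and the file names are all assumptions.

    import codecs
    from subword_nmt.learn_bpe import learn_bpe
    from subword_nmt.apply_bpe import BPE

    # Normalisation: lowercase every character in the corpus
    # (file names are hypothetical).
    with open("combined.zu", encoding="utf-8") as f_in, \
         open("combined.lc.zu", "w", encoding="utf-8") as f_out:
        for line in f_in:
            f_out.write(line.lower())

    # Learn BPE merge operations on the normalised text
    # (10,000 merges is an assumed value, not taken from the report).
    with codecs.open("combined.lc.zu", encoding="utf-8") as f_in, \
         codecs.open("bpe.codes.zu", "w", encoding="utf-8") as f_out:
        learn_bpe(f_in, f_out, num_symbols=10000)

    # Apply the learned merges to produce the subword-segmented corpus.
    with codecs.open("bpe.codes.zu", encoding="utf-8") as codes_file:
        bpe = BPE(codes_file)
    with open("combined.lc.zu", encoding="utf-8") as f_in, \
         open("combined.bpe.zu", "w", encoding="utf-8") as f_out:
        for line in f_in:
            f_out.write(bpe.process_line(line))

Segmenting rare agglutinated word forms into frequent subword units in this way reduces vocabulary sparsity, which is the motivation for using BPE with Nguni languages.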


Model Construction


SMT models were built using the Moses SMT toolkit and the KenLM language-modelling toolkit. A 6-gram language model with modified Kneser-Ney smoothing was used for all systems. GIZA++ was used for word alignment, and all three types of model were tuned using MERT, a line-search-based method.
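The sketch below drives the standard Moses and KenLM command-line tools from Python for the English-to-isiZulu system. The installation path and data file names are assumptions; the tool names and flags follow the standard Moses baseline recipe, and KenLM's lmplz applies modified Kneser-Ney smoothing by default.

    import os
    import subprocess

    MOSES = "/opt/mosesdecoder"  # assumed installation path

    # 1. Train the 6-gram language model on the target-side (isiZulu) text.
    with open("train.bpe.zu") as lm_in, open("lm.arpa.zu", "w") as lm_out:
        subprocess.run([f"{MOSES}/bin/lmplz", "-o", "6"],
                       stdin=lm_in, stdout=lm_out, check=True)
    subprocess.run([f"{MOSES}/bin/build_binary", "lm.arpa.zu", "lm.blm.zu"],
                   check=True)

    # 2. Train the phrase-based translation model; train-model.perl invokes
    #    GIZA++ internally for word alignment.
    subprocess.run([
        f"{MOSES}/scripts/training/train-model.perl",
        "-root-dir", "train",
        "-corpus", "train.bpe", "-f", "en", "-e", "zu",
        "-alignment", "grow-diag-final-and",
        "-reordering", "msd-bidirectional-fe",
        "-lm", f"0:6:{os.path.abspath('lm.blm.zu')}:8",
        "-external-bin-dir", f"{MOSES}/tools",
    ], check=True)

    # 3. Tune the feature weights on the 1,000-sentence tuning set with MERT.
    subprocess.run([
        f"{MOSES}/scripts/training/mert-moses.pl",
        "tune.bpe.en", "tune.bpe.zu",
        f"{MOSES}/bin/moses", "train/model/moses.ini",
        "--mertdir", f"{MOSES}/bin",
    ], check=True)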


Results


The SMT systems trained on back-translated data yielded the best performance in both the isiZulu and isiXhosa contexts. However, in the isiZulu context the baseline system outperformed the multilingual system, whereas the opposite was true in the isiXhosa context. The figure below shows BLEU scores for each system in each target language; under the BLEU metric, the closer a score is to 100, the higher the quality of the translation.


[Figure: BLEU scores for each SMT system, per target language]

Notably, there were twice as many parallel sentences for isiXhosa as for isiZulu, which likely explains the overall higher BLEU scores observed in the isiXhosa context compared to the isiZulu context.
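For reference, scores of this kind can be computed with a BLEU library such as sacrebleu (an assumption; the report does not name its scoring tool), provided the BPE segmentation is undone on the system output first:

    import sacrebleu

    # Toy, hypothetical example: one hypothesis scored against one reference.
    # A real evaluation would read de-BPE'd system output and the held-out
    # reference translations from files.
    hypotheses = ["umfana uyadlala ebaleni"]
    references = [["umfana udlala ebaleni"]]  # a single reference stream

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.2f}")  # scale of 0 to 100; higher is better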



Conclusions


Using the back-translation technique for data augmentation yields translations of higher quality than those of a baseline system trained on non-augmented data. Multilingual data augmentation, however, shows inconsistent results, which may indicate a need to find an optimal ratio of original parallel corpora to alternate-language parallel corpora in order to improve on the baseline.