Machine translation (MT) refers to the use of computers to automatically translate text from one language to another. The field has advanced rapidly in recent years, aided by the growth of the internet and the advent of Neural Machine Translation. However, translation for the Nguni languages has not seen the same rise in quality, largely because of the limited amount of training data available for these languages. Our project therefore compares data augmentation techniques for Nguni-language machine translation, using both Statistical and Neural Machine Translation: we compare MT models trained on data augmented with synthetic sentences generated via back-translation, MT models trained on multilingual data, and baseline MT models trained on English-to-isiZulu and English-to-isiXhosa corpora.
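To make the back-translation idea concrete, the sketch below pairs authentic monolingual isiZulu sentences with machine-generated English produced by a reverse (isiZulu-to-English) model. This is a rough illustration rather than the project's actual pipeline, and the `zu_to_en` translation function is a hypothetical stand-in for whatever reverse-direction MT system performs that step.

```python
from typing import Callable, List, Tuple


def back_translate(zu_monolingual: List[str],
                   zu_to_en: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Create synthetic English->isiZulu training pairs via back-translation.

    zu_monolingual: authentic isiZulu sentences with no English side.
    zu_to_en: a reverse-direction translation function (hypothetical stand-in
              for a trained isiZulu-to-English model).

    Returns (synthetic English source, authentic isiZulu target) pairs that
    can be mixed with the real parallel corpus when training English->isiZulu.
    """
    return [(zu_to_en(zu_sentence), zu_sentence) for zu_sentence in zu_monolingual]
```

The target side of each synthetic pair is genuine isiZulu, so the augmented data can improve the fluency of the English-to-isiZulu model even when the machine-generated English side is imperfect.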
Machine translation is a prominent sub-field of Computational Linguistics. Its main purpose is to automatically translate text from one language to another using computers. While many variants of machine translation models have been developed over the years, Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) are the most dominant. SMT was the state of the art in machine translation for decades, but it has since been outperformed by NMT, which shows a clear improvement in translation performance over traditional methods. Nonetheless, NMT requires far more training data, and it therefore underperforms when data is limited, as is the case for low-resource languages.
We used the Bilingual Evaluation Understudy (BLEU) metric, proposed by Papineni et al., to evaluate the translation performance of our models. BLEU is fast, inexpensive, provides an objective measure, and correlates strongly with human evaluation.
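The report does not name a specific BLEU implementation, so the snippet below is a minimal sketch using the sacreBLEU library as one common choice; the file paths are placeholders. It computes corpus-level BLEU from a file of model outputs and four reference files, matching the four-translator setup of the Autshumato evaluation set described later.

```python
# Corpus-level BLEU with sacreBLEU (one possible implementation; paths are placeholders).
import sacrebleu

# One hypothesis sentence per line, produced by the MT model under evaluation.
with open("hypotheses.zu", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

# One reference stream per human translator, each aligned line-by-line with the hypotheses.
references = []
for path in ["ref1.zu", "ref2.zu", "ref3.zu", "ref4.zu"]:
    with open(path, encoding="utf-8") as f:
        references.append([line.strip() for line in f])

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```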
Monolingual corpora of isiZulu and isiXhosa sentences, and bilingual corpora of aligned translated sentences stored in separate text files, were retrieved from publicly available sources. These sources include the South African Centre for Digital Language Resources (SADiLaR), from which we obtained aligned parallel corpora for English to isiXhosa and English to isiZulu, as well as a monolingual corpus of isiXhosa sentences. Further English-to-isiXhosa and English-to-isiZulu parallel corpora were extracted from the JW300 parallel corpus, retrieved from the OPUS corpus website. JW300 covers over 300 languages and is distributed as XML files, which were converted into plain text using OpusTools. Additional parallel corpora of English-to-isiXhosa translations were retrieved from an online repository released by the Medical Machine Translation (MeMaT) project. These datasets were combined into a single corpus, as sketched after the table below. In addition, a subset of the multilingual C4 dataset containing monolingual isiXhosa and isiZulu text was used. To evaluate our models we used the Autshumato Machine Translation Evaluation Set, which consists of 500 sentences for every official South African language, each translated separately by four different professional human translators. The table below summarises the datasets used in this project.
| Dataset | IsiXhosa sentences | IsiZulu sentences |
|---|---|---|
| SADiLaR (parallel) | 126708 | 35489 |
| SADiLaR (monolingual) | 233192 | - |
| JW300 (parallel) | 866748 | 1046572 |
| C4 (monolingual) | 597242 (subset) | 623981 (subset) |
| MeMaT (parallel) | 446065 (combined) | - |
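The merging step mentioned above (combining the separate English-isiXhosa corpora into a single training corpus) can be done with a short script; the sketch below assumes Moses-style alignment, i.e. each corpus is a pair of plain-text files with matching line numbers, and the file names are placeholders rather than the project's actual paths.

```python
from pathlib import Path

# Placeholder file pairs: each corpus is two aligned plain-text files, one English
# sentence per line and the matching isiXhosa sentence on the same line of the
# second file (Moses-style alignment).
corpora = [
    ("sadilar.en", "sadilar.xh"),
    ("jw300.en", "jw300.xh"),
    ("memat.en", "memat.xh"),
]

with open("combined.en", "w", encoding="utf-8") as out_en, \
     open("combined.xh", "w", encoding="utf-8") as out_xh:
    for en_path, xh_path in corpora:
        en_lines = Path(en_path).read_text(encoding="utf-8").splitlines()
        xh_lines = Path(xh_path).read_text(encoding="utf-8").splitlines()
        # Alignment check: both sides must contain the same number of sentences.
        assert len(en_lines) == len(xh_lines), f"misaligned corpus: {en_path}"
        for en_sentence, xh_sentence in zip(en_lines, xh_lines):
            out_en.write(en_sentence.strip() + "\n")
            out_xh.write(xh_sentence.strip() + "\n")
```

Concatenating the corpora in the same order on both sides preserves the sentence-level alignment that the SMT and NMT training pipelines rely on.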