Transformers

Research Question

Can the performance of transformer language models on low-resource South African languages be improved by utilising training data from multiple languages?

Training Data

South African Bantu languages can be grouped into four broad families: Nguni, Sotho-Tswana, Tswa-Ronga, and Venda. We chose isiZulu and Sepedi as representative languages of the Nguni and Sotho-Tswana families respectively, since these families have the largest quantities of available training data. We used data from the National Centre for Human Language Technology (NCHLT) text project.
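As a concrete illustration of how target and auxiliary corpora can be combined, the sketch below groups the languages by family and concatenates their plain-text corpora. The language codes follow ISO 639-3, but the directory layout and file names are hypothetical rather than the actual NCHLT release structure.

```python
from pathlib import Path

# Language codes follow ISO 639-3; grouping by family as described above.
FAMILIES = {
    "nguni": ["zul", "xho", "nbl", "ssw"],   # isiZulu, isiXhosa, isiNdebele, Siswati
    "sotho-tswana": ["nso", "sot", "tsn"],   # Sepedi, Sesotho, Setswana
    "tswa-ronga": ["tso"],                   # Xitsonga
    "venda": ["ven"],                        # Tshivenda
}

def load_corpus(data_dir, languages):
    """Concatenate plain-text corpora for the given language codes.
    The one-file-per-language layout is hypothetical."""
    texts = []
    for code in languages:
        texts.append((Path(data_dir) / f"{code}.txt").read_text(encoding="utf-8"))
    return "\n".join(texts)

# e.g. isiZulu as the target language plus the rest of the Nguni family as auxiliary data.
auxiliary = [c for c in FAMILIES["nguni"] if c != "zul"]
train_text = load_corpus("data/nchlt", ["zul"] + auxiliary)
```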

Multilingual GPT-2

Auxiliary Family Results

For both isiZulu and Sepedi, we trained GPT-2 language models with varying combinations of auxiliary data added to the training data. Auxiliary data refers to data from languages other than the target language. In the figure on the right, the "monolingual" model refers to results obtained when no auxiliary data is added, while the "all languages" models are trained on all South African Bantu languages. In all results shown on the right, lower bits-per-character (BPC) scores indicate better performance. For both isiZulu and Sepedi, the models trained on all languages performed best.
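For reference, BPC can be computed from the summed cross-entropy of the model over an evaluation text. The snippet below is a minimal sketch of that conversion, assuming the loss is reported in nats (as most frameworks do); the token and character counts are made-up illustrative numbers.

```python
import math

def bits_per_character(total_nll_nats, num_characters):
    """Convert a summed negative log-likelihood over the evaluation text
    (in nats, as most frameworks report cross-entropy) into bits-per-character."""
    return total_nll_nats / (math.log(2) * num_characters)

# Illustrative numbers: 2.1 nats per subword token over 50,000 tokens,
# covering 210,000 characters of evaluation text.
print(round(bits_per_character(2.1 * 50_000, 210_000), 3))  # -> 0.721
```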

Dataset Size Results

An alternative visualisation of the same data is shown on the left. It illustrates that adding auxiliary training data from other Bantu languages tends to improve performance regardless of which languages are added.

Language-Specific Weights

Language-Specific Attention Results

Another modification we evaluated was the use of language-specific or language-family-specific attention layers. In these experiments we made the bottom n attention layers of the model language-specific, varying n from 1 to 8 (all attention layers); the resulting bits-per-character (BPC) scores are shown on the right, where lower is better. The results show that using language-specific attention layers does not improve performance. The three lines in the graph correspond to three different combinations and groupings of training data.
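The sketch below shows one way such a language-specific attention layer could be implemented in PyTorch: the self-attention weights are selected from a per-language dictionary while the feed-forward sub-layer remains shared. The hyperparameters and block layout are illustrative rather than the exact GPT-2 configuration used in our experiments.

```python
import torch
import torch.nn as nn

class LanguageSpecificBlock(nn.Module):
    """Transformer block whose self-attention weights are chosen per language,
    while the feed-forward sub-layer is shared across languages."""

    def __init__(self, d_model, n_heads, languages):
        super().__init__()
        self.attn = nn.ModuleDict({
            lang: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for lang in languages
        })
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, lang, attn_mask=None):
        h = self.ln1(x)
        a, _ = self.attn[lang](h, h, h, attn_mask=attn_mask)  # language-specific attention
        x = x + a
        return x + self.ff(self.ln2(x))                       # shared feed-forward

# A full model would stack n such blocks at the bottom (language-specific)
# followed by ordinary shared blocks, with n varied from 1 to 8.
```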

Soft-Decoupled Encodings

SDE Results

Finally, we evaluated Soft-Decoupled Encodings, a more sophisticated method for including implicitly learned language-specific representations in a multilingual model. In the existing literature, this method has been used to improve neural machine translation performance for low-resource languages. In the figure on the right, "Control" shows the performance of a multilingual model containing no language-specific representations. The other bars show the performance of models with different combinations of the components of Soft-Decoupled Encodings included. The results show that the use of these language-specific representations does not improve performance.
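For reference, the sketch below outlines the embedding layer that Soft-Decoupled Encodings substitutes for a standard lookup table: a shared lexical embedding of character n-gram counts, a language-specific transformation, and an attention query over a shared latent semantic matrix. The bag-of-n-grams featuriser and all dimensions here are assumptions for illustration, not the exact configuration we evaluated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftDecoupledEmbedding(nn.Module):
    """Sketch of a Soft-Decoupled Encoding embedding layer."""

    def __init__(self, n_ngram_features, d_model, n_latent, languages):
        super().__init__()
        # 1. Shared lexical embedding of character n-gram counts.
        self.lexical = nn.Linear(n_ngram_features, d_model, bias=False)
        # 2. Language-specific transformation.
        self.lang_transform = nn.ModuleDict(
            {lang: nn.Linear(d_model, d_model, bias=False) for lang in languages}
        )
        # 3. Shared latent semantic embedding matrix, queried by attention.
        self.latent = nn.Parameter(torch.randn(n_latent, d_model) * 0.02)

    def forward(self, ngram_counts, lang):
        # ngram_counts: (batch, seq, n_ngram_features) bag of character n-grams per word.
        c_lex = torch.tanh(self.lexical(ngram_counts))
        c_lang = torch.tanh(self.lang_transform[lang](c_lex))
        attn = F.softmax(c_lang @ self.latent.T, dim=-1)  # query the shared semantic space
        e_latent = attn @ self.latent
        return e_latent + c_lang                          # combine via a residual connection
```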

Conclusions

We conclude that the performance of transformer language models on low-resource South African languages can be improved by utilising training data from multiple languages. However, the implicitly learned language-specific representations we evaluated did not yield performance improvements.

Future work may seek further performance improvements by using more training data from a broader group of related languages. Another potential direction is investigating the use of explicit rather than implicitly-learned language-specific representations in multilingual models, such as cross-lingual word embeddings.