Language Models for Low Resource South African Languages
Given a sequence of context words, a language model predicts the next word in the sentence. More formally, a language model assigns a probability to a sequence of words. Modern language models are trained on large datasets; however, many of South Africa's languages are low-resource, meaning little text data is available for training language models. We evaluate different language models and training methods for modelling South African languages.
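To make the definition concrete, a language model scores a sentence by the chain rule: the probability of the sequence is the product of each word's conditional probability given its context. The sketch below illustrates this with a toy bigram model; the probability table and token names are hypothetical illustration data, not drawn from any corpus used in this work.

```python
# Hypothetical bigram probabilities P(w_i | w_{i-1}) for illustration only.
BIGRAM_PROBS = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.4,
}

def sentence_probability(words, probs, unseen=1e-6):
    """Chain rule for a bigram model: P(w_1..w_n) = prod_i P(w_i | w_{i-1}).

    Unseen bigrams get a small floor probability, a crude stand-in for
    the smoothing a real low-resource language model would need.
    """
    p = 1.0
    prev = "<s>"  # start-of-sentence token
    for word in words:
        p *= probs.get((prev, word), unseen)
        prev = word
    return p

# 0.5 * 0.2 * 0.4, i.e. roughly 0.04 for this toy sentence.
print(sentence_probability(["the", "cat", "sat"], BIGRAM_PROBS))
```

In a low-resource setting the difficulty is precisely that these conditional probabilities must be estimated from very little text, which motivates the choice of models and training methods evaluated here.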
Data Acquisition
The Sotho-Tswana and Nguni language families are two of the largest language families in South Africa. Sepedi and isiZulu were therefore chosen as representative languages of these families respectively, as they have the largest available datasets of comparable size among the candidate languages. The datasets were collected from three separate sources:
| Dataset | Description |
|---|---|
| NCHLT | The NCHLT dataset, provided by the South African Centre for Digital Language Resources (SADiLaR), contains monolingual corpora for all 11 of South Africa's official languages. A significant portion of the text is scraped from government websites. |
| Isolezwe | The Newstools Isolezwe corpus provides a repository of news articles published in isiZulu. |
| Autshumato | The Autshumato dataset is approximately one third the size of the other two datasets by total sentence count. It is drawn from the South African government domain and presented as parallel corpora between English and a number of other official South African languages. |