LowLM

Language Modelling for Low Resource South African Languages

LowLM is a UCT Honours project created by Jared Shapiro, Luc Hayward, and Stuart Mesham.

The project was supervised by Dr. Jan Buys.


[Figure: language model illustration]

Given a sequence of context words, a language model predicts the next word in the sentence. More formally, a language model assigns a probability to a sequence of words. Modern language models are trained on large datasets; however, many of South Africa's languages are low resource, meaning there is little text data available for training language models. We evaluate different language models and training methods for modelling South African languages.
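To make "assigns a probability to a sequence of words" concrete, here is a minimal sketch of a bigram language model scoring a sentence via the chain rule. The toy corpus and all names are illustrative assumptions, not the project's actual data or models.

```python
# Minimal sketch: a bigram language model estimated from a toy corpus.
# All data and names here are illustrative, not from the LowLM project.
from collections import Counter

corpus = [
    "<s> the cat sat </s>".split(),
    "<s> the dog sat </s>".split(),
    "<s> the cat ran </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def sentence_probability(sentence):
    """Chain rule under a first-order Markov (bigram) assumption:
    P(w1..wn) is approximated as the product of P(w_i | w_{i-1})."""
    p = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        p *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return p

# P(the|<s>) * P(cat|the) * P(sat|cat) * P(</s>|sat) = 1 * 2/3 * 1/2 * 1 = 1/3
print(sentence_probability("<s> the cat sat </s>".split()))
```

A stronger model (an n-gram model with smoothing, or a neural network) assigns better probabilities, but the scoring principle is the same.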

Data Acquisition

The Sotho-Tswana and Nguni language families are two of the largest language families in South Africa. Sepedi and isiZulu were chosen as their respective representative languages because, of the candidate languages, they had the largest available datasets of similar size. The datasets were collected from three separate sources:

NCHLT: The South African Centre for Digital Language Resources (SADiLaR) provides the NCHLT monolingual corpora, which cover all 11 of South Africa's official languages. A significant portion of the text is scraped from government websites.

Isolezwe: The Newstools Isolezwe corpus is a repository of news articles published in isiZulu.

Autshumato: The Autshumato dataset is approximately one third the size of the other datasets by total sentence count. It is drawn from the South African government domain and presented as parallel corpora between English and a number of other South African languages; a sketch of extracting monolingual text from such data follows below.
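Since Autshumato is parallel (bilingual) data while language modelling needs monolingual text, one plausible preprocessing step is to keep only the target-language side of each sentence pair. The sketch below assumes a hypothetical tab-separated layout (one "English<TAB>target" pair per line); the file names and format are assumptions for illustration, not the project's actual pipeline.

```python
# Hypothetical sketch: extract the target-language side of a parallel
# corpus for monolingual LM training. The tab-separated layout and the
# file names are assumptions for illustration only.

def extract_target_side(parallel_path: str, output_path: str) -> None:
    with open(parallel_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as out:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 2:  # skip blank or malformed lines
                out.write(fields[1] + "\n")

# Usage (hypothetical file names):
# extract_target_side("autshumato.en-zu.tsv", "isizulu_sentences.txt")
```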