Currently, few African languages are supported by existing word processors. IsiZulu, isiXhosa, Sepedi, SeSotho, Setswana, TshiVenda, XiTsonga, isiNdebele and isiSwati are collectively known as Bantu languages and form the largest language group in South Africa. Within the Bantu languages is a sub-group known as Nguni, which includes isiXhosa, the language this project focuses on. According to the 2011 census of South Africa, approximately 8.1 million people speak isiXhosa. This accounts for 16% of the population, second only to isiZulu, which is spoken by 23%. The first aim of this project was to create a spellchecker for isiXhosa that correctly performs isolated non-word error detection, as there is currently no standalone spellchecker for isiXhosa. The second aim was to investigate the accuracy of the isiXhosa spellchecker and assess whether it can match or exceed that of the standalone isiZulu spellchecker created by Ndaba et al., which has an accuracy of 89%.
The error detector used n-gram analysis, specifically character trigrams.
These trigrams were created from a word corpus received from Dr Mantoa-Masoko (the
client), combined with an online corpus downloaded from the North West University
digital languages resources website.
The corpus has to be cleaned and prepared before the spellchecker can use it. This is an
important step, as the documents used to create the corpus directly affect the
performance of the spellchecker.
A clean corpus is one that contains neither words from other languages nor punctuation.
Although no spellchecker is 100% accurate, a cleaner and larger corpus improves the
performance of the spellchecker.
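The cleaning step described above could look roughly like the following sketch. It is illustrative only and is not the project's actual pipeline: the function name clean_corpus, the foreign_words parameter and the rule of dropping any token that still contains non-letters are all assumptions made for this example.

```python
import string


def clean_corpus(text, foreign_words=frozenset()):
    """Return the word tokens of `text` that survive a simple cleaning pass.

    Assumptions (not taken from the original project):
    - tokens are separated by whitespace;
    - `foreign_words` is a hypothetical, caller-supplied set of lowercase
      words from other languages that should be removed.
    """
    cleaned = []
    for token in text.split():
        # Remove ASCII punctuation attached to the start or end of the token.
        word = token.strip(string.punctuation)
        # Drop empty tokens and tokens that still contain non-letters
        # (digits, internal hyphens, leftover symbols).
        if not word or not word.isalpha():
            continue
        # Drop words that appear in the hypothetical foreign-word list.
        if word.lower() in foreign_words:
            continue
        cleaned.append(word)
    return cleaned


sample = "Molweni, bahlobo! The 2011 census."
tokens = clean_corpus(sample, foreign_words={"the", "census"})
```

Identifying words from another language is the hard part in practice. The fixed word list here only stands in for whatever method was actually used.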
A trigram is a sequence of 3 consecutive characters. For example, if the input word is
“Molweni”, the resulting trigrams are “Mol”, “olw”,
“lwe”, “wen” and “eni”. In addition to storing the trigrams themselves, the frequency
of each trigram was stored as a key-value pair.
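As a minimal sketch of the idea, the code below extracts character trigrams, counts their frequencies as key-value pairs, and flags a word when one of its trigrams is rare in the corpus. The function names, the threshold parameter and the decision rule (flag if any trigram occurs fewer than threshold times) are assumptions for illustration. The text does not specify the exact rule the error detector used.

```python
from collections import Counter


def trigrams(word):
    """Return all runs of 3 consecutive characters in `word`, in order."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def build_trigram_counts(words):
    """Map each trigram (key) to its frequency (value) over the corpus words."""
    counts = Counter()
    for word in words:
        counts.update(trigrams(word))
    return counts


def is_flagged(word, counts, threshold=1):
    """Assumed rule: flag `word` if any trigram occurs fewer than `threshold` times.

    Words shorter than 3 characters have no trigrams and are never flagged.
    """
    return any(counts[t] < threshold for t in trigrams(word))


corpus_words = ["Molweni", "Molo"]          # toy corpus for illustration
counts = build_trigram_counts(corpus_words)
print(trigrams("Molweni"))                  # ['Mol', 'olw', 'lwe', 'wen', 'eni']
print(is_flagged("Molwxni", counts))        # True: 'lwx' never occurs in the corpus
```

A higher threshold makes the detector stricter, at the cost of flagging more valid but rare words.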
The user interface was designed to be as simple as possible, to avoid
overwhelming the user with a cluttered layout. The design process followed an
expert-mindset approach.
To evaluate the usability and look of the spellchecker, a usability study was conducted. The participants were students at the University of Cape Town who study isiXhosa at university level or who studied isiXhosa until grade 12. Each testing session was 30 minutes long. During this time, users were observed as they used the tool, to see how quickly they understood the tasks and how they went about starting and completing each one. Users were allowed to give feedback in real time as they used the tool during the study.
The statistical approach performs poorly for the isiXhosa spellchecker in comparison to the isiZulu spellchecker. This approach was chosen because Ndaba et al. used it for the isiZulu spellchecker, where it performed very well, giving an accuracy of 89%. IsiXhosa and isiZulu belong to the same language group, the Nguni group, and have similar language structures, so the statistical approach was expected to work well for both languages. However, the isiXhosa error detector performed worse than the isiZulu error detector, achieving an accuracy of 79%. It was also established that, despite the similarities between the languages, isiZulu trigrams cannot be used in place of isiXhosa trigrams.
This project aimed to create a data-driven statistical model to perform isolated non-word error detection for an isiXhosa spellchecker. This tool was successfully created as a desktop application and achieves an accuracy of 79%. A secondary aim was to investigate whether the isiXhosa spellchecker can achieve an accuracy similar to that of the isiZulu spellchecker. The isiXhosa spellchecker gave a lower accuracy of 79%, compared with 89% for the isiZulu spellchecker.