According to the 2011 census, approximately 8.1 million people speak isiXhosa, accounting for about 16% of South Africa's population, second only to isiZulu at 23%. As Nguni languages increase their presence in the digital sphere, the need for spellchecking tools for these languages grows. Very few spellcheckers exist for Nguni languages, and those that do are limited in scope, accuracy and functionality. Currently, two spellcheckers exist for isiZulu: one developed by Ndaba et al., which performs error detection but not error correction, and spellchecker.net, whose creator is unknown. IsiXhosa, on the other hand, has no existing spellchecker.
The aim of this project is to provide the African Language
department at the University of Cape Town with an isiXhosa
spellchecker that performs error detection, which does not
currently exist. A secondary aim is to investigate whether an
error detector implemented using a statistical approach is more
accurate than one using a rule-based approach. In addition, the
project aims to investigate whether a statistical approach can be
used to successfully implement an error corrector for the existing
isiZulu spellchecker developed by Ndaba et al.
The project aims to address the following research questions:
1. Is the rule-based approach more accurate at detecting misspelled words than the statistical approach?
2. Can a statistical error detector for isiXhosa achieve an accuracy of 85% or more?
3. Can an error corrector correct more than 85% of misspelled words in an input text using a statistical error model based on Bayes' rule?
The rule-based error detector for isiXhosa was implemented as a finite-state transducer network using SFST-PL, the programming language of the SFST toolkit, which supports many regular-expression formats such as those used in grep, sed and Perl. Drawing on isiXhosa morphology texts, we developed rules for nouns, verbs, adjectives, pronouns and possessives. Java Swing was then used to implement the interface shown on the left.
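To illustrate the rule-based idea, the sketch below approximates one small piece of it in Python: checking whether a word begins with a recognised isiXhosa noun-class prefix. This is an assumption-laden simplification for illustration only; the actual detector is a full SFST-PL finite-state transducer covering nouns, verbs, adjectives, pronouns and possessives, and the prefix list here is incomplete.

```python
# Illustrative sketch only: the real detector is an SFST-PL finite-state
# transducer. Here, a simplified (and incomplete) list of isiXhosa
# noun-class prefixes stands in for the full morphological rule set.
NOUN_PREFIXES = [
    "um", "aba", "imi", "ili", "ama", "isi", "izi",
    "in", "izin", "ulu", "ubu", "uku",
]

def accepts_as_noun(word: str) -> bool:
    """Return True if the word starts with a recognised noun-class prefix."""
    return any(word.lower().startswith(p) for p in NOUN_PREFIXES)

print(accepts_as_noun("umntu"))  # a well-formed noun onset -> True
```

In the real system, acceptance by the transducer network (not a simple prefix test) decides whether a word is morphologically well formed.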
An error detection module was created using n-gram analysis.
Character trigrams were extracted from the corpus and stored with
their corresponding frequencies. The probability of a trigram t
occurring in the corpus was calculated as

P(t) = f(t) / N

where f(t) is the frequency of trigram t and N is the total number
of trigrams in the corpus.
Each trigram probability is then compared with a predetermined
threshold (0.003) during error detection. If a trigram's probability
falls below the threshold, the word is flagged as incorrect; otherwise,
the word is flagged as correct.
The isiZulu error corrector was developed using a statistical approach. Probabilities, trigrams and the Levenshtein distance were used to obtain candidate corrections for words flagged as incorrectly spelled by the error detector. Java Swing was used to incorporate suggestions for flagged words into the existing isiZulu spellchecker interface, displayed on the left.
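A minimal sketch of the candidate-generation step is given below. The exact scoring used in the project is not specified on the poster; here, as an assumption, edit distance stands in for the Bayesian error model P(w|c) and corpus frequency for the language model P(c), with `lexicon_freq`, `max_dist` and `k` being hypothetical names and parameters.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon_freq, max_dist=2, k=5):
    """Rank lexicon words within max_dist edits of `word`.
    In Bayes'-rule terms: smaller edit distance ~ higher P(w|c),
    higher corpus frequency ~ higher P(c)."""
    cands = [(levenshtein(word, c), -f, c) for c, f in lexicon_freq.items()]
    cands = [t for t in cands if t[0] <= max_dist]
    return [c for _, _, c in sorted(cands)[:k]]
```

The ranked suggestions would then be shown next to each flagged word in the Java Swing interface.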
Siseko Neti: Responsible for the implementation of the rule-based spelling error detector for isiXhosa.
Frida Mjaria: Responsible for the implementation of the spelling error corrector for isiZulu.
Nthabiseng Mashiane: Responsible for the implementation of the statistical-based spelling error detector for isiXhosa.
Dr Maria Keet: Project supervisor.