According to the 2011 census, approximately 8.1 million people speak isiXhosa, accounting for about 16% of South Africa's population, second only to isiZulu at 23%. As Nguni languages increase their presence in the digital sphere, the need for spellchecking tools for these languages grows. Very few spellcheckers exist for Nguni languages, and those that do are limited in scope, accuracy and functionality. Currently, two spellcheckers exist for isiZulu: one developed by Ndaba et al., which performs error detection but not error correction, and spellchecker.net, whose creator is unknown. IsiXhosa, on the other hand, has no existing spellchecker.
The aim of this project is to provide the African Language
department at the University of Cape Town with an isiXhosa
spellchecker that performs error detection, which does not
currently exist. A secondary aim is to investigate whether an
error detector implemented using a statistical approach is more
accurate than one using a rule-based approach. In addition, the
project aims to investigate whether a statistical approach can be
used to successfully implement an error corrector for the existing
isiZulu spellchecker developed by Ndaba et al.
The project aims to address the following research questions:
1. Is the rule-based approach more accurate at detecting misspelled words than the statistical approach?
2. Can a statistical error detector for isiXhosa achieve an accuracy of 85% or more?
3. Can an error corrector correct more than 85% of misspelled words in an input text using a statistical error model based on Bayes' rule?
The rule-based error detector for isiXhosa was implemented as a finite-state transducer network using SFST-PL, the programming language of the SFST toolkit, which supports many regular-expression formats such as those used in grep, sed and Perl. Drawing on isiXhosa morphology texts, we developed rules for nouns, verbs, adjectives, pronouns and possessives. Java Swing was then used to implement the interface shown on the left.
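To illustrate the rule-based idea, the sketch below approximates one small piece of it in Python: checking whether a word begins with a recognised isiXhosa noun-class prefix. This is an assumption-laden simplification for illustration only; the actual detector is a full SFST-PL finite-state transducer covering nouns, verbs, adjectives, pronouns and possessives, and the prefix list here is incomplete.

```python
# Illustrative sketch only: the real detector is an SFST-PL finite-state
# transducer. Here, a simplified (and incomplete) list of isiXhosa
# noun-class prefixes stands in for the full morphological rule set.
NOUN_PREFIXES = [
    "um", "aba", "imi", "ili", "ama", "isi", "izi",
    "in", "izin", "ulu", "ubu", "uku",
]

def accepts_as_noun(word: str) -> bool:
    """Return True if the word starts with a recognised noun-class prefix."""
    return any(word.lower().startswith(p) for p in NOUN_PREFIXES)

print(accepts_as_noun("umntu"))  # a well-formed noun onset -> True
```

In the real system, acceptance by the transducer network (not a simple prefix test) decides whether a word is morphologically well formed.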
An error detection module was created using n-gram analysis.
Character trigrams were extracted from the corpus and stored with
their corresponding frequencies. The probability of a trigram t
occurring in the corpus was calculated as

P(t) = f(t) / N

where f(t) is the frequency of trigram t and N is the total number
of trigrams in the corpus.
Each trigram probability is then compared with a predetermined
threshold (0.003) during error detection. If a trigram's probability
falls below the threshold, the word is flagged as incorrect; otherwise,
the word is flagged as correct.
The isiZulu error corrector was developed using a statistical approach. Probabilities, trigrams and the Levenshtein distance were used to obtain candidate corrections for words flagged as incorrectly spelled by the error detector. Java Swing was used to incorporate suggestions for flagged words into the existing isiZulu spellchecker interface, displayed on the left.
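A minimal sketch of the candidate-generation step is given below. The exact scoring used in the project is not specified on the poster; here, as an assumption, edit distance stands in for the Bayesian error model P(w|c) and corpus frequency for the language model P(c), with `lexicon_freq`, `max_dist` and `k` being hypothetical names and parameters.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def suggest(word, lexicon_freq, max_dist=2, k=5):
    """Rank lexicon words within max_dist edits of `word`.
    In Bayes'-rule terms: smaller edit distance ~ higher P(w|c),
    higher corpus frequency ~ higher P(c)."""
    cands = [(levenshtein(word, c), -f, c) for c, f in lexicon_freq.items()]
    cands = [t for t in cands if t[0] <= max_dist]
    return [c for _, _, c in sorted(cands)[:k]]
```

The ranked suggestions would then be shown next to each flagged word in the Java Swing interface.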
Siseko Neti: Responsible for the implementation of the rule-based spelling error detector for isiXhosa.
Frida Mjaria: Responsible for the implementation of the spelling error corrector for isiZulu.
Nthabiseng Mashiane: Responsible for the implementation of the statistical-based spelling error detector for isiXhosa.
Dr Maria Keet: Project supervisor.