Email: hussein-at-cs.uct.ac.za
Email: mkeet-at-cs.uct.ac.za
The problem that this project aims to address is that few spell checkers are available for African languages, and research in this field has largely stalled. Research on spell checkers began in the late 1950s, yet the first spell checker for an African language was developed only in 1992. Two spell-checking options were found for isiZulu: one for Apache OpenOffice and one as a Mozilla Firefox extension, both last updated in 2009. This is a problem because languages change over time; although the change is gradual, it must be taken into account when researching a language. Continued research is therefore needed to keep pace with the evolution of natural languages, since the longer the existing spell checkers go without updates, the less fully they will represent the isiZulu language.
The research questions are:
1. How effective is each philosophy in performing spell checking for the isiZulu language?
2. Of the philosophies used in this project, which produces the most accurate isiZulu spell checker?
The objective of this project is to develop a set of two spell checkers, each built according to a different philosophy: the theory-driven linguistic model and the data-driven statistical method.

The theory-driven linguistic model is a rule-based system whose rules are compiled from the knowledge (theory) of the language. The rule-based system will use finite state machines: the knowledge of isiZulu is represented as a set of states, and the input received from the user is restricted by the set of rules defined by the finite state machine (a minimal sketch is given below).

The second spell checker will be designed and implemented using the data-driven statistical model. A statistical model represents the process of generating data and embodies a set of assumptions about how the observed data were generated; progress in this approach is driven by the data itself. The statistical model makes use of N-grams, which are contiguous sequences of n items taken from a sentence or word. It will use both word n-grams and letter n-grams, specifically unigrams, bigrams and trigrams, and N-gram statistics will be used to identify where and how often these word and character N-grams appear in the corpus. The statistical model will also make use of a corpus, which can be seen as a large and structured collection of electronic text.
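To make the rule-based approach concrete, the short Python sketch below builds a deterministic finite-state acceptor whose states are the prefixes of valid surface forms. The noun prefixes and stems listed are purely illustrative placeholders, not actual isiZulu morphology; the sketch only shows how valid words correspond to paths through a set of states, so that any word for which no transition exists is flagged as a possible misspelling.

# Minimal sketch of a finite-state acceptor for spell checking.
# The prefixes and stems below are illustrative only; a real isiZulu
# checker would encode the language's actual morphological rules.
PREFIXES = {"um", "aba", "isi", "izi"}
STEMS = {"fundi", "fundo", "zulu"}

def build_fsm():
    """Build a transition table (state, char) -> next state, plus accepting states."""
    transitions = {}
    accepting = set()
    start = ""
    # Each surface form prefix+stem becomes a path of states, where a
    # state is simply the string consumed so far.
    for prefix in PREFIXES:
        for stem in STEMS:
            word = prefix + stem
            state = start
            for ch in word:
                nxt = state + ch
                transitions[(state, ch)] = nxt
                state = nxt
            accepting.add(state)
    return transitions, accepting, start

def accepts(word, transitions, accepting, start=""):
    """Run the machine over the word; reject as soon as no transition exists."""
    state = start
    for ch in word:
        if (state, ch) not in transitions:
            return False
        state = transitions[(state, ch)]
    return state in accepting

if __name__ == "__main__":
    fsm = build_fsm()
    for w in ["umfundi", "abafundi", "umfundx"]:
        print(w, "accepted" if accepts(w, *fsm) else "flagged as misspelled")

In practice the transition table would be compiled from the linguistic rules of isiZulu rather than enumerated from word lists, but the checking step is the same: a word is accepted only if it traces a complete path to an accepting state.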
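The letter n-gram idea can likewise be illustrated with a short Python sketch that uses an invented toy corpus. It counts character unigrams, bigrams and trigrams over a list of words and flags any word containing trigrams that never occur in the corpus; the actual checker would be trained on a large isiZulu text corpus rather than this handful of example words.

from collections import Counter

def char_ngrams(word, n):
    """Return the character n-grams of a word (e.g. n=2 gives bigrams)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_ngram_counts(corpus_words, sizes=(1, 2, 3)):
    """Count character unigrams, bigrams and trigrams over a word list."""
    counts = {n: Counter() for n in sizes}
    for word in corpus_words:
        for n in sizes:
            counts[n].update(char_ngrams(word, n))
    return counts

def unseen_trigrams(word, counts):
    """Trigrams of the word that never occur in the corpus: a simple error signal."""
    return [g for g in char_ngrams(word, 3) if counts[3][g] == 0]

if __name__ == "__main__":
    # Tiny illustrative corpus; a real checker would use a large isiZulu corpus.
    corpus = ["umfundi", "abafundi", "isizulu", "ukufunda"]
    counts = build_ngram_counts(corpus)
    for w in ["umfundi", "umfxndi"]:
        bad = unseen_trigrams(w, counts)
        print(w, "looks fine" if not bad else f"suspicious trigrams: {bad}")

The same counting step extends naturally to word n-grams taken from sentences, which is how the statistical checker can also take context into account rather than judging each word in isolation.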