An isiZulu spellchecker

Using a rule-based linguistic model

Introduction

This spell checker is built on the morphological rules of isiZulu. The rules are implemented using finite state automata and regular expressions. These tools allow us to create a morphological analyzer that we can use to perform error recall (finding incorrectly spelled words within the user's input) and lexical recall (finding correctly spelled words within the user's input). In the context of spell checkers, these two are the most common functions a spell checker can have.
The rules of the language are represented as finite state automata. A finite state automaton is an abstract machine with a finite number of states. These states model the behaviour of the system, so there is only a limited number of conditions that the system can accept. A finite state automaton needs at least a start state and an end (accepting) state, and there may be many intermediate states between them. The transitions that connect one state to another are referred to as arcs. Using the morphological makeup of the language, we can create a finite state automaton that expresses all of the conditions/rules that the system allows. There are two types of finite state automata: deterministic and non-deterministic.
A deterministic finite state automaton (DFA) is more constrained: each state must read an input symbol that satisfies a condition before it can move on, and that symbol leads to exactly one next state. In a non-deterministic finite state automaton (NFA), a state can progress to another state without consuming any input, and the same input can leave a state along multiple arcs, so the route from the start state to the end state is not predefined. An NFA is the type of automaton that was used to design this morphological analyzer. Since the route from the start state to the end state is not predefined, it is possible to model words of a similar nature together. For example, amanga, ilanga and uyapanga all share the same input on the way into the final state, but each of them leaves the start state differently. This makes the NFA approach the most favorable for modelling languages, as it can be used to model singular words, plural words and many other rules of linguistics.
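The idea of different openings converging on one shared ending can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the state names, the helper functions and the points at which the three example words are split are hypothetical and purely string-level. They are not the project's actual automaton and not a morphological analysis of the words.

```python
from typing import Dict, Set, Tuple

EPSILON = ""  # label for an arc that consumes no input

Transitions = Dict[Tuple[str, str], Set[str]]


def build_toy_nfa() -> Tuple[Transitions, str, Set[str]]:
    """Build a toy NFA: three openings that share the ending 'anga'."""
    delta: Transitions = {}

    def add_path(src: str, text: str, dst: str, tag: str) -> None:
        # Add a chain of states that spells `text` from src to dst.
        current = src
        for i, ch in enumerate(text):
            nxt = dst if i == len(text) - 1 else f"{tag}{i}"
            delta.setdefault((current, ch), set()).add(nxt)
            current = nxt

    # Hypothetical string-level openings for amanga, ilanga, uyapanga.
    for tag, opening in (("a", "am"), ("i", "il"), ("u", "uyap")):
        add_path("start", opening, f"{tag}_done", tag)
        # Epsilon arc: jump to the shared ending without reading input.
        delta.setdefault((f"{tag}_done", EPSILON), set()).add("suffix")

    add_path("suffix", "anga", "end", "s")
    return delta, "start", {"end"}


def epsilon_closure(states: Set[str], delta: Transitions) -> Set[str]:
    """Return all states reachable from `states` using only epsilon arcs."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in delta.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def accepts(word: str, delta: Transitions, start: str, finals: Set[str]) -> bool:
    """Simulate the NFA by tracking every state it could be in."""
    current = epsilon_closure({start}, delta)
    for ch in word:
        moved: Set[str] = set()
        for state in current:
            moved |= delta.get((state, ch), set())
        current = epsilon_closure(moved, delta)
        if not current:
            return False  # no arc matches, so the word is rejected
    return bool(current & finals)


if __name__ == "__main__":
    delta, start, finals = build_toy_nfa()
    for word in ("amanga", "ilanga", "uyapanga", "anga"):
        print(word, accepts(word, delta, start, finals))
```

The simulation keeps a set of possible states rather than a single state, which is what lets several routes be explored at once.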
We used the software tool JFLAP to create the non-deterministic finite state automata, which allowed us to experiment with the language and gave us a full view and interpretation of it. From the finite state automata we are then able to generate regular expressions. These regular expressions provide the computational rules that can be used to build the morphological analyzer.
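As a rough illustration of how a regular expression can serve as a computational rule, the sketch below uses Python's standard re module. The pattern is a hypothetical one, written by hand for a small automaton that accepts only amanga, ilanga and uyapanga. It is not one of the expressions generated for this project, and the function name is an assumption made for the example.

```python
import re

# Hypothetical rule: three openings followed by the shared ending "anga".
# Written by hand for illustration; not one of the project's real rules.
TOY_RULE = re.compile(r"(?:am|il|uyap)anga")


def is_recognised(word: str) -> bool:
    """Return True only when the whole word matches the toy rule."""
    return TOY_RULE.fullmatch(word.lower()) is not None


if __name__ == "__main__":
    for word in ("ilanga", "Amanga", "ilang4", "anga"):
        print(word, is_recognised(word))
```

Using fullmatch rather than search means a word is accepted only if the rule covers it from the first character to the last, which mirrors an automaton that must reach its end state exactly when the input runs out.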

Evaluation

This spell checker was tested against words from the Ukwabelana corpus and the 2ml token corpus. There was also a mini word list that tested the extreme cases of the spell checker: strings that include numbers and special characters. The method used to test the system's accuracy was the confusion matrix. In this method, there are four key outcomes to test for. True positives (TP): correctly spelled words identified as correctly spelled. True negatives (TN): incorrectly spelled words identified as incorrectly spelled. False positives (FP): incorrectly spelled words identified as correctly spelled. False negatives (FN): valid words identified as incorrectly spelled. We took 50 words from the Ukwabelana corpus, 50 words from the 2ml token corpus and 12 words from the mini word list.
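The four confusion matrix outcomes can be tallied as in the sketch below. It assumes that accuracy is computed in the usual way, as (TP + TN) divided by the total number of words tested. The function names, the sample words and the stand-in checker are all hypothetical and are not the project's data or implementation.

```python
from typing import Callable, Dict, Iterable, Tuple


def confusion_counts(
    samples: Iterable[Tuple[str, bool]],
    checker: Callable[[str], bool],
) -> Dict[str, int]:
    """Tally TP, TN, FP and FN for (word, is_actually_valid) pairs."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for word, is_valid in samples:
        accepted = checker(word)
        if is_valid and accepted:
            counts["TP"] += 1  # valid word accepted
        elif not is_valid and not accepted:
            counts["TN"] += 1  # misspelled word flagged
        elif not is_valid and accepted:
            counts["FP"] += 1  # misspelled word wrongly accepted
        else:
            counts["FN"] += 1  # valid word wrongly flagged
    return counts


def accuracy(counts: Dict[str, int]) -> float:
    """Assumed formula: (TP + TN) / total; returns 0.0 for no samples."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical labelled words; the misspellings are invented.
    samples = [
        ("ilanga", True),
        ("amanga", True),
        ("uyapanga", True),
        ("ilang4", False),
        ("il@nga", False),
    ]
    # Stand-in checker that accepts a fixed set of strings.
    accepted_strings = {"ilanga", "amanga", "il@nga"}
    counts = confusion_counts(samples, lambda w: w in accepted_strings)
    print(counts, accuracy(counts))
```

With these invented samples the stand-in checker produces one false positive (il@nga) and one false negative (uyapanga), so three of the five words are classified correctly.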

Results

On the Ukwabelana corpus, the spell checker registered an accuracy of 86 percent. Because the morphological analyzer was created using a step-by-step procedure, it gives us a more in-depth look at the language, which allows the spell checker to represent isiZulu more accurately. On the 2ml corpus, the accuracy was 92 percent. This is surprising, because this corpus contains strings of very old isiZulu, in a dialect that is a little different.