An isiZulu spellchecker

Using a data-driven statistical language model

Introduction

This spellchecker was developed using a data-driven statistical language model. It comprises three stages: pre-processing, error detection and post-processing. The system predicts whether an isiZulu word is spelled correctly or incorrectly, and lets a user check either a single word or a whole sentence. Because the model is data driven, the system cannot run without data: the decision on whether a word is correctly spelled is based not on personal intuition but on the lexical content of a dictionary or corpus. Statistics computed from a corpus of words guide that decision. The system was evaluated on speed and accuracy.

Problem statement

Decisions taken by this spellchecker are based on the corpus, a large collection of texts. A spellchecker is only as good as its corpus, so corpora must be selected and used with care. The availability of such corpora is essential as well. We used three corpora to test the system. Because language evolves, corpora must be kept up to date to yield good results.

Algorithm

The error detector of this spellchecker uses n-gram analysis. An n-gram is an n-letter subsequence of a string, where n is usually 1, 2, or 3. The system employs character-based n-gram language models. For example, the isiZulu word "isenzo" has the trigrams "ise", "sen", "enz", "nzo" and the four-grams "isen", "senz", "enzo". The n-gram analysis technique checks each n-gram in the input text against an existing table of n-gram statistics, that is, the frequency counts or probabilities of occurrence of n-grams. Each n-gram's statistic is compared with a predetermined threshold. If the frequency of an n-gram is below the threshold, the word is flagged as wrongly spelled. This means that words containing n-grams that never occur, or that occur only infrequently, are considered misspellings. For the detailed characteristics of the corpora, see the report in the project archives.

The system was developed in Java, with MySQL for storage.
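The detection step described above can be sketched in a few lines of Java. This is only an illustrative sketch, not the project's actual code: the class name `NGramDetector`, its methods, and the in-memory map standing in for the MySQL table are all hypothetical. It also assumes that the n-gram statistic is a relative frequency (count divided by the total number of n-grams seen in training) and that a word is flagged as soon as one of its n-grams falls below the threshold.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of character n-gram error detection.
class NGramDetector {
    private final int n;                                   // n-gram length, e.g. 3 or 4
    private final Map<String, Integer> counts = new HashMap<>();
    private long total = 0;                                // total n-grams seen in training

    NGramDetector(int n) {
        this.n = n;
    }

    // All contiguous character n-grams of a word, in order of occurrence.
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Build the n-gram frequency table from a corpus of words.
    void train(Iterable<String> words) {
        for (String w : words) {
            for (String g : ngrams(w.toLowerCase(Locale.ROOT), n)) {
                counts.merge(g, 1, Integer::sum);
                total++;
            }
        }
    }

    // Assumption: the statistic is the relative frequency of the n-gram.
    double relativeFrequency(String gram) {
        if (total == 0) {
            return 0.0;
        }
        return counts.getOrDefault(gram, 0) / (double) total;
    }

    // Flags the word if any of its n-grams is rarer than the threshold.
    // Words shorter than n have no n-grams and are never flagged here.
    boolean isMisspelled(String word, double threshold) {
        for (String g : ngrams(word.toLowerCase(Locale.ROOT), n)) {
            if (relativeFrequency(g) < threshold) {
                return true;
            }
        }
        return false;
    }
}
```

With a detector created as `new NGramDetector(3)` and trained on a word list containing "isenzo", a call such as `isMisspelled("isxnzo", 0.003)` flags the word, because the trigram "isx" never occurs in the training data.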

Evaluation

We used 10-fold cross-validation to partition each corpus into a training data set and a testing data set. The corpus is randomly split into ten sets of equal size. We carried out 10 experiments, each using 9 folds for training and the unique words from the remaining fold for testing, so that each fold is used for testing exactly once. Assuming that the corpus contains no incorrect words, we intentionally added 46 known misspelled words to the testing data set in each experiment. The same misspelled words were fed to the spellchecker as testing rotated across the 10 folds. The performance of the system on each test was evaluated using a confusion matrix, and the measures from all 10 tests were averaged to obtain the accuracy rate. This experiment was carried out on all three corpora. In the second phase of the experiment, the spellchecker was trained with each corpus and tested with the entire unique word set of each of the other two corpora, with the results stated for each corpus separately. The performance of the system on each of these tests was also evaluated using a confusion matrix.
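The fold rotation can be illustrated with the following Java sketch. It is a minimal example under stated assumptions rather than the project's code: the class `TenFoldSplit`, its method names and the fixed random seed are hypothetical, and the folds are built by shuffling once and dealing items out round-robin, which gives equal-sized folds only when the corpus size is a multiple of k.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of k-fold partitioning for the evaluation.
class TenFoldSplit {

    // Shuffle a copy of the corpus, then deal items into k folds round-robin.
    static <T> List<List<T>> folds(List<T> items, int k, long seed) {
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            folds.get(i % k).add(shuffled.get(i));
        }
        return folds;
    }

    // Training data: every fold except the one held out for testing.
    static <T> List<T> trainingSet(List<List<T>> folds, int testIndex) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != testIndex) {
                train.addAll(folds.get(i));
            }
        }
        return train;
    }

    // Testing data: unique items of the held-out fold plus the injected
    // known misspellings (46 words in the evaluation described above).
    static <T> Set<T> testSet(List<List<T>> folds, int testIndex, Collection<T> injected) {
        Set<T> test = new LinkedHashSet<>(folds.get(testIndex));
        test.addAll(injected);
        return test;
    }
}
```

Looping `testIndex` from 0 to 9 and calling `trainingSet` and `testSet` for each value reproduces the rotation in which every fold is used for testing exactly once.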

The confusion matrix classifies the results of a test into true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The system was evaluated at thresholds of 0.003 and 0.004. For more details, see the report in the project archives.
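As an illustration of how an accuracy rate can be derived from these four counts, the Java sketch below uses the standard definition, accuracy = (TP + TN) / (TP + TN + FP + FN). The text here does not spell out the formula or which class is treated as positive, so both are assumptions: the class `ConfusionMatrix` and its methods are hypothetical, and the caller decides what "positive" means.

```java
// Hypothetical sketch of a confusion matrix with the standard accuracy formula.
class ConfusionMatrix {
    private int tp;
    private int tn;
    private int fp;
    private int fn;

    // Record one test word. Which class is "positive" is the caller's choice.
    void record(boolean actualPositive, boolean predictedPositive) {
        if (actualPositive && predictedPositive) {
            tp++;
        } else if (!actualPositive && !predictedPositive) {
            tn++;
        } else if (!actualPositive) {
            fp++;   // predicted positive, actually negative
        } else {
            fn++;   // predicted negative, actually positive
        }
    }

    // Accuracy = (TP + TN) / (TP + TN + FP + FN); 0.0 if nothing was recorded.
    double accuracy() {
        int total = tp + tn + fp + fn;
        return total == 0 ? 0.0 : (tp + tn) / (double) total;
    }

    // Mean of the per-fold accuracies, as used to summarise the 10 tests.
    static double meanAccuracy(double[] perFold) {
        if (perFold.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double a : perFold) {
            sum += a;
        }
        return sum / perFold.length;
    }
}
```

One matrix would be filled per test fold, and `meanAccuracy` applied to the ten resulting accuracy values.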

Results and Conclusions

The spellchecker is accurate in detecting words that do not occur in the training corpus. All three corpora performed well in 10-fold cross-validation, where each was tested with its own fragments. For trigrams at the 0.003 threshold, the Ukwebalana corpus gave an accuracy rate of 85%, the Prof. Langa corpus 67% and the news items corpus 76%. All of these accuracy rates were above 50%, and the Ukwebalana corpus had the best results.

Testing the corpora against each other told a different story. For trigrams, the Ukwebalana corpus gave a 53% accuracy rate when tested with the Prof. Langa corpus and 70% when tested with the news items corpus. The news items corpus gave 89% when tested with the Prof. Langa corpus and 27% when tested with the Ukwebalana corpus. Lastly, the Prof. Langa corpus gave 89% when tested with the news items corpus and 41% when tested with the Ukwebalana corpus. Both of the other corpora therefore gave accuracy rates below 50% when tested with Ukwebalana.

These results show that a spellchecker is only as good as its corpus, and that it is imperative to keep corpora up to date: an outdated corpus can lead to poor performance. The newly composed corpora performed very badly when tested with old words from the Ukwebalana corpus. Because Ukwebalana contained highly frequent words, however, it still performed well when tested with the other corpora, which means that most of the trigrams in the testing corpora were present in the Ukwebalana corpus. We deduced that the larger the corpus, the better the performance: a larger corpus exposes the highly frequent words, and the spellchecker targets those. The most up-to-date corpora and the 0.003 threshold are preferable.

For four-grams at the 0.003 threshold, the Ukwebalana corpus gave an accuracy rate of 80%, the Prof. Langa corpus 63% and the news items corpus 79%. Testing both of the other corpora with Ukwebalana again gave accuracy rates below 50%. The Ukwebalana corpus gave a 50% accuracy rate when tested with the Prof. Langa corpus and 69% when tested with the news items corpus. The news items corpus gave 88% when tested with the Prof. Langa corpus and 27% when tested with the Ukwebalana corpus. Lastly, the Prof. Langa corpus gave 86% when tested with the news items corpus and 41% when tested with the Ukwebalana corpus.

The spellchecker performed slightly better with trigrams than with four-grams. The probability of finding the four-grams of a word was lower than the probability of finding its trigrams, which increased the number of false negatives in each test. For detailed results, see the project archives.


Sample performance of n-fold cross-validation for the Ukwebalana corpus, the Prof. Langa corpus and the news items corpus. For all results, see the project archives.