MORPH-SEGMENT

Morphological Segmentation of Low Resource Languages Using a Variety of Machine Learning Models

Background Information

Morphological segmentation is a linguistic operation wherein words are separated into their component morphemes. Morphemes are the smallest building blocks of language that carry meaning on their own. This operation is useful because it facilitates the study of words at a granular level. We aim to perform it using machine learning, both supervised and unsupervised. We implemented three machine learning models, namely Conditional Random Fields (CRFs), sequence-to-sequence models and Morfessor-based models, to perform the segmentation. Low-resource languages are languages for which there is a lack of text data and of Natural Language Processing applications. Where such applications do exist they tend to have low accuracy, and it is our hope that this project and the models that come from it will help in the creation of higher-accuracy applications. We focused on languages in the Nguni family, namely isiNdebele, isiXhosa, isiZulu and siSwati. These languages are all agglutinative, meaning their words are formed solely by concatenating morphemes.

Morphological Segmentation

There are two ways in which words can be segmented: surface segmentation and canonical segmentation. Under surface segmentation, a given word is segmented into a sequence of substrings which, when concatenated, form the original word. Under canonical segmentation, the word is analysed and segmented into a sequence of canonical morphemes, where each canonical morpheme has a corresponding surface morpheme as its orthographic representation. See below for an example.
Word: attainability
Surface Segmentation: attain-abil-ity
Canonical Segmentation: attain-able-ity
After the segments have been predicted, the output can be further divided by whether or not the segments are labelled. If they are, a label is predicted for each segment, and these labels correspond to the function of the morpheme within the word as a whole.
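The defining property of a surface segmentation, that its segments concatenate back to the original word, can be checked mechanically. The following sketch (illustrative helper names, not from our codebase) makes the distinction from canonical segmentation concrete:

```python
def surface_segments(segmented: str, sep: str = "-"):
    """Split a hyphen-separated surface segmentation into its morphs."""
    return segmented.split(sep)

def is_valid_surface(word: str, segmented: str, sep: str = "-") -> bool:
    """Surface segments must concatenate back to the original word."""
    return "".join(surface_segments(segmented, sep)) == word

# Surface segmentation: concatenation recovers the word.
assert is_valid_surface("attainability", "attain-abil-ity")
# Canonical segmentation need not: morphemes are normalised forms.
assert not is_valid_surface("attainability", "attain-able-ity")
```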

Project Goal

The goal of this project is to determine whether or not the models mentioned above can be successfully applied to the task of morphological segmentation. We will determine their success through the use of three metrics: precision, recall and F1 score. Additionally, we aim to perform a high-level comparison of the models and their implementations to determine which is best suited to the task.
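These metrics are commonly computed over predicted morph boundary positions. The sketch below shows one such formulation (the example word and its segmentation are illustrative, and the exact evaluation scheme used in the project may differ):

```python
def boundaries(segmented: str, sep: str = "-"):
    """Character offsets in the unsegmented word where a morph boundary falls."""
    positions, offset = set(), 0
    for morph in segmented.split(sep)[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions

def precision_recall_f1(gold: str, predicted: str):
    """Boundary-level precision, recall and F1 between two segmentations."""
    g, p = boundaries(gold), boundaries(predicted)
    tp = len(g & p)  # correctly predicted boundaries
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One gold boundary missed: perfect precision, half recall.
p, r, f = precision_recall_f1("u-ya-hamba", "uya-hamba")  # p = 1.0, r = 0.5
```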

Data Used

The data that we made use of in the training and testing of our models came from the National Centre for Human Language Technology (NCHLT). It comprised several text corpora on which we needed to perform some data cleaning operations before they were ready for use. The data we received came in the following form:
ngezinkonzo   khonzo   P   (RelConc)-nga[NPre]-i(NPrePre)-zin[BPre]-konzo[NStem]
Where the first element is the word itself, the second is the root morpheme, the third is the part of speech of the word and the fourth is the labelled canonical segmented form of the word.
After multiple cleaning operations we were able to convert the data to the following form:
ngezinkonzo   nge-zin-konzo   nge[NPre]zin[BPre]konzo[NStem]   nga[NPre]i[NPrePre]zin[BPre]konzo[NStem]
Where the first element is the word, the second is the surface segmented form of the word, the third is the labelled surface form of the word and the fourth is the cleaned labelled canonical segmented form of the word.
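One part of this cleaning is extracting (morpheme, label) pairs from the labelled analysis string. A minimal sketch, assuming the bracketing conventions visible in the example above (square brackets and parentheses around labels; the regex and function name are our own illustration, not the project's actual cleaning code):

```python
import re

# Matches a lowercase morpheme followed by its label in [] or () brackets,
# e.g. "nga[NPre]" or "i(NPrePre)". Bare markers such as "(RelConc)" that
# have no preceding morpheme are skipped, mirroring the cleaned output.
MORPH = re.compile(r"([a-z]+)[\[\(]([A-Za-z]+)[\]\)]")

def parse_analysis(analysis: str):
    """Extract (morpheme, label) pairs from a labelled canonical analysis."""
    return MORPH.findall(analysis)

pairs = parse_analysis("(RelConc)-nga[NPre]-i(NPrePre)-zin[BPre]-konzo[NStem]")
# → [('nga', 'NPre'), ('i', 'NPrePre'), ('zin', 'BPre'), ('konzo', 'NStem')]
```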

Results & Discussion

For the sake of brevity, we show only the best results of our respective models; further results can be found in our deliverables.

Traditional CRF

Of the CRFs implemented, the traditional CRF gave the most favourable results. As can be seen, the surface segmentation model gives state-of-the-art performance, segmenting most words correctly. This performance is more than good enough for use in the development of Natural Language Processing applications; furthermore, because of the generalisable nature of the implementation, the same model should be usable for other agglutinative languages given the necessary data in the correct format.
The model for the labelling of correct segments performed fairly well, with all scores over 70%, though it is nowhere near as good as the surface segmentation model. It could similarly be applied to other agglutinative languages, but it might be beneficial to tune its features further before it is ready for use in Natural Language Processing applications.
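CRF-based surface segmentation is typically cast as per-character sequence labelling. The sketch below shows one common encoding (B = begins a morph, I = inside a morph) and an illustrative character-window feature set; the actual label scheme and features our CRF used may differ:

```python
def char_labels(word: str, segmented: str, sep: str = "-"):
    """Label each character 'B' if it begins a morph, 'I' otherwise."""
    labels = []
    for morph in segmented.split(sep):
        labels.extend(["B"] + ["I"] * (len(morph) - 1))
    assert len(labels) == len(word)  # segments must cover the word exactly
    return labels

def char_features(word: str, i: int, window: int = 2):
    """Character-window features for position i (an illustrative feature set)."""
    feats = {"char": word[i]}
    for d in range(1, window + 1):
        feats[f"prev{d}"] = word[i - d] if i - d >= 0 else "<s>"
        feats[f"next{d}"] = word[i + d] if i + d < len(word) else "</s>"
    return feats

labels = char_labels("ngezinkonzo", "nge-zin-konzo")
# → ['B', 'I', 'I', 'B', 'I', 'I', 'B', 'I', 'I', 'I', 'I']
```

A CRF trained on such (features, label) sequences predicts B/I tags for unseen words, from which the surface segments are read off directly.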

Language Precision (%) Recall (%) F1 Score (%)
isiNdebele 97.94 96.62 97.27
isiXhosa 97.16 97.13 97.14
isiZulu 97.88 96.82 97.35
siSwati 97.17 96.40 96.78

Average Surface Segmentation Results

Language Precision (%) Recall (%) F1 Score (%)
isiNdebele 77.07 78.24 77.65
isiXhosa 71.16 71.50 71.33
isiZulu 71.82 72.11 71.69
siSwati 84.69 90.05 87.29

Average Labelling of Correct Segments Results

Morfessor-Baseline & Entropy-based model

In the initial evaluation of the models, Morfessor-Baseline performed marginally better than the best performing entropy-based model. The best performing entropy-based model used an objective function that produces a morpheme boundary whenever the sum of the left and right entropy at a position in a word exceeds an experimentally determined constant. After a few adjustments, this entropy-based model outperformed Morfessor-Baseline: the entropy-based model achieved an average F1 score of 31.51%, while Morfessor attained an average F1 score of 27.69%. The following tables provide more detail on the performance of Morfessor-Baseline and the entropy-based model, respectively.
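The entropy-based objective described above can be sketched as follows. This is a minimal, self-contained illustration (toy corpus, fixed context length, and hypothetical function names, not the project's implementation): boundaries are placed wherever the sum of left and right branching entropy exceeds a threshold constant.

```python
import math
from collections import Counter, defaultdict

def branching_counts(words, n=2):
    """right[ctx] counts characters that follow context ctx;
    left[ctx] counts characters that precede context ctx."""
    right, left = defaultdict(Counter), defaultdict(Counter)
    for w in words:
        for i in range(len(w)):
            for k in range(1, n + 1):
                if i - k >= 0:
                    right[w[i - k:i]][w[i]] += 1
                if i + k + 1 <= len(w):
                    left[w[i + 1:i + k + 1]][w[i]] += 1
    return right, left

def entropy(counter):
    """Shannon entropy (bits) of a character distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def segment(word, right, left, threshold, n=2):
    """Insert a boundary wherever left + right branching entropy
    exceeds the experimentally determined threshold."""
    cuts = [i for i in range(1, len(word))
            if entropy(right.get(word[max(0, i - n):i], Counter()))
             + entropy(left.get(word[i:i + n], Counter())) > threshold]
    parts, prev = [], 0
    for cut in cuts + [len(word)]:
        parts.append(word[prev:cut])
        prev = cut
    return "-".join(parts)

# Toy corpus for illustration only; the real models were trained on the NCHLT corpora.
corpus = ["ngezinkonzo", "nezindlela", "ngabantu", "nabantu", "ngezindlu"]
right, left = branching_counts(corpus)
segmented = segment("ngezinkonzo", right, left, threshold=0.5)
```

Being unsupervised, the model never sees gold segmentations; the threshold is the only tuned quantity, which is why its choice was determined experimentally.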

Language Precision (%) Recall (%) F1 Score (%)
isiNdebele 20.60 21.36 20.97
isiXhosa 27.21 29.04 28.10
isiZulu 20.37 23.19 21.69
siSwati 44.05 36.67 40.02

Results for Morfessor-Baseline on all four languages

Language Precision (%) Recall (%) F1 Score (%)
isiNdebele 19.39 56.44 28.87
isiXhosa 18.18 43.08 25.56
isiZulu 18.75 46.55 26.74
siSwati 58.85 36.22 44.85

Results for Entropy-based model with constant objective function on all four languages

Discussion

As can be seen, of all the models implemented, the CRF delivered the best performance on the tasks of surface segmentation and surface labelling. On the task of canonical segmentation, the sequence-to-sequence model delivered the best performance. Unfortunately, the entropy-based models did not score as well as the others on the given metrics, scoring around 40% lower than the other models; this could be due to insufficient training data, given the unsupervised nature of these models.
From this we can conclude that, of the models implemented, the CRF is best suited to the task of surface segmentation and the sequence-to-sequence model is best suited to the task of canonical segmentation.

Deliverables

Literature Reviews

Final Report

Project Proposal

Project Poster