Morphological segmentation is a linguistic operation in which words are separated into their component morphemes. Morphemes are the smallest units of language that carry meaning on their own. The operation is useful because it allows words to be studied at a granular level. We aim to perform it using machine learning, both supervised and unsupervised, and implemented three families of models for the task: Conditional Random Fields (CRFs), sequence-to-sequence (Seq2Seq) models, and Morfessor-based models. Low-resource languages are languages for which text data, and consequently Natural Language Processing applications, are scarce. Where such applications do exist they tend to have low accuracy, and it is our hope that this project and the models that come from it will help in the creation of higher-accuracy applications. We focused on the Nguni family of languages, namely isiNdebele, isiXhosa, isiZulu and siSwati. These languages are all agglutinative, meaning their words are formed by concatenating morphemes.
There are two ways in which words can be segmented: surface segmentation and canonical segmentation. Under surface segmentation, a word is split into a sequence of substrings which, when concatenated, reproduce the original word. Under canonical segmentation, the word is analysed and split into a sequence of canonical morphemes, where each canonical morpheme corresponds to a surface morpheme as its orthographic representation. See below for an example.
Word: attainability
Surface Segmentation: attain-abil-ity
Canonical Segmentation: attain-able-ity
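To make the distinction concrete, the small check below (plain Python, hard-coding the example above) captures the invariant that separates the two styles:

```python
word = "attainability"
surface = ["attain", "abil", "ity"]
canonical = ["attain", "able", "ity"]

# Surface segments must concatenate back to the original word exactly...
assert "".join(surface) == word

# ...while canonical segments restore the underlying morphemes
# (abil -> able), so in general they do not.
assert "".join(canonical) != word
```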
After the segments have been predicted, the output can be further divided by whether or not the segments are labelled. If they are, a label is predicted for each segment, corresponding to the grammatical function of that morpheme within the word as a whole.
The goal of this project is to determine whether the models mentioned above can be successfully applied to the task of morphological segmentation. We measure their success through three metrics: precision, recall and F1 score. Additionally, we aim to do a high-level performance comparison of the models and their implementations to determine which is best suited to the task.
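As context for the scores reported later, here is a sketch of boundary-level evaluation, one common way precision, recall and F1 are computed for segmentation. The boundary-position convention used here is an assumption, not necessarily the project's exact scheme:

```python
def boundaries(segments):
    """Split positions implied by a segmentation,
    e.g. ['attain', 'abil', 'ity'] -> {6, 10}."""
    positions, offset = set(), 0
    for seg in segments[:-1]:
        offset += len(seg)
        positions.add(offset)
    return positions

def evaluate(predicted, gold):
    """Micro-averaged precision/recall/F1 over a list of word pairs."""
    tp = fp = fn = 0
    for pred_segs, gold_segs in zip(predicted, gold):
        p, g = boundaries(pred_segs), boundaries(gold_segs)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One of two predicted boundaries is correct: P = R = F1 = 0.5.
print(evaluate([["attain", "abili", "ty"]], [["attain", "abil", "ity"]]))
```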
The data we used to train and test our models came from the National Centre for Human Language Technology (NCHLT). It comprised several text corpora on which we needed to perform some cleaning operations before they were ready for use. The data we received came in the following form:
ngezinkonzo khonzo P (RelConc)-nga[NPre]-i(NPrePre)-zin[BPre]-konzo[NStem]
Where the first element is the word itself, the second is the root morpheme, the third is the part of speech of the word, and the fourth is the labelled canonical segmentation of the word.
After multiple cleaning operations we were able to convert the data to the following form:
ngezinkonzo nge-zin-konzo nge[NPre]zin[BPre]konzo[NStem] nga[NPre]i[NPrePre]zin[BPre]konzo[NStem]
Where the first element is the word, the second is its surface segmentation, the third is the labelled surface segmentation, and the fourth is the cleaned, labelled canonical segmentation.
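The sketch below illustrates one such cleaning step on the raw example above, assuming the raw analysis separates morphemes with '-' and mixes (...) and [...] label brackets; the project's full pipeline was more involved (deriving the surface form, for instance, requires aligning canonical morphemes against the word, which is not shown here):

```python
import re

raw = "ngezinkonzo khonzo P (RelConc)-nga[NPre]-i(NPrePre)-zin[BPre]-konzo[NStem]"
word, root, pos, analysis = raw.split()

# Normalise (...) labels to [...] so all labels share one bracket style.
analysis = re.sub(r"\(([^)]+)\)", r"[\1]", analysis)

# Split on '-' and drop bare labels with no morpheme in front of them
# (e.g. the leading '[RelConc]' tag, which labels the whole word).
parts = [p for p in analysis.split("-") if not p.startswith("[")]

canonical_labelled = "".join(parts)
canonical_segments = [re.match(r"[^\[]+", p).group() for p in parts]

print(canonical_labelled)            # nga[NPre]i[NPrePre]zin[BPre]konzo[NStem]
print("-".join(canonical_segments))  # nga-i-zin-konzo
```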
For the sake of brevity, we show only the best results of each model; readers who wish to see more can find them in our deliverables.
Language | Precision (%) | Recall (%) | F1 Score (%) |
---|---|---|---|
isiNdebele | 97.94 | 96.62 | 97.27 |
isiXhosa | 97.16 | 97.13 | 97.14 |
isiZulu | 97.88 | 96.82 | 97.35 |
siSwati | 97.17 | 96.40 | 96.78 |
Average Surface Segmentation Results
Average Labelling of Correct Segments Results
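Surface segmentation results like these come from treating segmentation as character-level sequence labelling, which is the framing our CRF uses. Below is a minimal sketch using the sklearn-crfsuite library; the character-window feature template and the B/M/E/S label scheme are illustrative assumptions rather than the project's exact configuration:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def char_features(word, i):
    """Features for character i: the character itself plus a small window."""
    feats = {"bias": 1.0, "char": word[i]}
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(word):
            feats[f"char[{offset:+d}]"] = word[j]
    return feats

def word_to_features(word):
    return [char_features(word, i) for i in range(len(word))]

def segments_to_labels(segments):
    """Tag each character B/M/E/S by its position within its segment."""
    labels = []
    for seg in segments:
        if len(seg) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(seg) - 2) + ["E"])
    return labels

# Toy training pair taken from the cleaned data format shown earlier.
train = [("ngezinkonzo", ["nge", "zin", "konzo"])]
X = [word_to_features(w) for w, _ in train]
y = [segments_to_labels(segs) for _, segs in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict([word_to_features("ngezinkonzo")]))
```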
In the initial evaluation, Morfessor-Baseline performed marginally better than the best entropy-based model. That entropy-based model used an objective function that places a morpheme boundary at a position in a word whenever the sum of the left and right entropy at that position exceeds an experimentally determined constant. After a few adjustments, the entropy-based model outperformed Morfessor-Baseline, achieving an average F1 score of 31.51% against Morfessor's 27.69%. The following tables give more detail on Morfessor-Baseline's and the entropy-based model's performance, respectively.
Results for Morfessor-Baseline on all four languages
Results for Entropy-based model with constant objective function on all four languages
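A hedged sketch of the boundary rule described above follows: split wherever left entropy plus right entropy exceeds a tuned constant. The single-character conditioning context and the threshold value are illustrative simplifications, not the project's tuned settings:

```python
import math
from collections import Counter, defaultdict

def branching_entropies(corpus):
    """Entropy of the next (right) and previous (left) character given
    the current character, estimated from a list of words."""
    right, left = defaultdict(Counter), defaultdict(Counter)
    for word in corpus:
        w = f"#{word}#"  # word-boundary sentinels
        for a, b in zip(w, w[1:]):
            right[a][b] += 1
            left[b][a] += 1

    def entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    return ({k: entropy(v) for k, v in right.items()},
            {k: entropy(v) for k, v in left.items()})

def segment(word, h_right, h_left, threshold):
    """Split after position i when the right entropy at word[i] plus the
    left entropy at word[i + 1] exceeds the threshold."""
    segs, start = [], 0
    for i in range(len(word) - 1):
        score = h_right.get(word[i], 0.0) + h_left.get(word[i + 1], 0.0)
        if score > threshold:
            segs.append(word[start:i + 1])
            start = i + 1
    segs.append(word[start:])
    return segs

corpus = ["ngezinkonzo", "zinkonzo", "konzo", "ngesikhathi"]
h_right, h_left = branching_entropies(corpus)
print(segment("ngezinkonzo", h_right, h_left, threshold=2.0))
```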
Turning to the two implemented Seq2Seq models, the Transformer proved the most promising, with the Bi-LSTM+Attention following it. The Transformer achieved an average F1 score of 72.54%, an 11.95% improvement over the standard LSTM baseline; the Bi-LSTM+Attention achieved an average F1 score of 65.41%, a 4.82% improvement over the baseline. The Transformer showed the best results across all languages and reached a satisfactory accuracy at segmenting words. Its F1 scores also had the lowest standard deviation across the Nguni languages, demonstrating its ability to generalise across multiple languages. The key takeaway is that the recurrent nature of the Bi-LSTM is not necessary for identifying morpheme boundaries, which means the Transformer model has a promising future in morphological segmentation.
Results for Bi-LSTM + Attention model across all four languages
Results for Transformer model across all four languages
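To make the Seq2Seq framing concrete, here is a compact PyTorch sketch of a character-level Transformer that maps a word to its hyphen-separated canonical form. The hyperparameters, learned positional encoding, and toy vocabulary handling are all illustrative assumptions, not the project's configuration:

```python
import torch
import torch.nn as nn

class CharTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(512, d_model)  # learned positional encoding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        src_emb = self.embed(src) + self.pos(torch.arange(src.size(1)))
        tgt_emb = self.embed(tgt) + self.pos(torch.arange(tgt.size(1)))
        # Causal mask: each decoder position may only attend to its left.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src_emb, tgt_emb, tgt_mask=mask)
        return self.out(hidden)  # per-position scores over the vocabulary

# Toy forward pass: word in, hyphen-separated canonical form as target.
src_text, tgt_text = "ngezinkonzo", "nga-i-zin-konzo"
vocab = sorted(set(src_text + tgt_text)) + ["<s>"]
stoi = {c: i for i, c in enumerate(vocab)}
src = torch.tensor([[stoi[c] for c in src_text]])
tgt = torch.tensor([[stoi["<s>"]] + [stoi[c] for c in tgt_text]])
model = CharTransformer(vocab_size=len(vocab))
logits = model(src, tgt)
print(logits.shape)  # (1, target length, vocab size)
```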
As can be seen, of all the models implemented the CRF delivered the best performance on surface segmentation and surface labelling, while the sequence-to-sequence models delivered the best performance on canonical segmentation. Unfortunately, the entropy-based models scored roughly 40% lower than the others on the metrics used, which could be due to insufficient training data, a limitation to which unsupervised models are particularly sensitive.
From this we conclude that, of the models implemented, the CRF is best suited to the task of surface segmentation and the sequence-to-sequence model is best suited to the task of canonical segmentation.