Morphological segmentation is a linguistic operation in which words are separated into their component morphemes. Morphemes are the smallest units of language that carry meaning on their own. The operation is useful because it allows words to be studied at a granular level. We aim to perform it using machine learning, both supervised and unsupervised, and implemented three families of models for the task: Conditional Random Fields (CRFs), sequence-to-sequence (Seq2Seq) models, and Morfessor-based models. Low-resource languages are languages for which text data, and consequently Natural Language Processing applications, are scarce. Where such applications do exist they tend to have low accuracy, and it is our hope that this project and the models that come from it will help in the creation of higher-accuracy applications. We focused on the Nguni family of languages, namely isiNdebele, isiXhosa, isiZulu and siSwati. These languages are all agglutinative, meaning their words are formed by concatenating morphemes.
There are two ways in which words can be segmented: surface segmentation and canonical segmentation. Under surface segmentation, a word is split into a sequence of substrings which, when concatenated, reproduce the original word. Under canonical segmentation, the word is analysed and split into a sequence of canonical morphemes, where each canonical morpheme corresponds to a surface morpheme as its orthographic representation. See below for an example.
Word: attainability
Surface Segmentation: attain-abil-ity
Canonical Segmentation: attain-able-ity
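To make the distinction concrete, the small check below (plain Python, hard-coding the example above) captures the invariant that separates the two styles:

```python
word = "attainability"
surface = ["attain", "abil", "ity"]
canonical = ["attain", "able", "ity"]

# Surface segments must concatenate back to the original word exactly...
assert "".join(surface) == word

# ...while canonical segments restore the underlying morphemes
# (abil -> able), so in general they do not.
assert "".join(canonical) != word
```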
After the segments have been predicted, the output can be further divided by whether or not the segments are labelled. If they are, a label is predicted for each segment, corresponding to the grammatical function of that morpheme within the word as a whole.
The goal of this project is to determine whether the models mentioned above can be successfully applied to the task of morphological segmentation. We measure their success through three metrics: precision, recall and F1 score. Additionally, we aim to do a high-level performance comparison of the models and their implementations to determine which is best suited to the task.
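As context for the scores reported later, here is a sketch of boundary-level evaluation, one common way precision, recall and F1 are computed for segmentation. The boundary-position convention used here is an assumption, not necessarily the project's exact scheme:

```python
def boundaries(segments):
    """Split positions implied by a segmentation,
    e.g. ['attain', 'abil', 'ity'] -> {6, 10}."""
    positions, offset = set(), 0
    for seg in segments[:-1]:
        offset += len(seg)
        positions.add(offset)
    return positions

def evaluate(predicted, gold):
    """Micro-averaged precision/recall/F1 over a list of word pairs."""
    tp = fp = fn = 0
    for pred_segs, gold_segs in zip(predicted, gold):
        p, g = boundaries(pred_segs), boundaries(gold_segs)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One of two predicted boundaries is correct: P = R = F1 = 0.5.
print(evaluate([["attain", "abili", "ty"]], [["attain", "abil", "ity"]]))
```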
The data we used to train and test our models came from the National Centre for Human Language Technology (NCHLT). It comprised several text corpora on which we needed to perform some cleaning operations before they were ready for use. The data we received came in the following form:
ngezinkonzo khonzo P (RelConc)-nga[NPre]-i(NPrePre)-zin[BPre]-konzo[NStem]
Where the first element is the word itself, the second is the root morpheme, the third is the part of speech of the word, and the fourth is the labelled canonical segmentation of the word.
After multiple cleaning operations we were able to convert the data to the following form:
ngezinkonzo nge-zin-konzo nge[NPre]zin[BPre]konzo[NStem] nga[NPre]i[NPrePre]zin[BPre]konzo[NStem]
Where the first element is the word, the second is its surface segmentation, the third is the labelled surface segmentation, and the fourth is the cleaned, labelled canonical segmentation.
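The sketch below illustrates one such cleaning step on the raw example above, assuming the raw analysis separates morphemes with '-' and mixes (...) and [...] label brackets; the project's full pipeline was more involved (deriving the surface form, for instance, requires aligning canonical morphemes against the word, which is not shown here):

```python
import re

raw = "ngezinkonzo khonzo P (RelConc)-nga[NPre]-i(NPrePre)-zin[BPre]-konzo[NStem]"
word, root, pos, analysis = raw.split()

# Normalise (...) labels to [...] so all labels share one bracket style.
analysis = re.sub(r"\(([^)]+)\)", r"[\1]", analysis)

# Split on '-' and drop bare labels with no morpheme in front of them
# (e.g. the leading '[RelConc]' tag, which labels the whole word).
parts = [p for p in analysis.split("-") if not p.startswith("[")]

canonical_labelled = "".join(parts)
canonical_segments = [re.match(r"[^\[]+", p).group() for p in parts]

print(canonical_labelled)            # nga[NPre]i[NPrePre]zin[BPre]konzo[NStem]
print("-".join(canonical_segments))  # nga-i-zin-konzo
```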
For the sake of brevity, we show only the best results of each model; readers who wish to see more can find them in our deliverables.
Language | Precision (%) | Recall (%) | F1 Score (%) |
---|---|---|---|
isiNdebele | 97.94 | 96.62 | 97.27 |
isiXhosa | 97.16 | 97.13 | 97.14 |
isiZulu | 97.88 | 96.82 | 97.35 |
siSwati | 97.17 | 96.40 | 96.78 |
Average Surface Segmentation Results
Average Labelling of Correct Segments Results
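Surface segmentation results like these come from treating segmentation as character-level sequence labelling, which is the framing our CRF uses. Below is a minimal sketch using the sklearn-crfsuite library; the character-window feature template and the B/M/E/S label scheme are illustrative assumptions rather than the project's exact configuration:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def char_features(word, i):
    """Features for character i: the character itself plus a small window."""
    feats = {"bias": 1.0, "char": word[i]}
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(word):
            feats[f"char[{offset:+d}]"] = word[j]
    return feats

def word_to_features(word):
    return [char_features(word, i) for i in range(len(word))]

def segments_to_labels(segments):
    """Tag each character B/M/E/S by its position within its segment."""
    labels = []
    for seg in segments:
        if len(seg) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(seg) - 2) + ["E"])
    return labels

# Toy training pair taken from the cleaned data format shown earlier.
train = [("ngezinkonzo", ["nge", "zin", "konzo"])]
X = [word_to_features(w) for w, _ in train]
y = [segments_to_labels(segs) for _, segs in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict([word_to_features("ngezinkonzo")]))
```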
In the initial evaluation, Morfessor-Baseline performed marginally better than the best entropy-based model. That entropy-based model used an objective function that places a morpheme boundary at a position in a word whenever the sum of the left and right entropy at that position exceeds an experimentally determined constant. After a few adjustments, the entropy-based model outperformed Morfessor-Baseline, achieving an average F1 score of 31.51% against Morfessor's 27.69%. The following tables give more detail on Morfessor-Baseline's and the entropy-based model's performance, respectively.
Results for Morfessor-Baseline on all four languages
Results for Entropy-based model with constant objective function on all four languages
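A hedged sketch of the boundary rule described above follows: split wherever left entropy plus right entropy exceeds a tuned constant. The single-character conditioning context and the threshold value are illustrative simplifications, not the project's tuned settings:

```python
import math
from collections import Counter, defaultdict

def branching_entropies(corpus):
    """Entropy of the next (right) and previous (left) character given
    the current character, estimated from a list of words."""
    right, left = defaultdict(Counter), defaultdict(Counter)
    for word in corpus:
        w = f"#{word}#"  # word-boundary sentinels
        for a, b in zip(w, w[1:]):
            right[a][b] += 1
            left[b][a] += 1

    def entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    return ({k: entropy(v) for k, v in right.items()},
            {k: entropy(v) for k, v in left.items()})

def segment(word, h_right, h_left, threshold):
    """Split after position i when the right entropy at word[i] plus the
    left entropy at word[i + 1] exceeds the threshold."""
    segs, start = [], 0
    for i in range(len(word) - 1):
        score = h_right.get(word[i], 0.0) + h_left.get(word[i + 1], 0.0)
        if score > threshold:
            segs.append(word[start:i + 1])
            start = i + 1
    segs.append(word[start:])
    return segs

corpus = ["ngezinkonzo", "zinkonzo", "konzo", "ngesikhathi"]
h_right, h_left = branching_entropies(corpus)
print(segment("ngezinkonzo", h_right, h_left, threshold=2.0))
```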
Turning to the two implemented Seq2Seq models, the Transformer proved the most promising, with the Bi-LSTM+Attention following it. The Transformer achieved an average F1 score of 72.54%, an 11.95% improvement over the standard LSTM baseline; the Bi-LSTM+Attention achieved an average F1 score of 65.41%, a 4.82% improvement over the baseline. The Transformer showed the best results across all languages and reached a satisfactory accuracy at segmenting words. Its F1 scores also had the lowest standard deviation across the Nguni languages, demonstrating its ability to generalise across multiple languages. The key takeaway is that the recurrent nature of the Bi-LSTM is not necessary for identifying morpheme boundaries, which means the Transformer model has a promising future in morphological segmentation.
Results for Bi-LSTM + Attention model across all four languages
Results for Transformer model across all four languages
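To make the Seq2Seq framing concrete, here is a compact PyTorch sketch of a character-level Transformer that maps a word to its hyphen-separated canonical form. The hyperparameters, learned positional encoding, and toy vocabulary handling are all illustrative assumptions, not the project's configuration:

```python
import torch
import torch.nn as nn

class CharTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(512, d_model)  # learned positional encoding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        src_emb = self.embed(src) + self.pos(torch.arange(src.size(1)))
        tgt_emb = self.embed(tgt) + self.pos(torch.arange(tgt.size(1)))
        # Causal mask: each decoder position may only attend to its left.
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src_emb, tgt_emb, tgt_mask=mask)
        return self.out(hidden)  # per-position scores over the vocabulary

# Toy forward pass: word in, hyphen-separated canonical form as target.
src_text, tgt_text = "ngezinkonzo", "nga-i-zin-konzo"
vocab = sorted(set(src_text + tgt_text)) + ["<s>"]
stoi = {c: i for i, c in enumerate(vocab)}
src = torch.tensor([[stoi[c] for c in src_text]])
tgt = torch.tensor([[stoi["<s>"]] + [stoi[c] for c in tgt_text]])
model = CharTransformer(vocab_size=len(vocab))
logits = model(src, tgt)
print(logits.shape)  # (1, target length, vocab size)
```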
As can be seen, of all the models implemented the CRF delivered the best performance on surface segmentation and surface labelling, while the sequence-to-sequence models delivered the best performance on canonical segmentation. Unfortunately, the entropy-based models scored roughly 40% lower than the others on the metrics used, which could be due to insufficient training data, a limitation to which unsupervised models are particularly sensitive.
From this we conclude that, of the models implemented, the CRF is best suited to the task of surface segmentation and the sequence-to-sequence model is best suited to the task of canonical segmentation.