SABC2TXT

Background

In South Africa, several African languages - nine of the eleven official languages - lack electronic linguistic resources such as documents or books. This lack of resources negatively impacts computational and statistical systems, such as language models, that rely on them. High-quality text corpora could facilitate text-based research, and one way to produce high-quality text in these languages is to transcribe audio that is already available in them. This transcription could be done with Automatic Speech Recognition (ASR), an automated method that decodes speech captured, for example, through a microphone, analyses the data with a pattern, model, or algorithm, and generates an output, usually text.

The project aims to evaluate whether structured and unstructured audio can be transcribed automatically, using standard speech tools and models, to create a high-quality text corpus. The audio was transcribed with the CMUSphinx speech recognition toolkit (PocketSphinx and Sphinx4), which has been shown to provide the best results for the transcription of isiXhosa. The data used to train the toolkit's acoustic and language models was obtained from the publicly available South African Centre for Digital Language Resources (SADiLaR) website.

The project is split into three sections. The first section involves transcribing structured audio, such as SABC broadcast news that is publicly available on YouTube. These recordings are treated as a structured environment because the speech is formal and read from a script. This audio will be used to evaluate a transcription system that will, if successful, transcribe broadcast news recordings. These transcriptions can be used for further text-based research and to increase the number of electronic documents available in low-resource languages.

The second section evaluates how well mobile devices can be used to transcribe unstructured audio, such as a casual conversation between two people in a noisy environment. Mobile device transcription is important because it can aid in the creation of electronic resources for low-resource South African languages. The percentage of people in sub-Saharan Africa who use mobile phones has increased significantly over the past ten years, reaching 60% of the total population. This means that speakers of low-resource languages can capture and transcribe audio using their mobile devices, even if they do not have access to more expensive equipment.

Lastly, a gold standard corpus will be created to evaluate the accuracy of the transcribed audio. Gold standard corpora are manually annotated collections of text used as dependable sources of information about languages. They are necessary for the training and meaningful evaluation of algorithms. If the project succeeds, it will show that text corpora, which are a primary source of data for natural language processing, can be developed accurately using standard speech tools and models.
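
To make the idea of evaluating a transcription against a gold standard concrete, the sketch below computes word error rate (WER), a commonly used ASR accuracy metric. The project description does not state which metric is used, so treating WER as the measure is an assumption, and the function name and example strings are purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate of `hypothesis` against `reference`.

    WER = (substitutions + deletions + insertions) / number of reference words,
    computed here as the word-level edit (Levenshtein) distance.
    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical gold standard line and ASR output (placeholder tokens).
    gold = "a b c d"
    asr_output = "a x c"
    print(word_error_rate(gold, asr_output))  # one substitution + one deletion over 4 words
```

A lower value means the automatic transcription is closer to the manually produced gold standard; a value of 0.0 means the two match word for word.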

Research Questions

More details on the individual sections

Gold Standard ASR Corpus

Can crowdsourcing via a web application effectively produce a high-quality gold standard Automatic Speech Recognition corpus for isiXhosa from audio data collected from casual conversations and broadcast news?