Available NLP Datasets
In this section, we document speech, machine translation, and other NLP datasets that can be used to advance NLP capabilities for this language. The datasets are grouped by major NLP task: speech datasets for Automatic Speech Recognition (ASR), machine translation datasets, and datasets for other NLP tasks such as language modelling and text summarization.
Speech
A speech dataset for Chichewa. This dataset was collected as part of the Zindi-Google NLP competition, in which I participated. The data consists of variable-length audio clips in Chichewa with accompanying transcriptions. See google-asr-hack-series-africa-asr-data-challenge for details of the competition.
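Since the clips vary in length, a useful first step before any ASR work is to pair each clip with its transcription and check the duration of each clip. The sketch below shows one way this might look. It rests on assumptions that are not confirmed by the dataset itself: that the clips are WAV files, and that a CSV manifest with `clip` and `transcription` columns lists them. The function names `clip_duration_seconds` and `load_manifest` are hypothetical and only for illustration.

```python
import csv
import wave
from pathlib import Path


def clip_duration_seconds(wav_path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def load_manifest(manifest_path, audio_dir):
    """Pair each clip listed in a CSV manifest with its transcription.

    Assumption: the manifest has 'clip' and 'transcription' columns.
    The real dataset may use different column names or a different layout.
    """
    audio_dir = Path(audio_dir)
    examples = []
    with open(manifest_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            path = audio_dir / row["clip"]
            examples.append(
                {
                    "path": path,
                    "text": row["transcription"],
                    "seconds": clip_duration_seconds(path),
                }
            )
    return examples
```

Sorting or bucketing the returned examples by `seconds` is a common way to batch clips of similar length together. If the clips turn out to be in a compressed format such as MP3, the standard-library `wave` module will not read them and an audio library would be needed instead.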
Machine translation
I haven't seen any machine translation corpus yet. However, I'm building the following datasets, which are not yet ready for public release, but I used them in benchmarking Google translation.