This repo aggregates audio/speech corpora for Yorùbá tasks, much as yoruba-text does for text datasets. The corpora may include aligned text or be entirely unlabeled.

The objective is to provide a bird's-eye view of the available Yorùbá audio, along with its metadata and entropy, to inform further data collection and modeling. For example, given a large broadcast news corpus, we might train a self-supervised model on a pretext task to generate speech embeddings for use in ASR/TTS work.

| Name | Size in HH:MM:SS | Transcribed | Segmented in utterances | Aligned | Source |
|---|---|---|---|---|---|
| Lagos-NWU | 02:45:17 | ✔️ | ✔️ | ✔️ | North-West University |
| OpenSLR86 | 04:01:31 | ✔️ | ✔️ | ✔️ | OpenSLR, Google |
| Bíbélì Mímọ́ (NIV) | 93:38:15 | ✔️ | | | Biblica Open Bible |
| Bíbélì Mímọ́ (KJV) | | ✔️ | | | Bible.is |
| Colloquial Yorùbá | 02:32:29 | ✔️ | | | Audio files, Textbook |
| OrisunTV Broadcast News | 81:49:29 | | | | YouTube |
| VoxLingua107 | 94:02:45 | | ✔️ | | post-filtered from YouTube |
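
To get a rough sense of how much audio the table covers, the listed durations can be summed. The snippet below is a minimal sketch, not part of this repo's tooling: the helper names `to_seconds` and `to_hms` are hypothetical, the durations are copied by hand from the table above, and Bíbélì Mímọ́ (KJV) is left out because no size is listed for it.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as a zero-padded 'HH:MM:SS' string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Durations copied from the table; the KJV row has no listed size.
DURATIONS = {
    "Lagos-NWU": "02:45:17",
    "OpenSLR86": "04:01:31",
    "Bíbélì Mímọ́ (NIV)": "93:38:15",
    "Colloquial Yorùbá": "02:32:29",
    "OrisunTV Broadcast News": "81:49:29",
    "VoxLingua107": "94:02:45",
}

total = sum(to_seconds(value) for value in DURATIONS.values())
print("Total listed audio:", to_hms(total))
```

The same helpers could be used to total only the transcribed or only the unlabeled corpora by filtering the dictionary before summing.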