Prepare partially diacritized input dataset  #1

@ruohoruotsi

To make it easier to normalize Yorùbá Wikipedia articles, create a partially diacritized dataset that keeps only the marks below the letters (the under-dots on ẹ, ọ, and ṣ) and omits the tonal marks above the vowels.

The dataset can be used in the following ways:

  1. Train a model that takes partially diacritized text as input (i.e. sentences with correct lower marks) and produces the corresponding fully diacritized sentences as output. I believe this will give better accuracy than what we already have. If it gives very high accuracy, we can then consider:
  2. Training a model that takes non-diacritized text as input and outputs partially diacritized text, then feeding that output to the model from step 1, i.e. [non-diacritized text] ====> [partially diacritized text] ====> [fully diacritized text]
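
As a rough sketch of how the three text variants could be derived from existing fully diacritized sentences, the Python below removes combining marks after Unicode decomposition. The function names (`to_partial`, `to_plain`, `make_examples`) are hypothetical and not part of any existing code in this repository. It assumes tones are written with the grave, acute, and macron marks and that the lower mark is either a dot below (U+0323) or a vertical line below (U+0329).

```python
import unicodedata

# Assumption: tones are written with these combining marks above the letter.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}  # grave (low), acute (high), macron (mid)
# Assumption: the lower mark is a dot below or a vertical line below.
UNDER_MARKS = {"\u0323", "\u0329"}


def strip_marks(text, marks):
    """Remove the given combining marks from text and recompose the result."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in marks)
    return unicodedata.normalize("NFC", kept)


def to_partial(text):
    """Drop tonal marks but keep the lower marks (e.g. on ẹ, ọ, ṣ)."""
    return strip_marks(text, TONE_MARKS)


def to_plain(text):
    """Drop both tonal marks and lower marks."""
    return strip_marks(text, TONE_MARKS | UNDER_MARKS)


def make_examples(full_sentences):
    """Build (non-diacritized, partially diacritized, fully diacritized) triples."""
    return [(to_plain(s), to_partial(s), s) for s in full_sentences]
```

Each triple supplies training pairs for both steps: (partial, full) for step 1 and (plain, partial) for step 2.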

Motivation:
From my observation of how Yorùbá text is written, the majority of people, especially young people, don't know the tonal marks (high, mid, and low) that go above the vowel letters. Many people do, however, know how (and want to be able) to distinguish between symbols with and without the lower mark, e.g. E vs Ẹ, O vs Ọ, and S vs Ṣ, especially now that Google Gboard is available on Android phones.

Labels: enhancement (New feature or request)