Open
Labels: enhancement (New feature or request)
Description
To make it easier to normalize Yorùbá Wikipedia articles, create a partially diacritized dataset that keeps only the lower diacritic marks (the dots below letters such as ẹ and ọ) and omits the tonal marks above the vowels.
The dataset can be used in the following ways:
- Train a model that takes partially diacritized text as input, i.e. sentences with the correct lower marks, and outputs the corresponding fully diacritized sentences. I believe this will give better accuracy than what we already have. If it gives very high accuracy, we can then consider the next step:
- Train a model that takes non-diacritized text and outputs partially diacritized text, then use that output as input for producing the fully diacritized text, i.e. [non-diacritized text] ====> [partially diacritized text] ====> [fully diacritized text]
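
One possible way to derive the partially diacritized and non-diacritized versions from existing fully diacritized sentences is sketched below. It is only an illustration, not the project's implementation: the function names (`strip_tone_marks`, `strip_all_marks`, `make_training_triple`) are hypothetical, and it assumes the tonal marks are encoded as the combining grave, acute, and macron accents and the lower mark as the combining dot below.

```python
import unicodedata

# Assumption: tones are written with these combining marks.
TONE_MARKS = {
    "\u0300",  # combining grave accent (low tone)
    "\u0301",  # combining acute accent (high tone)
    "\u0304",  # combining macron (mid tone, when written)
}


def strip_tone_marks(text: str) -> str:
    """Remove tonal marks but keep the dot below (partially diacritized)."""
    # NFD splits each letter into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recomposes base + dot below into a single code point where possible.
    return unicodedata.normalize("NFC", kept)


def strip_all_marks(text: str) -> str:
    """Remove every combining mark (non-diacritized)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def make_training_triple(full: str) -> tuple[str, str, str]:
    """Return (non-diacritized, partially diacritized, fully diacritized)."""
    return strip_all_marks(full), strip_tone_marks(full), full


if __name__ == "__main__":
    for sentence in ["Yorùbá", "Ẹ̀kọ́", "Ṣé"]:
        print(make_training_triple(sentence))
```

Each triple gives one training pair for the first step (partially diacritized to fully diacritized) and one for the second (non-diacritized to partially diacritized).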
Motivation:
From my observation of how Yorùbá text is written, the majority of people, especially young people, don't know the tonal marks (high, mid, and low) above the vowel letters. However, many people know how (and want to be able) to distinguish between symbols with and without the lower mark, e.g. E vs Ẹ, O vs Ọ, and S vs Ṣ, especially now that Google Gboard is available on Android phones.