Prepare partially diacritized input dataset 

To make it easier to normalize Yorùbá Wikipedia articles, create a partially diacritized dataset that keeps only the diacritic marks below the letters (the lower marks in Ẹ, Ọ, and Ṣ) and omits the tonal marks above the vowels.

The dataset can be used in the following ways:
1) Train a model that takes partially diacritized text (sentences with the correct lower marks) as input and produces the corresponding fully diacritized sentences as output. I believe this will give better accuracy than what we already have. If it gives very high accuracy, we can then consider option 2.
2) Train a model that takes non-diacritized text as input and outputs partially diacritized text, then use that output as input for producing fully diacritized text, i.e. [non-diacritized text] ====> [partially diacritized text] ====> [fully diacritized text]
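
One possible way to derive the inputs for both options from sentences that are already fully diacritized is sketched below. It is only an illustration under a few assumptions: the tonal marks are encoded as the Unicode combining grave (U+0300), acute (U+0301), and macron (U+0304), the lower mark is the combining dot below (U+0323), and the helper names `to_partial`, `to_plain`, and `make_pairs` are hypothetical, not part of any existing code in this repository.

```python
import unicodedata

# Assumption: tones are written with these combining marks above the letter.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}  # grave (low), acute (high), macron (mid)
# Assumption: the lower mark is the combining dot below.
DOT_BELOW = "\u0323"


def strip_marks(text, marks):
    """Remove the given combining marks from text, keeping all other characters."""
    # NFD splits each letter into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in marks)
    # NFC recombines what is left (e.g. o + dot below becomes a single character).
    return unicodedata.normalize("NFC", kept)


def to_partial(text):
    """Fully diacritized -> partially diacritized (tones removed, lower marks kept)."""
    return strip_marks(text, TONE_MARKS)


def to_plain(text):
    """Fully diacritized -> non-diacritized (tones and lower marks removed)."""
    return strip_marks(text, TONE_MARKS | {DOT_BELOW})


def make_pairs(full_sentences):
    """Build (input, output) training pairs for the two options described above."""
    partial_to_full = []   # option 1: partially diacritized -> fully diacritized
    plain_to_partial = []  # option 2, first step: non-diacritized -> partially diacritized
    for sentence in full_sentences:
        full = unicodedata.normalize("NFC", sentence)
        partial = to_partial(full)
        partial_to_full.append((partial, full))
        plain_to_partial.append((to_plain(full), partial))
    return {"partial_to_full": partial_to_full, "plain_to_partial": plain_to_partial}


if __name__ == "__main__":
    pairs = make_pairs(["Yor\u00f9b\u00e1"])
    print(pairs["partial_to_full"])   # [('Yoruba', 'Yorùbá')]
    print(pairs["plain_to_partial"])  # [('Yoruba', 'Yoruba')]
```

If the source text uses a different character for the lower mark, it would need to be added to the set of marks that `to_plain` removes.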

**Motivation**:
From my observation of how Yorùbá text is written, most people, especially young people, don't know the tonal marks (high, mid, and low) placed above the vowel letters. Many people do, however, know how (and want to be able) to distinguish between letters with and without the lower mark, e.g. E vs Ẹ, O vs Ọ, and S vs Ṣ, especially now that Google Gboard is available on Android phones.
