Consistency in orthography listing

As we prepare to implement shaperglot for testing African language support, I am noticing variation in the way language orthographies are recorded in gflang. Here are a few examples:

bas_Latn
```
exemplar_chars {
  base: "a á à â ǎ ā {a᷆}{a᷇} b ɓ c d e é è ê ě ē {e᷆}{e᷇} ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆}{ɛ᷇} f g h i í ì î ǐ ī {i᷆}{i᷇} j k l m n ń ǹ ŋ o ó ò ô ǒ ō {o᷆}{o᷇} ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆}{ɔ᷇} p r s t u ú ù û ǔ ū {u᷆}{u᷇} v w y z {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} {ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇} {a᷆} {a᷇} {e᷆} {e᷇} {ɛ́} {ɛ̀} {ɛ̂} {ɛ̌} {ɛ̄} {ɛ᷆} {ɛ᷇} {i᷆} {i᷇} {o᷆} {o᷇} {ɔ́} {ɔ̀} 
{ɔ̂} {ɔ̌} {ɔ̄} {ɔ᷆} {ɔ᷇} {u᷆} {u᷇}"
  auxiliary: "q x"
  marks: "◌̀ ◌́ ◌̂ ◌̄ ◌̌ ◌᷆ ◌᷇"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  index: "A B Ɓ C D E Ɛ F G H I J K L M N Ŋ O Ɔ P R S T U V W Y Z"
}
```

bin_Latn
```
exemplar_chars {
  base: "A B D E F G H I K L M N O P R S T U V W Y Z Á É È Ẹ Í Ó Ò Ọ Ú a b d e f g h i k l m n o p r s t u v w y z á é è ẹ í ó ò ọ ú \'"
  marks: "◌̀ ◌́ ◌̣"
}
```

af_Latn
```
exemplar_chars {
  base: "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z"
  auxiliary: "à å ä ã æ ç í ì ó ò ú ù ü ý"
  marks: "◌̀ ◌̂ ◌̈"
  numerals: "  - ‑ , % ‰ + 0 1 2 3 4 5 6 7 8 9"
  punctuation: "- ‐ ‑ – — , ; : ! ? . … \' ‘ ’ \" “ ” ( ) [ ] § @ * / & # † ‡ ′ ″"
  index: "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z"
}
```

The first inconsistency is that not all language profiles contain auxiliary bases when they should. When auxiliary bases include a mark, the marks list doesn't always include that mark.
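A minimal sketch of how the mark-coverage gap could be checked, assuming the three fields are available as plain strings. The function names are hypothetical and are not part of shaperglot or gflang; the approach simply decomposes every character with NFD and compares the combining marks found against the marks list.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"  # carrier glyph that precedes each entry in the marks field


def marks_in(text: str) -> set[str]:
    """Collect every combining mark in `text` after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return {ch for ch in decomposed if unicodedata.category(ch).startswith("M")}


def missing_marks(base: str, auxiliary: str, marks: str) -> set[str]:
    """Marks used by base or auxiliary characters but absent from the marks list."""
    used = marks_in(base) | marks_in(auxiliary)
    listed = marks_in(marks.replace(DOTTED_CIRCLE, ""))
    return used - listed


# af_Latn values copied from the entry quoted above.
af_base = "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z"
af_aux = "à å ä ã æ ç í ì ó ò ú ù ü ý"
af_marks = "\u25cc\u0300 \u25cc\u0302 \u25cc\u0308"  # grave, circumflex, diaeresis

missing = missing_marks(af_base, af_aux, af_marks)
print(sorted(f"U+{ord(m):04X}" for m in missing))
```

On the af_Latn entry as quoted, this reports the acute, tilde, ring above, and cedilla as used but not listed.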

The second big inconsistency is the inclusion of non-precomposed base/mark pairs in the base list. Sometimes these pairs are in the base list and sometimes they are not.
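To see which pairs in a list are non-precomposed, one option is to tokenize the exemplar string and keep the tokens that still contain a combining mark after NFC normalization. This is only a sketch: the helper names are hypothetical, and the tokenizer assumes the `{…}` cluster notation seen in the entries above, including clusters written without a separating space.

```python
import re
import unicodedata

# One exemplar token: a braced cluster such as {ɛ́}, or a bare run of characters.
TOKEN = re.compile(r"\{([^{}]+)\}|([^\s{}]+)")


def parse_exemplars(field: str) -> list[str]:
    """Split an exemplar string into tokens, tolerating missing spaces between braces."""
    return [m.group(1) or m.group(2) for m in TOKEN.finditer(field)]


def has_mark(token: str) -> bool:
    """True if the token contains at least one combining mark."""
    return any(unicodedata.category(ch).startswith("M") for ch in token)


def unencoded_pairs(field: str) -> list[str]:
    """Tokens that keep a combining mark after NFC, so NFC yields no precomposed form."""
    result = []
    for token in parse_exemplars(field):
        composed = unicodedata.normalize("NFC", token)
        if has_mark(composed):
            result.append(composed)
    return result


# "a\u0301" composes to U+00E1 under NFC; the clusters with U+1DC6/U+1DC7 do not compose.
sample = "a \u00e1 a\u0301 {a\u1dc6}{a\u1dc7} b {\u025b\u0301}"
print(unencoded_pairs(sample))
```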

For shaperglot to parse gflang properly and run its orthography tests, we need some consistency in how the exemplar character lists are constructed. For shaperglot's purposes, gflang should contain all necessary base/mark pairs, regardless of whether they can be precomposed. The variation appears to be caused by the incoming source data. (The bas_Latn entry reflects the data in CLDR, including the lack of spaces between certain bases.) Should we have a guideline that specifically spells out what needs to be included in bases, auxiliary, and marks?

Perhaps something like:
- bases: all primary characters of a language, including precomposed base/mark pairs and, where no precomposed character is encoded, non-precomposed base/mark pairs
- auxiliary: all secondary characters of a language, including precomposed base/mark pairs and, where no precomposed character is encoded, non-precomposed base/mark pairs
- marks: all standalone marks, whether they are primary or auxiliary
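As a rough illustration of what applying such a guideline could look like, the sketch below rewrites an entry so that tokens are NFC-normalized, duplicates are dropped, clusters are brace-wrapped with spaces between them, and the marks list is derived from the characters actually present. `normalize_entry` and its helpers are hypothetical names; the output layout is an assumption, not an agreed gflang format.

```python
import re
import unicodedata

TOKEN = re.compile(r"\{([^{}]+)\}|([^\s{}]+)")
DOTTED_CIRCLE = "\u25cc"


def tokens(field: str) -> list[str]:
    """Tokenize an exemplar string and compose each token where NFC allows it."""
    return [unicodedata.normalize("NFC", m.group(1) or m.group(2))
            for m in TOKEN.finditer(field)]


def render(token: str) -> str:
    """Single code points stay bare; multi-code-point clusters are wrapped in braces."""
    return token if len(token) == 1 else "{" + token + "}"


def normalize_entry(base: str, auxiliary: str = "") -> dict[str, str]:
    seen: set[str] = set()

    def dedupe(field: str) -> list[str]:
        kept = []
        for token in tokens(field):
            if token not in seen:  # also keeps auxiliary from repeating base items
                seen.add(token)
                kept.append(token)
        return kept

    base_tokens = dedupe(base)
    aux_tokens = dedupe(auxiliary)
    # Derive marks from everything listed, whether primary or auxiliary.
    marks = sorted({ch
                    for token in base_tokens + aux_tokens
                    for ch in unicodedata.normalize("NFD", token)
                    if unicodedata.category(ch).startswith("M")})
    return {
        "base": " ".join(render(t) for t in base_tokens),
        "auxiliary": " ".join(render(t) for t in aux_tokens),
        "marks": " ".join(DOTTED_CIRCLE + m for m in marks),
    }


entry = normalize_entry("a \u00e1 {a\u1dc6}{a\u1dc7} b {a\u1dc6} a", "q \u00e0 b")
for field, value in entry.items():
    print(f"{field}: {value}")
```

Deriving marks from the character lists, rather than maintaining them by hand, is one way to keep the three fields from drifting apart; whether marks should instead be curated independently is a question for the guideline itself.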
