Description
I have out-of-tree normalization benchmarks that use Wikipedia content and that take a rather long time to run.
So far, experience suggests that English and Greek normalization performance is particularly sensitive to compiler optimizations. Chinese makes sense to benchmark because it represents the case where just about every character is normalization-invariant and gets the default value upon trie lookup. Korean normalization differs from everything else due to the algorithmic nature of Hangul composition and decomposition. Kannada might make sense to benchmark, since it has backward-combining starters. It might also make sense to have a Vietnamese test, since Vietnamese has frequent double diacritics in Latin text.
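For context on why Korean is its own case: Hangul syllable decomposition is pure arithmetic rather than a trie lookup. This is not ICU4X's internal code, just a minimal sketch of the standard algorithm (constants per the Unicode Standard, §3.12):

```rust
/// Arithmetic Hangul syllable decomposition per the Unicode Standard, §3.12.
/// Returns (leading jamo, vowel jamo, optional trailing jamo) for a
/// precomposed syllable, or None if `c` is not a precomposed Hangul syllable.
fn decompose_hangul(c: char) -> Option<(char, char, Option<char>)> {
    const S_BASE: u32 = 0xAC00;
    const L_BASE: u32 = 0x1100;
    const V_BASE: u32 = 0x1161;
    const T_BASE: u32 = 0x11A7;
    const V_COUNT: u32 = 21;
    const T_COUNT: u32 = 28;
    const S_COUNT: u32 = 11172; // 19 * 21 * 28 precomposed syllables

    let s_index = (c as u32).checked_sub(S_BASE)?;
    if s_index >= S_COUNT {
        return None;
    }
    let l = char::from_u32(L_BASE + s_index / (V_COUNT * T_COUNT))?;
    let v = char::from_u32(V_BASE + (s_index % (V_COUNT * T_COUNT)) / T_COUNT)?;
    let t_index = s_index % T_COUNT;
    let t = if t_index == 0 {
        None
    } else {
        char::from_u32(T_BASE + t_index)
    };
    Some((l, v, t))
}
```

For example, U+D55C ('한') decomposes to U+1112, U+1161, U+11AB; composition runs the same arithmetic in reverse.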
To catch regressions, we should have a CI-run normalization benchmark with at least these cases:
- UTF-16 English NFC to NFC with input being at least 4 memory pages long.
- UTF-16 English NFD to NFD with input being at least 4 memory pages long.
- &str English NFC to NFC
- &str English NFD to NFD
- &str Greek NFC to NFC.
- &[u8] Greek NFC to NFC.
- UTF-16 Chinese NFC to NFC.
- Korean NFC to NFC.
- Korean NFD to NFC.
- Korean NFC to NFD.
- Kannada NFC to NFC (not sure about this one)
- Vietnamese orthographic (the form produced by the standard non-IME keyboard layout) to NFC
Not sure which UTF (UTF-8/&str or UTF-16) makes the most sense to bench in CI for the last five.
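For illustration, a minimal Criterion sketch of what a couple of these CI cases could look like, assuming icu_normalizer's ComposingNormalizer with the compiled-data constructor and a hypothetical English NFC fixture file; the fixture path, bench names, and constructor/method names would need to match whatever the repo actually uses:

```rust
// benches/normalization.rs (hypothetical location)
use criterion::{criterion_group, criterion_main, Criterion};
use icu_normalizer::ComposingNormalizer;
use std::hint::black_box;

// Hypothetical fixture: several pages of English Wikipedia text already in NFC.
static ENGLISH_NFC: &str = include_str!("data/english_nfc.txt");

fn nfc_benches(c: &mut Criterion) {
    // Assumes the compiled-data constructor; data-provider-based builds
    // would use the corresponding try_new_* constructor instead.
    let nfc = ComposingNormalizer::new_nfc();

    // &str English NFC to NFC.
    c.bench_function("nfc/en/str/nfc_to_nfc", |b| {
        b.iter(|| nfc.normalize(black_box(ENGLISH_NFC)))
    });

    // UTF-16 English NFC to NFC.
    let english_utf16: Vec<u16> = ENGLISH_NFC.encode_utf16().collect();
    c.bench_function("nfc/en/utf16/nfc_to_nfc", |b| {
        b.iter(|| nfc.normalize_utf16(black_box(&english_utf16)))
    });
}

criterion_group!(nfc, nfc_benches);
criterion_main!(nfc);
```

The remaining cases (NFD, Greek, Chinese, Korean, Kannada, Vietnamese) would follow the same pattern with their own fixtures and, for NFD, a DecomposingNormalizer.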