Description
I have out-of-tree normalization benchmarks that use Wikipedia content and that take a rather long time to run.
So far, experience suggests that English and Greek normalization performance is particularly sensitive to compiler optimizations. Chinese makes sense to benchmark because it represents the case where just about every character is normalization-invariant and gets the default value upon trie lookup. Korean normalization differs from everything else due to the algorithmic nature of Hangul composition and decomposition. Kannada might make sense to benchmark, since it has backward-combining starters. It might also make sense to have a Vietnamese test, since Vietnamese has frequent double diacritics in Latin text.
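For context on why Korean is its own case: Hangul syllable decomposition is pure arithmetic rather than a trie lookup. This is not ICU4X's internal code, just a minimal sketch of the standard algorithm (constants per the Unicode Standard, §3.12):

```rust
/// Arithmetic Hangul syllable decomposition per the Unicode Standard, §3.12.
/// Returns (leading jamo, vowel jamo, optional trailing jamo) for a
/// precomposed syllable, or None if `c` is not a precomposed Hangul syllable.
fn decompose_hangul(c: char) -> Option<(char, char, Option<char>)> {
    const S_BASE: u32 = 0xAC00;
    const L_BASE: u32 = 0x1100;
    const V_BASE: u32 = 0x1161;
    const T_BASE: u32 = 0x11A7;
    const V_COUNT: u32 = 21;
    const T_COUNT: u32 = 28;
    const S_COUNT: u32 = 11172; // 19 * 21 * 28 precomposed syllables

    let s_index = (c as u32).checked_sub(S_BASE)?;
    if s_index >= S_COUNT {
        return None;
    }
    let l = char::from_u32(L_BASE + s_index / (V_COUNT * T_COUNT))?;
    let v = char::from_u32(V_BASE + (s_index % (V_COUNT * T_COUNT)) / T_COUNT)?;
    let t_index = s_index % T_COUNT;
    let t = if t_index == 0 {
        None
    } else {
        char::from_u32(T_BASE + t_index)
    };
    Some((l, v, t))
}
```

For example, U+D55C ('한') decomposes to U+1112, U+1161, U+11AB; composition runs the same arithmetic in reverse.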
To catch regressions, we should have a CI-run normalization benchmark with at least these cases:
- UTF-16 English NFC to NFC with input being at least 4 memory pages long.
- UTF-16 English NFD to NFD with input being at least 4 memory pages long.
- &str English NFC to NFC
- &str English NFD to NFD
- &str Greek NFC to NFC.
- &[u8] Greek NFC to NFC.
- UTF-16 Chinese NFC to NFC.
- Korean NFC to NFC.
- Korean NFD to NFC.
- Korean NFC to NFD.
- Kannada NFC to NFC (not sure about this one)
- Vietnamese orthographic (the form produced by the standard non-IME keyboard layout) to NFC
Not sure which UTF (UTF-8/&str or UTF-16) makes the most sense to bench in CI for the last five.
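For illustration, a minimal Criterion sketch of what a couple of these CI cases could look like, assuming icu_normalizer's ComposingNormalizer with the compiled-data constructor and a hypothetical English NFC fixture file; the fixture path, bench names, and constructor/method names would need to match whatever the repo actually uses:

```rust
// benches/normalization.rs (hypothetical location)
use criterion::{criterion_group, criterion_main, Criterion};
use icu_normalizer::ComposingNormalizer;
use std::hint::black_box;

// Hypothetical fixture: several pages of English Wikipedia text already in NFC.
static ENGLISH_NFC: &str = include_str!("data/english_nfc.txt");

fn nfc_benches(c: &mut Criterion) {
    // Assumes the compiled-data constructor; data-provider-based builds
    // would use the corresponding try_new_* constructor instead.
    let nfc = ComposingNormalizer::new_nfc();

    // &str English NFC to NFC.
    c.bench_function("nfc/en/str/nfc_to_nfc", |b| {
        b.iter(|| nfc.normalize(black_box(ENGLISH_NFC)))
    });

    // UTF-16 English NFC to NFC.
    let english_utf16: Vec<u16> = ENGLISH_NFC.encode_utf16().collect();
    c.bench_function("nfc/en/utf16/nfc_to_nfc", |b| {
        b.iter(|| nfc.normalize_utf16(black_box(&english_utf16)))
    });
}

criterion_group!(nfc, nfc_benches);
criterion_main!(nfc);
```

The remaining cases (NFD, Greek, Chinese, Korean, Kannada, Vietnamese) would follow the same pattern with their own fixtures and, for NFD, a DecomposingNormalizer.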