Add in-tree normalization benchmarks #2444

Open
@hsivonen

I have out-of-tree normalization benchmarks that use Wikipedia content and that take a rather long time to run.

So far, experience suggests that English and Greek normalization performance are particularly sensitive to compiler optimizations. Chinese makes sense to benchmark, because it represents the case where just about every character is normalization-invariant and gets the default trie value upon trie lookup. Korean normalization differs from everything else due to the algorithmic nature of Hangul composition and decomposition. Kannada might make sense to benchmark, since it has backward-combining starters. It might also make sense to have a Vietnamese test, since Vietnamese has frequent double diacritics in Latin text.
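For readers unfamiliar with why Korean is its own case: Hangul syllables are composed and decomposed by arithmetic on code points rather than by data lookup. Below is a minimal sketch of that arithmetic, using the constants from the Unicode Standard's conjoining jamo behavior section. The function names are hypothetical and this is not the normalizer's actual code path, just an illustration of the computation.

```rust
// Constants from the Unicode Standard (Hangul syllable composition/decomposition).
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172

/// Hypothetical helper: splits a precomposed syllable into (L, V, optional T).
fn decompose_hangul(c: char) -> Option<(char, char, Option<char>)> {
    let s_index = (c as u32).checked_sub(S_BASE)?;
    if s_index >= S_COUNT {
        return None; // not a precomposed Hangul syllable
    }
    let l = char::from_u32(L_BASE + s_index / N_COUNT)?;
    let v = char::from_u32(V_BASE + (s_index % N_COUNT) / T_COUNT)?;
    let t_index = s_index % T_COUNT;
    let t = if t_index == 0 {
        None
    } else {
        char::from_u32(T_BASE + t_index)
    };
    Some((l, v, t))
}

/// Hypothetical helper: the inverse of `decompose_hangul`.
fn compose_hangul(l: char, v: char, t: Option<char>) -> Option<char> {
    let l_index = (l as u32).checked_sub(L_BASE).filter(|&i| i < L_COUNT)?;
    let v_index = (v as u32).checked_sub(V_BASE).filter(|&i| i < V_COUNT)?;
    let t_index = match t {
        None => 0,
        Some(t) => (t as u32)
            .checked_sub(T_BASE)
            .filter(|&i| i > 0 && i < T_COUNT)?,
    };
    char::from_u32(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)
}

fn main() {
    let parts = decompose_hangul('한').unwrap();
    assert_eq!(parts, ('\u{1112}', '\u{1161}', Some('\u{11AB}')));
    assert_eq!(compose_hangul(parts.0, parts.1, parts.2), Some('한'));
}
```

Since no trie lookup is involved for these characters, Korean input exercises a different code path from the other languages listed here.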

To catch regressions, we should have a CI-run normalization benchmark with at least these cases:

  • UTF-16 English NFC to NFC with input being at least 4 memory pages long.
  • UTF-16 English NFD to NFD with input being at least 4 memory pages long.
  • &str English NFC to NFC.
  • &str English NFD to NFD.
  • &str Greek NFC to NFC.
  • &[u8] Greek NFC to NFC.
  • UTF-16 Chinese NFC to NFC.
  • Korean NFC to NFC.
  • Korean NFD to NFC.
  • Korean NFC to NFD.
  • Kannada NFC to NFC (not sure about this one).
  • Vietnamese orthographic (the form produced by the standard non-IME keyboard layout) to NFC.

Not sure what UTF makes the most sense to bench in CI for the last 5.
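As a rough illustration of the "at least 4 memory pages long" UTF-16 cases, here is a std-only sketch of how such an input could be built and timed. Everything in it is an assumption for illustration: the 4096-byte page size, the helper names, the short sample sentence (a stand-in for real Wikipedia content), and the identity-copy closure that sits where a real normalizer call would go. An actual in-tree benchmark would presumably use the repository's existing benchmark harness instead of manual timing.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Assumption: 4 KiB pages. Real page size is platform-dependent.
const PAGE_SIZE_BYTES: usize = 4096;

/// Hypothetical helper: repeats `sample` until the UTF-16 buffer spans
/// at least `pages` memory pages (2 bytes per code unit).
fn utf16_input_of_pages(sample: &str, pages: usize) -> Vec<u16> {
    let unit: Vec<u16> = sample.encode_utf16().collect();
    assert!(!unit.is_empty(), "sample must not be empty");
    let min_units = pages * PAGE_SIZE_BYTES / 2;
    let mut buf = Vec::with_capacity(min_units + unit.len());
    while buf.len() < min_units {
        buf.extend_from_slice(&unit);
    }
    buf
}

/// Hypothetical helper: times `iterations` calls of `normalize` on `input`.
/// `black_box` keeps the compiler from optimizing the calls away.
fn time_runs<F>(input: &[u16], iterations: u32, mut normalize: F) -> Duration
where
    F: FnMut(&[u16]) -> Vec<u16>,
{
    let start = Instant::now();
    for _ in 0..iterations {
        let out = normalize(black_box(input));
        black_box(out);
    }
    start.elapsed()
}

fn main() {
    // ASCII-only sample, so it is already in both NFC and NFD.
    let input = utf16_input_of_pages("The quick brown fox jumps over the lazy dog. ", 4);
    // Placeholder closure: an identity copy stands in for the normalizer under test.
    let elapsed = time_runs(&input, 1000, |s| s.to_vec());
    println!("{} code units, {:?} total", input.len(), elapsed);
}
```

Repeating a short sample keeps the sketch small, but it is much more regular than real text, so the measured numbers would not be comparable to the Wikipedia-based out-of-tree results.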

Labels

C-collator (Component: Collation, normalization), C-test-infra (Component: Integration test infrastructure)
