Description
I've been thinking how we could compute the information necessary for identical prefix skipping in the collator using the data that we already have. Here's a sketch:
For the following to make sense, #2386 should be fixed first.
On a per-code-unit basis, find the first mismatch. Then move to operating on scalar values where the first mismatching scalars are c1 and c2.
In all cases with trie lookups in the identical prefix skipping precess, remember the NFD trie value if not for a self-decomposing stater, or, if it is for a self-decomposing starter, the CE32 and whether the CE32 came via fallback. Once we start moving forward, the buffer of these remembered values should be consumed first before starting to pull characters from the underlying iterator. This way, the trie reads here are no worse than what trie reads would be performed in the absence of this optimization. The remembered values here would go into a SmallVec
in place of the Option<CharacterAndTrieValue>
contemplated in #2386. (This does introduce per-character branches to decide of there are remembered values to use of if new trie lookups need to be made.)
Allocate a new "no trie" flag for contraction CE32s to indicate that there's no Chars16Trie
, just the default value.
Assertion: Characters that map to to a prefix CE32 are always self-decomposing starters. (Check if actually true!)
In the loop, the variables not only bind to a scalar value but also its position.
Loop:
Look up c1 from the NFD trie. If the result of the lookup is 1 or 0 (starter that decomposes to self), look up the CE32 for c1. If the CE32 is a prefix CE32, let c1 and c2 be the scalar previous to c1 and c2 (the previous scalar always matches between the inputs) and continue the loop.
Otherwise if c1 is not equal to c2 (only happens on the first iteration; in implementation hoist this case out of the loop somehow), look up c2 from the the NFD trie. If the result of the lookup is 1 or 0, look up the CE32 for c2. If the CE32 is a prefix CE32, let c1 and c2 be the scalar previous to c1 and c2 and continue the loop.
Let prev be the scalar previous to c1 and c2. (The scalar previous to c1 is always identical to the scalar previous to c2.)
Look up prev from the NFD trie. If the result of the lookup is not either 1 or 0, let c1 and c2 be prev and continue the loop.
Look up the CE32 for prev.
If the CE32 is a contraction CE32, let c1 and c2 be prev and continue the loop.
If the numeric mode is on and the CE32 is a numeric CE32, let c1 and c2 be prev and continue the loop.
Otherwise, break the loop and start collating forwards with c1 and c2 as the first scalars of the comparison.
(The loop ends.)
Add the following to the builder:
In the builder, create a set "contraction middle". When a contraction is added in the builder, put the scalars in the contraction key other than the first and the last in the "contraction middle" if they are starters that decompose to themselves.
Once the collation data (tailoring or root) has otherwise been built, iterate through the characters in the "contraction middle" set and for each character, make a lookup from the trie. If the CE32 is neither a fallback CE32 with fallback to a contraction CE32 nor a contraction CE32, append the CE32 to as two u16
s to contexts
and overwrite the value in the trie with a contraction CE32 that has the "no trie" flag set and whose index points to the newly-appended default value in contexts
.
In the processing of CE32:
If a contraction CE32 has the "no trie" flag, assign the default value to ce32
and continue 'ce32loop
.