Implement identical prefix skipping without new data structures

I've been thinking how we could compute the information necessary for identical prefix skipping in the collator using the data that we already have. Here's a sketch:

For the following to make sense, #2386 should be fixed first.

On a per-code-unit basis, find the first mismatch. Then move to operating on scalar values where the first mismatching scalars are _c1_ and _c2_.

In all cases with trie lookups in the identical prefix skipping precess, remember the NFD trie value if not for a self-decomposing stater, or, if it is for a self-decomposing starter, the CE32 and whether the CE32 came via fallback. Once we start moving forward, the buffer of these remembered values should be consumed first before starting to pull characters from the underlying iterator. This way, the trie reads here are no worse than what trie reads would be performed in the absence of this optimization. The remembered values here would go into a `SmallVec` in place of the `Option<CharacterAndTrieValue>` contemplated in #2386. (This does introduce per-character branches to decide of there are remembered values to use of if new trie lookups need to be made.)

Allocate a new "no trie" flag for contraction CE32s to indicate that there's no `Chars16Trie`, just the default value.

Assertion: Characters that map to to a prefix CE32 are always self-decomposing starters. (Check if actually true!)

In the loop, the variables not only bind to a scalar value but also its position.

Loop:

Look up _c1_ from the NFD trie. If the result of the lookup is 1 or 0 (starter that decomposes to self), look up the CE32 for _c1_. If the CE32 is a prefix CE32, let _c1_ and _c2_ be the scalar previous to _c1_ and _c2_ (the previous scalar always matches between the inputs) and continue the loop.

Otherwise if _c1_ is not equal to _c2_ (only happens on the first iteration; in implementation hoist this case out of the loop somehow), look up _c2_ from the the NFD trie. If the result of the lookup is 1 or 0, look up the CE32 for _c2_. If the CE32 is a prefix CE32, let _c1_ and _c2_ be the scalar previous to _c1_ and _c2_ and continue the loop.

Let _prev_ be the scalar previous to _c1_ and _c2_. (The scalar previous to _c1_ is always identical to the scalar previous to _c2_.)

Look up _prev_ from the NFD trie. If the result of the lookup is not either 1 or 0, let _c1_ and _c2_ be _prev_ and continue the loop.

Look up the CE32 for _prev_.

If the CE32 is a contraction CE32, let _c1_ and _c2_ be _prev_ and continue the loop.

If the numeric mode is on and the CE32 is a numeric CE32, let _c1_ and _c2_ be _prev_ and continue the loop.

Otherwise, break the loop and start collating forwards with _c1_ and _c2_ as the first scalars of the comparison.

(The loop ends.)

Add the following to the builder:

In the builder, create a set "contraction middle". When a contraction is added in the builder, put the scalars in the contraction key other than the first and the last in the "contraction middle" if they are starters that decompose to themselves.

Once the collation data (tailoring or root) has otherwise been built, iterate through the characters in the "contraction middle" set and for each character, make a lookup from the trie. If the CE32 is neither a fallback CE32 with fallback to a contraction CE32 nor a contraction CE32, append the CE32 to as two `u16`s to `contexts` and overwrite the value in the trie with a contraction CE32 that has the "no trie" flag set and whose index points to the newly-appended default value in `contexts`.

In the processing of CE32:

If a contraction CE32 has the "no trie" flag, assign the default value to `ce32` and continue `'ce32loop`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement identical prefix skipping without new data structures #2712

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Implement identical prefix skipping without new data structures #2712

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions