Consider supporting three layers of collation data for search collations #3178

Open
@hsivonen

Description

In sorting, there are two layers of data: the root collation and, optionally, a language-specific tailoring overlaid on it.

In search, there are logically three layers of data: the root for sorting, a search root overlaid on that, and then, optionally, a language-specific tailoring.

However, the implementation only admits two layers, so for each language that's supposed to reuse its sort tailoring for searching, we end up generating a search tailoring that contains a merge of a copy of the search root and a copy of the sort tailoring for the language. This duplicates the search root in every such tailoring, which is obviously bad for data size.
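To make the size concern concrete, here is a rough sketch that counts stored entries under the two approaches. Everything in it is hypothetical: `Table`, `merge`, `merged_entries`, and `layered_entries` are illustration-only names, a table is modeled as a plain map from a character to an opaque value rather than as real collation data, and it assumes the sort tailoring wins when both tables define the same character.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a collation data table:
/// maps a character to an opaque collation value.
type Table = BTreeMap<char, u32>;

/// Builds the merged search tailoring: a copy of the search root with a copy
/// of the sort tailoring layered on top (assumption: the tailoring wins).
fn merge(search_root: &Table, sort_tailoring: &Table) -> Table {
    let mut merged = search_root.clone();
    merged.extend(sort_tailoring.iter().map(|(&k, &v)| (k, v)));
    merged
}

/// Entries stored when every language ships its own merged search tailoring.
fn merged_entries(search_root: &Table, tailorings: &[Table]) -> usize {
    tailorings.iter().map(|t| merge(search_root, t).len()).sum()
}

/// Entries stored when the search root is kept once as a separate layer.
fn layered_entries(search_root: &Table, tailorings: &[Table]) -> usize {
    search_root.len() + tailorings.iter().map(Table::len).sum::<usize>()
}
```

With disjoint keys, the merged approach stores roughly one extra copy of the search root per language, while the layered approach stores it once.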

An obvious solution would be to allow three layers: root, search root, and search tailoring. However, this would make search perform worse, since the common case would fall back twice.
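As a minimal sketch of what the three-layer lookup would cost, assuming each layer can be modeled as a map that either has an entry for a character or does not: `Layer` and `lookup` are hypothetical names, and this is not how the collator actually stores or queries its data.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one layer of collation data:
/// maps a character to an opaque collation value.
type Layer = HashMap<char, u32>;

/// Looks up `c` in `layers`, ordered from most specific to least specific.
/// Returns the value found together with the number of layers that missed
/// before the hit, i.e. how many times the lookup had to fall back.
fn lookup(layers: &[&Layer], c: char) -> Option<(u32, usize)> {
    for (misses, layer) in layers.iter().enumerate() {
        if let Some(&value) = layer.get(&c) {
            return Some((value, misses));
        }
    }
    None
}
```

If the layers are passed as `[tailoring, search_root, root]`, a character that only the root defines is found after two misses, whereas in the current two-layer arrangement (`[merged_search_tailoring, root]`) it is found after one.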

(An alternative that I'm considering for Firefox in the context of ICU4C for the time being is to omit the search root when a search tailoring exists and to use the corresponding sort tailoring as-is. That is, for the Latin-script languages that have special rules about which diacritics not to ignore in diacritic-insensitive search, one would lose the fuzziness for the Arabic and Thai scripts. And modern Hangul, but I don't understand the use case for the modern Hangul bits in the search root.)

Metadata

Assignees

No one assigned

    Labels

    A-data (Area: Data coverage or quality)
    C-collator (Component: Collation, normalization)
    T-enhancement (Type: Nice-to-have but not required)
    help wanted (Issue needs an assignee)
