Consider supporting three layers of collation data for search collations

In sorting, there are two layers of data: the root collation and, optionally, a language-specific tailoring overlaid on it.

In search, there are logically three layers of data: the root for sorting, a search root overlaid on that, and then, optionally, a language-specific tailoring.

However, the implementation only admits two layers, so for each language that's supposed to reuse its sort tailoring for searching, we end up generating a search tailoring that contains a merge of a copy of the search root and a copy of the sort tailoring for the language. This is bad for data size, since every such language carries its own copy of the search root.
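To make the data-size cost concrete, here is a minimal sketch of the merge. It is not the actual ICU4X or ICU4C data format: the `Layer` type (a map from a character to an opaque weight) and all function names are assumptions made up for illustration.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for one layer of collation data:
// a character mapped to an opaque collation weight.
type Layer = BTreeMap<char, u32>;

/// Builds the merged search tailoring for one language by copying the
/// search root and then overlaying a copy of the language's sort tailoring.
fn merged_search_tailoring(search_root: &Layer, sort_tailoring: &Layer) -> Layer {
    let mut merged = search_root.clone();
    // Entries from the sort tailoring win over search-root entries.
    merged.extend(sort_tailoring.iter().map(|(c, w)| (*c, *w)));
    merged
}

/// Total entries shipped when every language carries its own merged copy.
fn total_merged_entries(search_root: &Layer, sort_tailorings: &[Layer]) -> usize {
    sort_tailorings
        .iter()
        .map(|t| merged_search_tailoring(search_root, t).len())
        .sum()
}

/// Total entries shipped if the search root were stored once and shared.
fn total_layered_entries(search_root: &Layer, sort_tailorings: &[Layer]) -> usize {
    search_root.len() + sort_tailorings.iter().map(|t| t.len()).sum::<usize>()
}
```

In this toy model the merged total grows by roughly one search root per tailored language, while the shared total pays for the search root only once.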

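An obvious solution would be to allow three layers: root, search root, and search tailoring. However, this would make search perform worse, since the common case (a character that neither the search tailoring nor the search root customizes) would fall back twice before reaching the root.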
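The extra fallback can be sketched as follows. This is a toy model under stated assumptions, not how either ICU implementation stores or probes its data: `Layer` and `lookup` are hypothetical names, and a hash map stands in for the real lookup structure.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for one layer of collation data.
type Layer = HashMap<char, u32>;

/// Probes layers from most specific to least specific and returns the
/// first hit together with the number of layers that were probed.
fn lookup(layers: &[&Layer], c: char) -> (Option<u32>, usize) {
    for (i, layer) in layers.iter().enumerate() {
        if let Some(w) = layer.get(&c) {
            return (Some(*w), i + 1);
        }
    }
    // No layer had the character; every layer was probed.
    (None, layers.len())
}
```

With the slice `[tailoring, search_root, root]`, a character found only in the root costs three probes; with the merged two-layer slice `[merged_tailoring, root]` it costs two.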

(An alternative that I'm considering for Firefox in the context of ICU4C for the time being is to omit the search root when a search tailoring exists and to use the corresponding sort tailoring as-is. That is, for the Latin-script languages that have special rules about which diacritics _not_ to ignore in diacritic-insensitive search, one would lose the fuzziness for the Arabic and Thai scripts. One would also lose it for modern Hangul, but I don't understand the use case for the modern Hangul bits in the search root.)
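As I read the alternative, the layer selection would look roughly like this. The `Layer` enum, `search_layers`, and the `has_search_tailoring` flag are all hypothetical names for illustration, not anything from ICU4C or ICU4X.

```rust
// Hypothetical labels for the data layers discussed in this issue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Layer {
    Root,
    SearchRoot,
    SortTailoring,
}

/// Returns the layers to probe for search, most specific first.
/// Both branches stay within the two layers the implementation admits.
fn search_layers(has_search_tailoring: bool) -> Vec<Layer> {
    if has_search_tailoring {
        // Reuse the sort tailoring as-is and skip the search root,
        // which loses the search root's fuzziness for this language.
        vec![Layer::SortTailoring, Layer::Root]
    } else {
        // No tailoring: the search root sits directly on the root.
        vec![Layer::SearchRoot, Layer::Root]
    }
}
```

The trade-off is visible in the first branch: `Layer::SearchRoot` never appears for a tailored language, so nothing from the search root applies there.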

Consider supporting three layers of collation data for search collations #3178
