Replace normalizer bench data with data from https://github.com/unicode-org/test-corpora/ #6881

@hsivonen

We should replace the normalizer test data with data from https://github.com/unicode-org/test-corpora/. However, test-corpora contains (X)HTML, so the data that gets used must be paragraph-level text (as opposed to various headers) with the markup removed.
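
As a rough illustration of the extraction step, here is a minimal Python sketch that keeps only the text inside `<p>` elements and drops headings and inline markup. It uses only the standard library `html.parser`; the names `ParagraphTextExtractor` and `extract_paragraphs` are hypothetical, and the assumption that the wanted text lives in `<p>` elements would need to be checked against the actual test-corpora files.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collects the text of <p> elements; everything else is ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs, one string each
        self._in_p = False     # True while inside a <p> element
        self._parts = []       # text fragments of the current paragraph

    def _flush(self):
        # Whitespace inside the paragraph is left untouched; only the
        # ends are trimmed.
        text = "".join(self._parts).strip()
        if text:
            self.paragraphs.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed <p> is ended by the next <p>.
            self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Inline tags such as <b> or <i> are dropped; their text is kept.
        if self._in_p:
            self._parts.append(data)


def extract_paragraphs(markup):
    """Return the list of paragraph texts found in an (X)HTML string."""
    parser = ParagraphTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs
```

Heading elements such as `<h1>` never set the paragraph flag, so their text is skipped. A real corpus may need extra rules (for example, skipping `<p>` elements used for captions or navigation).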

To the extent that multiple paragraphs/lines from test-corpora are combined, the combination should not introduce line breaks or ASCII spaces into text in languages that would not normally use a space at the point of combination.
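
One possible way to honor this when combining paragraphs is to pick the separator per language, sketched below in Python. The function name `join_paragraphs` and the set `NO_SPACE_LANGUAGES` are hypothetical, and the set's contents are only an illustrative assumption; which languages belong in it would need review for each corpus.

```python
# Hypothetical and incomplete: language codes assumed not to put an ASCII
# space between consecutive sentences. Needs review per corpus.
NO_SPACE_LANGUAGES = {"ja", "zh"}


def join_paragraphs(paragraphs, lang):
    """Join paragraph strings into one line without adding line breaks."""
    # Compare on the primary language subtag, e.g. "zh" for "zh-Hant".
    primary = lang.split("-")[0].lower()
    separator = "" if primary in NO_SPACE_LANGUAGES else " "
    # Trim each piece so stray source whitespace does not leak into the join.
    pieces = [p.strip() for p in paragraphs if p.strip()]
    return separator.join(pieces)
```

For example, `join_paragraphs(["第一段。", "第二段。"], "zh-Hant")` joins the two pieces with no separator, while English input is joined with a single ASCII space.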

Metadata

Assignees

No one assigned

    Labels

    A-performance (Area: Performance (CPU, Memory))
    C-collator (Component: Collation, normalization)
    good first issue (Good for newcomers)
    help wanted (Issue needs an assignee)
