Replace normalizer bench data with data from https://github.com/unicode-org/test-corpora/

We should replace the normalizer test data with data from https://github.com/unicode-org/test-corpora/ . However, `test-corpora` contains (X)HTML, so the data that gets used must be paragraph-level text (as opposed to various headers) with all markup removed.
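As a rough illustration of the markup-stripping step, here is a minimal sketch using Python's standard `html.parser`. It assumes that paragraph-level text lives in `<p>` elements, which has not been checked against the actual `test-corpora` files; `ParagraphExtractor` and `extract_paragraphs` are hypothetical names, not part of any existing tooling.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of <p> elements, dropping all markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []  # finished paragraph strings
        self._depth = 0       # greater than 0 while inside a <p> element
        self._parts = []      # text fragments of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._parts).strip()
                if text:
                    self.paragraphs.append(text)
                self._parts = []

    def handle_data(self, data):
        # Text outside <p> (headers, titles, navigation) is ignored.
        if self._depth > 0:
            self._parts.append(data)


def extract_paragraphs(markup):
    """Return the markup-free text of each <p> element in `markup`."""
    parser = ParagraphExtractor()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs
```

Inline elements such as `<b>` contribute their text but not their tags, and header text is skipped entirely. Line breaks that appear inside a paragraph in the source file are left untouched by this sketch and would need separate handling.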

To the extent that multiple paragraphs/lines from `test-corpora` are combined, the combination should not introduce line breaks or ASCII spaces into languages that wouldn't normally use spaces at the point of combination.
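One possible way to honor this constraint is to pick the separator per language. The sketch below is an assumption-laden illustration: `NO_SPACE_LANGUAGES` and `combine_paragraphs` are hypothetical names, the language tag is assumed to be known for each corpus file, and the set of languages is illustrative rather than exhaustive.

```python
# Illustrative only: languages assumed not to put a space between sentences.
# A real list would need review for each language in the corpus.
NO_SPACE_LANGUAGES = frozenset({"ja", "zh"})


def combine_paragraphs(paragraphs, language):
    """Join paragraphs into one line without adding line breaks.

    An ASCII space is used as the separator only when `language`
    (a tag such as "en" or "zh-Hant") is not in NO_SPACE_LANGUAGES.
    """
    primary = language.split("-")[0].lower()
    separator = "" if primary in NO_SPACE_LANGUAGES else " "
    # Trim surrounding whitespace and skip paragraphs that are empty.
    cleaned = [p.strip() for p in paragraphs]
    return separator.join(p for p in cleaned if p)
```

For example, `combine_paragraphs(["One.", "Two."], "en")` joins with a single space, while the same call with `"ja"` concatenates the paragraphs directly.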

Replace normalizer bench data with data from https://github.com/unicode-org/test-corpora/ #6881
