We should replace the normalizer test data with data from the test-corpora repository (https://github.com/unicode-org/test-corpora/). However, test-corpora contains (X)HTML, so we need to make sure that the data we use is paragraph-level text (as opposed to various headers) with the markup removed.
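As a rough illustration of the extraction step, here is a minimal Python sketch that keeps only the text found inside `<p>` elements and drops all tags. The name `extract_paragraphs` is hypothetical, and the sketch assumes that every `<p>` is explicitly closed (as in XHTML) and that "headers" means elements outside `<p>`, such as `<h1>`; the actual layout of the test-corpora files has not been checked here.

```python
from html.parser import HTMLParser


class _ParagraphExtractor(HTMLParser):
    """Collects the text content of <p> elements and ignores everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Interior whitespace is kept as-is; only the ends are trimmed.
            text = "".join(self._parts).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline children of <p> (e.g. <em>) arrives here too,
        # so inline markup is dropped while its text is kept.
        if self._in_p:
            self._parts.append(data)


def extract_paragraphs(markup: str) -> list[str]:
    """Hypothetical helper: return markup-free text of each <p> in `markup`."""
    parser = _ParagraphExtractor()
    parser.feed(markup)
    parser.close()
    return parser.paragraphs
```

For example, `extract_paragraphs("<h1>Title</h1><p>Hello <em>world</em>.</p>")` yields a single paragraph containing `Hello world.`, with the heading text discarded.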
To the extent that multiple paragraphs or lines from test-corpora are combined, the combination should not introduce line breaks or ASCII spaces at the join point in languages that would not normally use a space there.
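One possible way to handle the join, sketched in Python under stated assumptions: the function `combine_paragraphs` and the set `NO_SPACE_LANGS` are hypothetical, the set is illustrative rather than complete, and the sketch assumes a BCP 47 language tag is known for each corpus file.

```python
# Illustrative, incomplete set of primary language subtags whose text
# would not normally take an ASCII space at a join point (assumption).
NO_SPACE_LANGS = {"zh", "ja"}


def combine_paragraphs(paragraphs: list[str], lang: str) -> str:
    """Hypothetical helper: join paragraphs without adding line breaks.

    An ASCII space is used as the separator only when the language is
    not listed in NO_SPACE_LANGS; otherwise the pieces are concatenated
    directly.
    """
    primary = lang.split("-")[0].lower()
    separator = "" if primary in NO_SPACE_LANGS else " "
    return separator.join(paragraphs)
```

With this sketch, `combine_paragraphs(["你好。", "再见。"], "zh-Hans")` concatenates the two pieces directly, while `combine_paragraphs(["Hello.", "Bye."], "en")` joins them with a single space. Neither case adds a line break.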