Description
Currently, the line break algorithm interprets the maximum line length in terms of .NET string length. Strings in .NET use UTF-16 encoding and define length as the number of code units, not the number of code points. The two differ as soon as the string contains code points beyond the Basic Multilingual Plane, since each of those is represented by two code units (a so-called surrogate pair) in UTF-16.
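To make the distinction concrete, here is a minimal sketch (assuming .NET Core 3.0 or later for `EnumerateRunes`; the sample string is made up for illustration) that compares the two ways of measuring the same string:

```csharp
using System;
using System.Linq;

// "💩" is U+1F4A9, which lies outside the Basic Multilingual Plane.
string s = "a💩b";

// string.Length counts UTF-16 code units, so the emoji counts twice.
int codeUnits = s.Length;

// EnumerateRunes yields one Rune per Unicode code point.
int codePoints = s.EnumerateRunes().Count();

Console.WriteLine($"code units: {codeUnits}, code points: {codePoints}");

// The emoji occupies indices 1 and 2 as a high/low surrogate pair.
Console.WriteLine(char.IsHighSurrogate(s[1]) && char.IsLowSurrogate(s[2]));
```

The string has 4 code units but only 3 code points, which is exactly the gap the line length check currently ignores.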
The line break algorithm doesn't take surrogate pairs into account at the moment, which is a bug:
Consider a situation in which the line break falls in the middle of a surrogate pair. This leaves a lone high surrogate at the end of the line and a lone low surrogate at the start of the next one. Neither is a valid code point on its own, and since .NET usually replaces lone surrogates with a placeholder (U+FFFD) when encoding, the string content ends up corrupted.
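The effect can be reproduced without the library. The following is only an illustrative sketch: the sample string, the split index, and the `RoundTrip` helper are made up for this example and are not part of the library's code.

```csharp
using System;
using System.Text;

string s = "ab💩cd";

// Index 3 lies between the high and the low surrogate of the emoji,
// so a split based purely on UTF-16 indices tears the pair apart.
string head = s.Substring(0, 3);
string tail = s.Substring(3);

// Encoding.UTF8 uses a replacement fallback: each lone surrogate
// is written out as U+FFFD instead of raising an error.
string headRoundTrip = RoundTrip(head);
string tailRoundTrip = RoundTrip(tail);

Console.WriteLine(headRoundTrip == "ab\uFFFD"); // True
Console.WriteLine(tailRoundTrip == "\uFFFDcd"); // True

// Hypothetical helper: encode to UTF-8 bytes and decode again.
static string RoundTrip(string value) =>
    Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(value));
```

Once both halves have passed through the encoder, the original character cannot be recovered from the output.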
MRE:
var generator = new POGenerator(new POGeneratorSettings
{
SkipInfoHeaders = true,
});
var id =
"""
12345678901234567890123456789012345678901234567890123456789012345678901234567💩90
""";
var catalog = new POCatalog { Encoding = "UTF-8" };
var entry = new POSingularEntry(new POKey(id));
catalog.Add(entry);
using var ms = new MemoryStream();
using var writer = new StreamWriter(ms, Encoding.UTF8);
generator.Generate(writer, catalog);
writer.Flush();
var s = Encoding.UTF8.GetString(ms.GetBuffer().AsSpan(0, (int)ms.Length));
The content of s will be:
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
msgid ""
"12345678901234567890123456789012345678901234567890123456789012345678901234567�"
"�90"
msgstr ""
The only reasonable solution seems to be interpreting the maximum line length in terms of Unicode code points rather than UTF-16 code units.
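As a rough sketch of what counting in code points could look like, the hypothetical `SplitByCodePoints` function below advances through the string one code point at a time and never ends a chunk inside a surrogate pair. It is not the library's implementation: the name and signature are assumptions, and it ignores everything else a real PO line breaker has to handle (quoting, escape sequences, preferred break positions).

```csharp
using System;
using System.Collections.Generic;

var lines = new List<string>(SplitByCodePoints("1234567💩90", 8));
foreach (string line in lines)
    Console.WriteLine(line);

// Hypothetical helper: splits text into chunks of at most
// maxCodePoints Unicode code points each.
static IEnumerable<string> SplitByCodePoints(string text, int maxCodePoints)
{
    if (maxCodePoints <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxCodePoints));

    int start = 0;
    while (start < text.Length)
    {
        int end = start;
        int count = 0;
        while (end < text.Length && count < maxCodePoints)
        {
            // A valid surrogate pair takes two chars but counts as one code point.
            end += char.IsSurrogatePair(text, end) ? 2 : 1;
            count++;
        }
        yield return text.Substring(start, end - start);
        start = end;
    }
}
```

With a limit of 8, the first chunk is `1234567💩` (9 code units, 8 code points) and the second is `90`. One consequence of this design is that a line may be longer than the limit when measured with `string.Length`.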