Surrogate pairs can be cut in half, leading to corrupted strings #30

@adams85

Currently, the line break algorithm interprets the maximum line length in terms of .NET string length. Strings in .NET are UTF-16 encoded, and their length is the number of code units, not the number of code points. The two differ as soon as the string contains code points beyond the Basic Multilingual Plane, since those are represented by two code units (a so-called surrogate pair) in UTF-16.
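To make the difference between code units and code points concrete, here is a minimal sketch using only the .NET base class library (assuming .NET Core 3.0 or later for `EnumerateRunes`). The sample string is made up for illustration.

```csharp
using System;
using System.Linq;

// "💩" is U+1F4A9, which lies outside the BMP,
// so UTF-16 stores it as a surrogate pair (two code units).
string sample = "a💩b";

int codeUnits = sample.Length;                     // UTF-16 code units
int codePoints = sample.EnumerateRunes().Count();  // Unicode code points

Console.WriteLine($"code units: {codeUnits}, code points: {codePoints}");

// The two halves of the pair sit at indices 1 and 2.
bool hasHigh = char.IsHighSurrogate(sample[1]);
bool hasLow = char.IsLowSurrogate(sample[2]);
Console.WriteLine($"high at 1: {hasHigh}, low at 2: {hasLow}");
```

Any length limit measured with `Length` counts the emoji as two units, so a cut position can land between its two halves.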

The line break algorithm doesn't take surrogate pairs into account at the moment, which is a bug:

Consider a situation in which the line break falls in the middle of a surrogate pair. This leaves a lone high surrogate at the end of the line and a lone low surrogate at the start of the next one. Since .NET encoders replace lone surrogates with the replacement character (U+FFFD) by default, this ultimately corrupts the string content.
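The effect can be reproduced without the library. The following sketch cuts a string by code unit index between the two halves of a surrogate pair and round-trips both pieces through `Encoding.UTF8`. The string and the cut position are chosen for illustration only.

```csharp
using System;
using System.Text;

string text = "567💩90";

// Index 3 holds the high surrogate and index 4 the low surrogate,
// so cutting at 4 splits the pair.
int cut = 4;
string head = text.Substring(0, cut);  // ends with a lone high surrogate
string tail = text.Substring(cut);     // starts with a lone low surrogate

// Encoding.UTF8 uses a replacement fallback by default, so each lone
// surrogate is encoded as U+FFFD instead of throwing.
string headRoundTrip = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(head));
string tailRoundTrip = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(tail));

Console.WriteLine(headRoundTrip == "567\uFFFD");
Console.WriteLine(tailRoundTrip == "\uFFFD90");
```

Once the two halves are encoded separately, the original code point cannot be recovered by concatenating the lines again.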

MRE:

using System.IO;
using System.Text;
using Karambolo.PO;

var generator = new POGenerator(new POGeneratorSettings
{
    SkipInfoHeaders = true,
});

var id =
"""
12345678901234567890123456789012345678901234567890123456789012345678901234567💩90
""";

var catalog = new POCatalog { Encoding = "UTF-8" };
var entry = new POSingularEntry(new POKey(id));
catalog.Add(entry);

using var ms = new MemoryStream();
using var writer = new StreamWriter(ms, Encoding.UTF8);
generator.Generate(writer, catalog);
writer.Flush();

var s = Encoding.UTF8.GetString(ms.GetBuffer().AsSpan(0, (int)ms.Length));

The content of s will be:

msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

msgid ""
"12345678901234567890123456789012345678901234567890123456789012345678901234567�"
"�90"
msgstr ""

It seems that the only reasonable way out of this mess is to interpret the maximum line length in terms of Unicode code points.
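As a rough illustration of that idea (not the library's actual implementation), the sketch below splits a string into chunks of at most a given number of code points by iterating over `Rune` values. The helper name `SplitByCodePoints` is hypothetical. It deliberately ignores everything else a real PO line breaker has to handle, such as escape sequences and preferred break positions.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: splits text into chunks of at most maxCodePoints
// Unicode code points, so a surrogate pair is never cut in half.
static List<string> SplitByCodePoints(string text, int maxCodePoints)
{
    if (maxCodePoints < 1)
        throw new ArgumentOutOfRangeException(nameof(maxCodePoints));

    var lines = new List<string>();
    var current = new StringBuilder();
    int count = 0;

    // EnumerateRunes yields whole code points. Surrogates that are already
    // unpaired in the input are reported as U+FFFD.
    foreach (Rune rune in text.EnumerateRunes())
    {
        if (count == maxCodePoints)
        {
            lines.Add(current.ToString());
            current.Clear();
            count = 0;
        }
        current.Append(rune.ToString());
        count++;
    }

    if (current.Length > 0)
        lines.Add(current.ToString());

    return lines;
}

// Usage: the emoji counts as one code point and stays intact.
List<string> parts = SplitByCodePoints("567💩90", 4);
foreach (string part in parts)
    Console.WriteLine(part);
```

A cheaper alternative would be to keep counting code units and only move the break position back by one when it falls between a high and a low surrogate, but then the effective limit is still expressed in code units rather than code points.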
