Surrogate pairs can be cut in half, leading to corrupted strings

Currently, the line break algorithm interprets the maximum line length in terms of .NET string length. Strings in .NET use UTF-16 encoding and define length as the number of _code units_, not the number of _code points_. The two differ as soon as the string contains code points beyond the Basic Multilingual Plane, since each of those is represented by two code units (a so-called surrogate pair) in UTF-16.
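To make the difference concrete, here is a small sketch (not taken from the library) that compares the two ways of measuring a string. It assumes .NET Core 3.0 or later, where `string.EnumerateRunes()` is available.

```cs
using System;
using System.Linq;
using System.Text;

var s = "a💩"; // U+0061 followed by U+1F4A9, which lies outside the BMP

Console.WriteLine(s.Length);                    // 3 (UTF-16 code units)
Console.WriteLine(s.EnumerateRunes().Count());  // 2 (Unicode code points)
Console.WriteLine(char.IsHighSurrogate(s[1]));  // True
Console.WriteLine(char.IsLowSurrogate(s[2]));   // True
```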

The line break algorithm doesn't take surrogate pairs into account at the moment, which is a bug:

Consider a situation in which the line break falls in the middle of a surrogate pair. The line then ends with a lone high surrogate and the next line starts with a lone low surrogate. Neither is a valid code point on its own, and .NET usually replaces lone surrogates with a placeholder (U+FFFD) when encoding, which ultimately corrupts the string content.
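The effect can be reproduced without the library. The following sketch splits a string between the two halves of a surrogate pair and round-trips each half through `Encoding.UTF8`, relying on its default replacement fallback. The variable names are only illustrative.

```cs
using System;
using System.Text;

var text = "7💩9";

// Cut after two code units, i.e. between the high and the low surrogate of 💩.
var head = text.Substring(0, 2);
var tail = text.Substring(2);

// By default, Encoding.UTF8 substitutes U+FFFD for each lone surrogate.
var roundTripped = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(head))
                 + Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(tail));

Console.WriteLine(roundTripped == "7\uFFFD\uFFFD9"); // True
Console.WriteLine(roundTripped == text);             // False
```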

MRE:

```cs
using System;
using System.IO;
using System.Text;
using Karambolo.PO;

var generator = new POGenerator(new POGeneratorSettings
{
    SkipInfoHeaders = true,
});

var id =
"""
12345678901234567890123456789012345678901234567890123456789012345678901234567💩90
""";

var catalog = new POCatalog { Encoding = "UTF-8" };
var entry = new POSingularEntry(new POKey(id));
catalog.Add(entry);

using var ms = new MemoryStream();
using var writer = new StreamWriter(ms, Encoding.UTF8);
generator.Generate(writer, catalog);
writer.Flush();

var s = Encoding.UTF8.GetString(ms.GetBuffer().AsSpan(0, (int)ms.Length));
```

The content of `s` will be:

```
﻿msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

msgid ""
"12345678901234567890123456789012345678901234567890123456789012345678901234567�"
"�90"
msgstr ""
```

The only reasonable fix seems to be interpreting the maximum line length in terms of Unicode code points rather than UTF-16 code units.
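A rough sketch of what that could look like follows. `FindBreakIndex` is a hypothetical helper, not part of the library's API. It only shows how a break position can be counted in code points so that it never lands inside a surrogate pair.

```cs
using System;

// Hypothetical helper: returns the code-unit index at which a line starting at
// `start` should end so that it holds at most `maxCodePoints` code points
// and never splits a surrogate pair.
static int FindBreakIndex(string text, int start, int maxCodePoints)
{
    var index = start;
    for (var count = 0; count < maxCodePoints && index < text.Length; count++)
    {
        // A well-formed surrogate pair occupies two code units but counts once.
        index += char.IsSurrogatePair(text, index) ? 2 : 1;
    }
    return index;
}

var text = "67💩90";
var end = FindBreakIndex(text, 0, 3);

Console.WriteLine(end);                               // 4
Console.WriteLine(text.Substring(0, end) == "67💩");  // True
Console.WriteLine(text.Substring(end) == "90");       // True
```

Working with code-unit indices and skipping pairs as a whole keeps the rest of the algorithm free to use ordinary `Substring` calls. Only the length accounting changes.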
