Utf8.TryWrite applies alignment by counting bytes instead of characters #109615

Open
@jdryden572

Description

When using Utf8.TryWrite to write an interpolated string as UTF8 bytes, passing an alignment value with any of the formatted values does not always result in the same amount of padding as when using string.Format or default string interpolation. If the formatted value contains any non-ASCII characters, less padding is added.

Reproduction Steps

using System;
using System.Text;
using System.Text.Unicode;

string[] examples = new[]
{
    "\u0108",       // Ĉ 1 char, 2 bytes UTF8
    "\u20ac",       // € 1 char, 3 bytes UTF8
    "\ud83d\ude00", // 😀 2 chars, 4 bytes UTF8
};

foreach (string s in examples)
{
    Console.WriteLine($"utf16: [{s,4}]");
}
foreach (string s in examples)
{
    Span<byte> span = new byte[8];
    Utf8.TryWrite(span, $"[{s,4}]", out int written);
    Console.WriteLine("utf8:  " + Encoding.UTF8.GetString(span.Slice(0, written)));
}

Expected behavior

Formatting a value with an alignment in Utf8.TryWrite should produce the same amount of padding in UTF8 as is added in other .NET string (UTF16) formatted strings.

For the code snippet above, it should produce:

utf16: [   Ĉ]
utf16: [   €]
utf16: [  😀]
utf8:  [   Ĉ]
utf8:  [   €]
utf8:  [  😀]

Actual behavior

When the formatted value includes any characters that require more than 1 byte in UTF8 encoding, the alignment is incorrect and produces less padding in Utf8.TryWrite.

For the code snippet above, it produces:

utf16: [   Ĉ]
utf16: [   €]
utf16: [  😀]
utf8:  [  Ĉ]
utf8:  [ €]
utf8:  [😀]
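
The missing spaces in each utf8 line are consistent with the padding being computed from the UTF8 byte count rather than the char count. The sketch below reproduces that arithmetic for the three example strings. PaddingByBytes and PaddingByChars are hypothetical helper names used only for illustration; they are not part of any .NET API.

```csharp
using System;
using System.Text;

// Hypothetical helper: padding when the width is measured in UTF8 bytes
// (the behavior reported for Utf8.TryWrite).
static int PaddingByBytes(string s, int alignment) =>
    Math.Max(0, Math.Abs(alignment) - Encoding.UTF8.GetByteCount(s));

// Hypothetical helper: padding when the width is measured in UTF16 chars
// (the behavior of string.Format and default string interpolation).
static int PaddingByChars(string s, int alignment) =>
    Math.Max(0, Math.Abs(alignment) - s.Length);

foreach (string s in new[] { "\u0108", "\u20ac", "\ud83d\ude00" })
{
    Console.WriteLine(
        $"{s.Length} chars, {Encoding.UTF8.GetByteCount(s)} bytes: " +
        $"padding by bytes = {PaddingByBytes(s, 4)}, by chars = {PaddingByChars(s, 4)}");
}
```

With an alignment of 4 this gives 2, 1 and 0 spaces when counting bytes, versus 3, 3 and 2 spaces when counting chars, which matches the actual and expected output above.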

Regression?

This has been the behavior since the Utf8.TryWrite API was introduced in .NET 8, and it is also reproducible in .NET 9.

Known Workarounds

If the correct padding is really needed, default string interpolation or formatting can be used to format the value as UTF16 in a string or a Span<char>, and then that UTF16 can be encoded into the UTF8 output Span<byte> using Encoding.UTF8.GetBytes.

This loses the nice ergonomics of formatting directly into the UTF8 buffer, and either allocates (if making a string) or requires more buffer management to get a Span<char>.
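
A minimal sketch of that workaround without a string allocation, assuming a stack-allocated scratch buffer is acceptable. TryWriteAligned is a hypothetical helper name, and the 64-char scratch size is an arbitrary choice for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper: format as UTF16 first (so the alignment counts chars),
// then transcode the result into the UTF8 destination.
static bool TryWriteAligned(Span<byte> destination, string value, out int bytesWritten)
{
    bytesWritten = 0;

    // Scratch buffer for the UTF16 text; the size is an assumption.
    Span<char> chars = stackalloc char[64];
    if (!chars.TryWrite($"[{value,4}]", out int charsWritten))
    {
        return false; // scratch buffer too small
    }

    ReadOnlySpan<char> text = chars.Slice(0, charsWritten);

    // Encoding.GetBytes throws if the destination is too small, so check first.
    if (Encoding.UTF8.GetByteCount(text) > destination.Length)
    {
        return false;
    }

    bytesWritten = Encoding.UTF8.GetBytes(text, destination);
    return true;
}

byte[] buffer = new byte[16];
if (TryWriteAligned(buffer, "\u20ac", out int written))
{
    Console.WriteLine("utf8:  " + Encoding.UTF8.GetString(buffer, 0, written));
}
```

The extra buffer and the second pass over the text are exactly the bookkeeping that formatting directly into the Span<byte> would avoid.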

Configuration

Tested in .NET 8 & .NET 9 Preview
I'm using Windows x64, but I'm pretty sure this is not platform/arch dependent.

Other information

Before I begin -- I am interested in trying to fix this and I'm happy to open a PR for it. It would be my first time contributing however, so I understand if you feel someone else should handle fixing it instead.

The issue is that the amount of required padding is being determined by counting how many bytes were written, even though we're working with UTF8 where many characters take more than one byte. Here's the culprit:

int paddingNeeded = alignment - bytesWritten;

The simple (and maybe too naive) approach to fix this would be to use Encoding.UTF8.GetCharCount on the slice that was written, to measure how many chars the formatted text ended up writing. But this private method is called by multiple overloads of AppendFormatted, and for some of them we already know how many chars we wrote. For example, if the value being formatted is a ReadOnlySpan<char> or string, we know how many chars it had. Or if it was ISpanFormattable, we already formatted it into our own Span<char> buffer before writing and know how many chars there are.

So I think a better solution might be to find a way to have the overloads pass an optional int charsWritten if they know how many there were. If not, the alignment handling should call Encoding.UTF8.GetCharCount on the bytes we wrote so far to calculate how many chars it ended up being.
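
Here is a rough sketch of that idea as a standalone function. It is not the actual runtime code: PaddingNeeded, its parameters, and the use of -1 as the "unknown" value are all assumptions made for illustration.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the alignment step. Callers that already know the
// char count pass it in; otherwise it is measured from the bytes written.
static int PaddingNeeded(ReadOnlySpan<byte> written, int alignment, int charsWritten = -1)
{
    if (charsWritten < 0)
    {
        // Fallback: count the UTF16 chars that the written UTF8 bytes decode to.
        charsWritten = Encoding.UTF8.GetCharCount(written);
    }

    // A negative alignment means left-aligned; the width is its absolute value.
    return Math.Max(0, Math.Abs(alignment) - charsWritten);
}

byte[] euro = Encoding.UTF8.GetBytes("\u20ac");         // 1 char, 3 bytes
byte[] emoji = Encoding.UTF8.GetBytes("\ud83d\ude00");  // 2 chars, 4 bytes

Console.WriteLine(PaddingNeeded(euro, 4));                  // 3 (measured)
Console.WriteLine(PaddingNeeded(euro, 4, charsWritten: 1)); // 3 (known up front)
Console.WriteLine(PaddingNeeded(emoji, 4));                 // 2 (measured)
```

Passing the count where it is already known avoids a second scan of the bytes, and the fallback keeps the remaining overloads correct.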
