Description
When using Utf8.TryWrite to write an interpolated string as UTF8 bytes, passing an alignment value with any of the formatted values does not always result in the same amount of padding as when using string.Format or default string interpolation. If the formatted value contains any non-ASCII characters, less padding is added.
Reproduction Steps
using System;
using System.Text;
using System.Text.Unicode;

string[] examples = new[]
{
    "\u0108", // Ĉ 1 char, 2 bytes UTF8
    "\u20ac", // € 1 char, 3 bytes UTF8
    "\ud83d\ude00", // 😀 2 chars, 4 bytes UTF8
};

foreach (string s in examples)
{
    Console.WriteLine($"utf16: [{s,4}]");
}

foreach (string s in examples)
{
    Span<byte> span = new byte[8];
    Utf8.TryWrite(span, $"[{s,4}]", out int written);
    Console.WriteLine("utf8: " + Encoding.UTF8.GetString(span.Slice(0, written)));
}
Expected behavior
Formatting a value with an alignment in Utf8.TryWrite should produce the same amount of padding in UTF8 as is added in other .NET string (UTF16) formatted strings.
For the code snippet above, it should produce:
utf16: [   Ĉ]
utf16: [   €]
utf16: [  😀]
utf8: [   Ĉ]
utf8: [   €]
utf8: [  😀]
Actual behavior
When the formatted value includes any characters that require more than 1 byte in UTF8 encoding, the alignment is incorrect and produces less padding in Utf8.TryWrite.
For the code snippet above, it produces:
utf16: [   Ĉ]
utf16: [   €]
utf16: [  😀]
utf8: [  Ĉ]
utf8: [ €]
utf8: [😀]
Regression?
This has been the behavior since the Utf8.TryWrite API was introduced in .NET 8, and it is also reproducible in .NET 9.
Known Workarounds
If the correct padding is really needed, default string interpolation or formatting can be used to format the value as UTF16 in a string or a Span<char>, and then that UTF16 can be encoded into the UTF8 output Span<byte> using Encoding.UTF8.GetBytes.
This loses the nice ergonomics of formatting directly into the UTF8 buffer, and either allocates (if making a string) or requires more buffer management to get a Span<char>.
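A minimal sketch of the workaround described above, assuming the formatted text fits in a small stack-allocated scratch buffer. TryWriteAligned is a hypothetical helper name, not a .NET API, and the 64-char scratch size is an arbitrary assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper (not a .NET API): formats as UTF-16 first so the
// alignment is measured in chars, then transcodes the result to UTF-8.
static bool TryWriteAligned(Span<byte> destination, string value, out int bytesWritten)
{
    bytesWritten = 0;

    // Assumption: the formatted text fits in 64 chars.
    Span<char> scratch = stackalloc char[64];
    if (!scratch.TryWrite($"[{value,4}]", out int charsWritten))
    {
        return false;
    }

    ReadOnlySpan<char> formatted = scratch.Slice(0, charsWritten);

    // GetBytes throws if the destination is too small, so check the size first.
    if (Encoding.UTF8.GetByteCount(formatted) > destination.Length)
    {
        return false;
    }

    bytesWritten = Encoding.UTF8.GetBytes(formatted, destination);
    return true;
}

Span<byte> output = new byte[16];
if (TryWriteAligned(output, "\u20ac", out int written))
{
    // The euro sign is 1 char, so it is padded with 3 spaces.
    Console.WriteLine("utf8: " + Encoding.UTF8.GetString(output.Slice(0, written)));
}
```

The extra scratch buffer and the size check are the buffer management cost mentioned above; formatting into a string instead would avoid them at the price of an allocation.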
Configuration
Tested in .NET 8 & .NET 9 Preview
I'm using Windows x64, but I'm pretty sure this is not platform/arch dependent.
Other information
Before I begin -- I am interested in trying to fix this and I'm happy to open a PR for it. It would be my first time contributing however, so I understand if you feel someone else should handle fixing it instead.
The issue is that the amount of required padding is being determined by counting how many bytes were written, even though we're working with UTF8 where many characters take more than one byte. Here's the culprit:
The simple (and maybe too naive) approach to fix this would be to use Encoding.UTF8.GetCharCount on the slice that was written, to measure how many chars the formatted text ended up writing. But this private method is called by multiple overloads of AppendFormatted, and for some of them we already know how many chars we wrote. For example, if the value being formatted is a ReadOnlySpan<char> or string, we know how many chars it had. Or if it was ISpanFormattable, we already formatted it into our own Span<char> buffer before writing and know how many chars there are.
So I think a better solution might be to find a way to have the overloads pass an optional int charsWritten if they know how many there were. If not, the alignment handling should call Encoding.UTF8.GetCharCount on the bytes we wrote so far to calculate how many chars it ended up being.