Add APIs to BlobBuilder for customizing the underlying byte array et al. #115294

Open
wants to merge 25 commits into main

Conversation

teo-tsirpanis
Contributor

@teo-tsirpanis teo-tsirpanis commented May 5, 2025

Fixes #99244
Fixes #100418

This PR builds on top of @jaredpar's branch to add APIs for customizing the underlying buffer of a BlobBuilder. The chunking logic of BlobBuilder was updated to allocate multiple additional chunks with a user-customizable maximum size each. As part of this, we use APIs from System.Text.Unicode.Utf8 to encode UTF-8 strings, which increases performance and safety, and reduces duplicate code.
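
To make the chunking idea concrete, here is a minimal sketch of splitting a requested byte count into chunks capped at a maximum size. `ChunkPlanSketch` and `PlanChunks` are hypothetical names used only for illustration; they are not part of the PR, and the actual allocation logic in `BlobBuilder` may well differ (for example, by enforcing a minimum chunk size).

```csharp
using System;
using System.Collections.Generic;

public static class ChunkPlanSketch
{
    // Hypothetical helper (not from the PR): returns the sizes of the chunks
    // needed to hold bytesNeeded bytes when no chunk may exceed maxChunkSize.
    public static List<int> PlanChunks(int bytesNeeded, int maxChunkSize)
    {
        if (bytesNeeded < 0) throw new ArgumentOutOfRangeException(nameof(bytesNeeded));
        if (maxChunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkSize));

        var sizes = new List<int>();
        int remaining = bytesNeeded;
        while (remaining > 0)
        {
            // Each chunk takes as much as allowed; the last one takes the remainder.
            int size = Math.Min(remaining, maxChunkSize);
            sizes.Add(size);
            remaining -= size;
        }
        return sizes;
    }
}
```

Under these assumptions, `ChunkPlanSketch.PlanChunks(10, 4)` yields three chunks of sizes 4, 4, and 2.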

@Copilot Copilot AI review requested due to automatic review settings May 5, 2025 02:04
@ghost

ghost commented May 5, 2025

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@teo-tsirpanis teo-tsirpanis marked this pull request as draft May 5, 2025 02:04
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 5, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds new APIs to BlobBuilder for customizing the underlying byte array and updates related encoding and buffer-handling logic across metadata and core libraries. Key changes include replacing legacy UTF-8 encoding code with calls to the new System.Text.Unicode.Utf8 APIs, updating BlobBuilder’s API surface (including new constructors and properties), and adding NET-specific intrinsics support across several core modules.

Reviewed Changes

Copilot reviewed 31 out of 36 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| System/Reflection/Internal/Utilities/StreamExtensions.cs | Removed obsolete TryReadAll overload for Span to rely on newer API paths. |
| System/Reflection/Internal/Utilities/BlobUtilities.cs | Rewrote WriteUtf8 to use Utf8.FromUtf16 for encoding, replacing manual UTF-8 encoding logic. |
| System/Reflection/Metadata.cs | Added new BlobBuilder constructors, properties, and APIs including ReadOnlySpan/WriteBytes overloads. |
| System.Private.CoreLib (various files) | Updated intrinsics and preprocessor conditions (#if NET, #if SYSTEM_PRIVATE_CORELIB) for newer vectorized and ASCII helper routines. |
| Microsoft.Bcl.Memory (PACKAGE.md and others) | Updated documentation and type forwarding to include UTF-8 APIs for NET platforms. |

Files not reviewed (5)
  • src/libraries/Microsoft.Bcl.Memory/src/Microsoft.Bcl.Memory.csproj: Language not supported
  • src/libraries/Microsoft.Bcl.Memory/tests/Microsoft.Bcl.Memory.Tests.csproj: Language not supported
  • src/libraries/System.Reflection.Metadata/System.Reflection.Metadata.sln: Language not supported
  • src/libraries/System.Reflection.Metadata/src/Resources/Strings.resx: Language not supported
  • src/libraries/System.Reflection.Metadata/src/System.Reflection.Metadata.csproj: Language not supported

Contributor

Tagging subscribers to this area: @dotnet/area-system-reflection-metadata
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented May 5, 2025

> As part of this, we use APIs from System.Text.Unicode.Utf8 to encode UTF-8 strings, which increases performance and safety, and reduces duplicate code.

Is this needed to introduce the new APIs?

The System.Text.Unicode.Utf8 change introduces a new dependency for System.Reflection.Metadata on .NET Framework, which will be extra work to push through the system. It would be better to avoid bundling the two changes together in a single PR.

@am11
Member

am11 commented May 5, 2025

> > As part of this, we use APIs from System.Text.Unicode.Utf8 to encode UTF-8 strings, which increases performance and safety, and reduces duplicate code.
>
> Is this needed to introduce the new APIs?

The other PR is approved and ready to merge #111292. After the merge and rebase this branch against main, those commits will disappear.

@jkotas
Member

jkotas commented May 5, 2025

> The other PR is approved and ready to merge #111292. After the merge and rebase this branch against main, those commits will disappear.

The change that introduces System.Reflection.Metadata dependency on Microsoft.Bcl.Memory won't disappear.

@teo-tsirpanis
Contributor Author

Switched back to the old WriteUTF8 implementation for downlevel frameworks, which was rewritten to use spans and remove unsafe code. On modern .NET we still use System.Text.Unicode.Utf8.

Tests pass locally. This is ready for review.
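
As a rough illustration of the split described above, the sketch below uses `System.Text.Unicode.Utf8.FromUtf16` when compiled for modern .NET and falls back to `Encoding.UTF8.GetBytes` elsewhere. `Utf8Sketch.Encode` is a hypothetical helper, not the PR's code; in particular, it replaces invalid sequences instead of implementing the `allowUnpairedSurrogates` behavior, and the PR's downlevel path is a hand-written encoder rather than `Encoding.UTF8`.

```csharp
using System;
using System.Buffers;
using System.Text;
#if NET
using System.Text.Unicode;
#endif

public static class Utf8Sketch
{
    // Hypothetical helper (not from the PR): encodes text as UTF-8 into
    // destination and returns the number of bytes written.
    public static int Encode(string text, byte[] destination)
    {
#if NET
        // Modern .NET: use the built-in UTF-16 to UTF-8 transcoder.
        OperationStatus status = Utf8.FromUtf16(
            text.AsSpan(), destination, out _, out int bytesWritten,
            replaceInvalidSequences: true, isFinalBlock: true);
        if (status != OperationStatus.Done)
        {
            throw new ArgumentException("Destination buffer is too small.", nameof(destination));
        }
        return bytesWritten;
#else
        // Downlevel stand-in for this sketch only; throws if destination is too small.
        return Encoding.UTF8.GetBytes(text, 0, text.Length, destination, 0);
#endif
    }
}
```

With a 16-byte destination, `Utf8Sketch.Encode("aΘ😂", buffer)` writes 7 bytes: one for `a`, two for `Θ`, and four for the emoji.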

@teo-tsirpanis teo-tsirpanis marked this pull request as ready for review May 10, 2025 11:39
@teo-tsirpanis teo-tsirpanis requested a review from steveharter May 10, 2025 11:39
@jkotas
Member

jkotas commented May 10, 2025

> WriteUTF8 implementation for downlevel frameworks, which was rewritten to use spans and remove unsafe code

What's the performance regression introduced by this rewrite on .NET Framework?

Our primary interest in removing unsafe code is on latest .NET. It is fine to keep unsafe code for .NET Framework if it is required for decent performance.

@teo-tsirpanis
Contributor Author

I updated the function to use unsafe code and wrote a benchmark to compare it with my initial safe edition. We cannot compare it with the existing unsafe implementation since the functions don't have the same signature and semantics. The numbers look promising so I switched to the unsafe edition:


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5796/22H2/2022Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
  [Host]     : .NET Framework 4.8.1 (4.8.9310.0), X64 RyuJIT VectorSize=256
  DefaultJob : .NET Framework 4.8.1 (4.8.9310.0), X64 RyuJIT VectorSize=256


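| Method | N | Mean | Error | StdDev | Ratio | RatioSD |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Safe | 16 | 276.9 ns | 5.38 ns | 4.77 ns | 1.00 | 0.02 |
| Unsafe | 16 | 173.1 ns | 3.05 ns | 2.85 ns | 0.63 | 0.01 |
| Safe | 128 | 1,946.7 ns | 35.89 ns | 31.81 ns | 1.00 | 0.02 |
| Unsafe | 128 | 1,210.2 ns | 17.84 ns | 15.81 ns | 0.62 | 0.01 |
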
Benchmark code
// See https://aka.ms/new-console-template for more information
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.InteropServices;
using System.Text;

BenchmarkRunner.Run<Utf8Bench>();

public class Utf8Bench
{
    public string TestString = null!;

    public byte[] TestBytes = new byte[2048];

    [Params(16, 128)]
    public int N { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var sb = new StringBuilder();
        for (int i = 0; i < N; i++)
        {
            sb.Append('a');
        }
        for (int i = 0; i < N; i++)
        {
            sb.Append('Θ');
        }
        for (int i = 0; i < N; i++)
        {
            sb.Append("😂");
        }
        TestString = sb.ToString();
        TestBytes = new byte[2048];
    }

    [Benchmark(Baseline = true)]
    public int Safe()
    {
        WriteUtf8Safe(TestString.AsSpan(), TestBytes, out int charsRead, out int bytesWritten, true);
        return charsRead + bytesWritten;
    }

    [Benchmark]
    public int Unsafe()
    {
        WriteUtf8Unsafe(TestString.AsSpan(), TestBytes, out int charsRead, out int bytesWritten, true);
        return charsRead + bytesWritten;
    }

    public static void WriteUtf8Safe(ReadOnlySpan<char> source, Span<byte> destination, out int charsRead, out int bytesWritten, bool allowUnpairedSurrogates)
    {
        // Copy from PR
    }

    public static unsafe void WriteUtf8Unsafe(ReadOnlySpan<char> source, Span<byte> destination, out int charsRead, out int bytesWritten, bool allowUnpairedSurrogates)
    {
        // Copy from PR
    }
}

Comment on lines +205 to +214
/// <summary>
/// Changes the size of the byte array underpinning the <see cref="BlobBuilder"/>.
/// Derived types can override this method to control the allocation strategy.
/// </summary>
/// <param name="capacity">The array's new size.</param>
/// <seealso cref="Capacity"/>
protected virtual void SetCapacity(int capacity)
{
    Array.Resize(ref _buffer, Math.Max(MinChunkSize, capacity));
}
Contributor Author

I don't understand how subclasses are supposed to override this. They will reassign the Buffer property, but according to a comment in #100418, reassigning Buffer clears the head chunk.

What are the semantics? Should Capacity's setter have additional logic?

4 participants