Arm64: Implement region write barriers #111636

Open, wants to merge 43 commits into base: main

a74nh (Contributor) commented Jan 20, 2025

Extend the Arm64 writebarrier function to support regions. The assembly is updated in a similar way to the AMD64 version.

This is expected to make the writebarrier slower, but improve the performance of the GC.

dotnet-policy-service bot added the community-contribution label Jan 20, 2025

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

kunalspathak (Member) commented Jan 21, 2025

FYI - @Maoni0
@mrsharm @cshung - what preliminary tests can we run to validate the performance impact?

a74nh (Contributor Author) commented Jan 21, 2025

I also have a bunch of notes where I rewrote the AMD64 and ARM64 write barrier assembly in pseudo code. I'll tidy up and add somewhere in docs/

EgorBo (Member) commented Jan 23, 2025

@a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)

a74nh (Contributor Author) commented Jan 23, 2025

> @a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)

I think all the failures are fixed up now. So, yes, this would be a good time. If you've got something to run that'd be great.

I've been using your orchard.sh script that runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in reqs per sec, although not sure if that's a good enough test.

EgorBo (Member) commented Jan 23, 2025

> I've been using your orchard.sh script that runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in reqs per sec, although not sure if that's a good enough test.

AFAIR it's not bottlenecked on the write barrier, and presumably your PR is supposed to decrease the average GC pause rather than the WB's throughput? So you might want to look at the GC stats. orchard.sh should have a USE_DOTNET_TRACE property that you need to set to 1 to grab traces (and set DOTNET_TRACE_ARGS to listen to GC events specifically).
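
A minimal sketch of the setup described above, assuming orchard.sh reads both variables from its environment. `USE_DOTNET_TRACE` and `DOTNET_TRACE_ARGS` come from the comment; the value `--profile gc-verbose` is an assumption (it is a built-in dotnet-trace profile, but whether orchard.sh forwards the string verbatim to dotnet-trace is not confirmed in this thread), and `trace_env` is a hypothetical helper name.

```python
import os


def trace_env(base=None):
    """Return a copy of the environment with GC tracing switched on for orchard.sh."""
    env = dict(os.environ if base is None else base)
    # From the comment above: set to 1 to grab traces.
    env["USE_DOTNET_TRACE"] = "1"
    # Assumption: the string is passed through to dotnet-trace; gc-verbose is a
    # GC-focused built-in profile.
    env["DOTNET_TRACE_ARGS"] = "--profile gc-verbose"
    return env


if __name__ == "__main__":
    env = trace_env({})
    for name in sorted(env):
        print(f"{name}={env[name]}")
    # To launch the script with these settings (the path is hypothetical):
    #   subprocess.run(["./orchard.sh"], env=trace_env(), check=True)
```

Building a copy of the environment rather than mutating `os.environ` keeps the baseline and PR runs from leaking settings into each other.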

EgorBo (Member) commented Jan 23, 2025

@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;

    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference (a freshly allocated object)
        Dst1 = new object();
    }
}

EgorBo (Member) commented Jan 23, 2025

I guess it's sort of expected that it's slower throughput-wise in microbenchmarks. The WB_nonephemeral perf is mostly here: https://gist.github.com/EgorBot/a6db6579aba05de6a25f111513cb54b2#file-diff_asm_bcd38073-asm-L30 which is, I guess,

    // Check whether the region we're storing into is gen 0 - nothing to do in this case
    ldrb w12, [x12]
    cbz  w12, LOCAL_LABEL(Exit)

(I guess I should've added an extra benchmark where the object we're storing is gen2)

PS: feel free to call the bot yourself if needed

mrsharm (Member) commented Jan 24, 2025

> FYI - @Maoni0 @mrsharm @cshung - what preliminary tests can we run to validate the performance impact?

Sorry for the delay. I would run the microbenchmarks listed below, with and without this change, on the pertinent hardware for a sufficient number of iterations (some of these exhibit a considerable amount of variance). The other consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand. This can be achieved by:

  1. Not removing the outliers: --outliers DontRemove.
  2. Setting a fixed number of invocations that is high enough to reduce the standard error: --invocationCount {InvocationCount}.
  3. Setting a fixed number of iterations: --iterationCount 20.

The tests:
- System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536*)
- System.Tests.Perf_GC<Byte>.AllocateArray(length: 1000, *)
- System.Tests.Perf_GC<Char>.AllocateArray(length: 1000, *)
- System.Tests.Perf_GC<Byte>.AllocateArray(length: 10000, *)
- System.Tests.Perf_GC<Char>.AllocateArray(length: 10000, *)
- System.Tests.Perf_GC<Byte>.AllocateUninitializedArray(length: 1000, *)
- System.Tests.Perf_GC<Char>.AllocateUninitializedArray(length: 1000, *)
- System.Tests.Perf_GC<Byte>.AllocateUninitializedArray(length: 10000, *)
- System.Tests.Perf_GC<Char>.AllocateUninitializedArray(length: 10000, *)
- System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 1000)
- System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 10000)
- System.Tests.Perf_GC<Char>.NewOperator_Array(length: 1000)
- System.Tests.Perf_GC<Char>.NewOperator_Array(length: 10000)
- System.IO.Tests.Perf_File.ReadAllBytesAsync(size: 104857600)
- System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536*)
- System.Collections.CtorGivenSize<String>.Array(size: 512)
- ByteMark.BenchBitOps
- System.IO.Tests.Perf_File.ReadAllBytes(size: 104857600)
- System.Linq.Tests.Perf_Enumerable.ToArray*
- System.Collections.Tests.Perf_BitArray.BitArrayByteArrayCtor(size: 512)
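
The three settings above can be collected into one argument list so that the baseline and the PR build are run identically. This is only a sketch: `bdn_args` is a hypothetical helper, the filter glob and the invocation count of 4096 are illustrative values, and `--filter` is the standard BenchmarkDotNet console argument rather than something named in this thread.

```python
def bdn_args(filters, invocation_count, iteration_count=20):
    """Build BenchmarkDotNet arguments that keep baseline and PR runs comparable."""
    if invocation_count <= 0 or iteration_count <= 0:
        raise ValueError("counts must be positive")
    args = ["--filter", *filters]
    # 1. Keep outliers so that runs containing a GC are not discarded.
    args += ["--outliers", "DontRemove"]
    # 2. Fixed invocation count, so both builds do the same amount of work.
    args += ["--invocationCount", str(invocation_count)]
    # 3. Fixed iteration count.
    args += ["--iterationCount", str(iteration_count)]
    return args


if __name__ == "__main__":
    # Illustrative filter; substitute the benchmark names from the list above.
    print(" ".join(bdn_args(["System.Tests.Perf_GC*"], invocation_count=4096)))
```

The resulting list would be appended to whatever command launches the microbenchmark project for each toolchain.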

Once the microbenchmarks are run, the pertinent metrics would be the % difference in the time of execution of a test + the standard error of tests.
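
As a rough illustration of that metric, the sketch below computes the percentage difference between two means and applies a crude noise check. The inputs are the WB_nonephemeral figures for HEAD and the original PR reported by a74nh later in this thread; treating BenchmarkDotNet's Error column as a simple margin is a simplifying assumption, and both function names are hypothetical.

```python
def percent_difference(baseline_ns, comparand_ns):
    """Percentage change of the comparand relative to the baseline (positive = slower)."""
    return (comparand_ns - baseline_ns) / baseline_ns * 100.0


def exceeds_noise(baseline_ns, baseline_err, comparand_ns, comparand_err):
    """Crude check: is the gap between the means larger than both error margins combined?"""
    return abs(comparand_ns - baseline_ns) > (baseline_err + comparand_err)


if __name__ == "__main__":
    head_mean, head_err = 4.383, 0.0078  # HEAD, ns
    pr_mean, pr_err = 5.677, 0.0078      # PR with region barriers, ns
    print(f"difference: {percent_difference(head_mean, pr_mean):.1f}%")
    print(f"outside noise: {exceeds_noise(head_mean, head_err, pr_mean, pr_err)}")
```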

As a note: see #73783 for the regression that was created when we moved to a More Precise Write Barrier for x64. It seems like one of the affected microbenchmarks is already in the list above. I remember StackWalk being extremely volatile, but it is still worth trying out.

cshung (Member) commented Jan 24, 2025

As we run the benchmarks, I would pay attention to ephemeral GC pause time, in particular the time spent on marking cards.

a74nh (Contributor Author) commented Mar 5, 2025

@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;

    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference (a freshly allocated object)
        Dst1 = new object();
    }
}

a74nh (Contributor Author) commented Mar 7, 2025

Running the tests from earlier. Cobalt 100 with 8 cores, 32 GB.

For non-ephemeral, the performance drops by 30% (the same as what EgorBot had seen). With the writebarrier manager implemented, that drop roughly halves (to 16%).

Ephemeral has a 7% drop. Adding the writebarrier manager doesn't really affect that.

| Method          | Toolchain               | Mean      | Error     | Ratio | Gen0   |
|---------------- |-------------------------|----------:|----------:|------:|-------:|
| WB_nonephemeral | HEAD                    |  4.383 ns | 0.0078 ns |  1.00 |      - |
| WB_nonephemeral | PR original use regions |  5.677 ns | 0.0078 ns |  1.30 |      - |
| WB_nonephemeral | PR writebarriermanager  |  5.080 ns | 0.0049 ns |  1.16 |      - |
| WB_ephemeral    | HEAD                    | 13.353 ns | 0.0890 ns |  1.00 | 0.0004 |
| WB_ephemeral    | PR original use regions | 14.325 ns | 0.1603 ns |  1.07 | 0.0003 |
| WB_ephemeral    | PR writebarriermanager  | 14.003 ns | 0.0800 ns |  1.05 | 0.0003 |

I ran the test again, but this time, for the writebarriermanager, I forced it to never use the region barriers (by removing the g_region_shr check in the writebarriermanager).

Now, for WB_nonephemeral, the performance is slightly faster than HEAD when using the writebarriermanager.

Ephemeral has an 8% speed-up too. Given the drop on the original PR vs HEAD as well, I'm happy to discount the ephemeral differences as within noise.


| Method          | Toolchain               | Mean      | Error     | Ratio | Gen0   |
|---------------- |-------------------------|----------:|----------:|------:|-------:|
| WB_nonephemeral | HEAD                    |  4.388 ns | 0.0037 ns |  1.00 |      - |
| WB_nonephemeral | PR original use regions |  5.686 ns | 0.0087 ns |  1.30 |      - |
| WB_nonephemeral | PR writebarriermanager  |  4.113 ns | 0.0060 ns |  0.94 |      - |
| WB_ephemeral    | HEAD                    | 13.430 ns | 0.0796 ns |  1.00 | 0.0004 |
| WB_ephemeral    | PR original use regions | 13.614 ns | 0.0710 ns |  1.01 | 0.0004 |
| WB_ephemeral    | PR writebarriermanager  | 12.406 ns | 0.0652 ns |  0.92 | 0.0003 |

I'm happy with these results as region write barriers will always introduce some additional cost. This is coupled with the decrease in GC pause times (shown by the GC perf test here).

Is that good enough for this PR to continue?

(I still need to fix the Windows/macOS builds and do some additional functionality testing.)

cshung (Member) commented Mar 7, 2025

> I ran the test again, but this time for the writebarriermanager, I forced it to never use the region barriers (by removing the g_region_shr check in the writebarriermanager).

I wonder if the DOTNET_GCWriteBarrier environment variable works? In principle, if we specify this to 3 which is WRITE_BARRIER_SERVER, we should not be using any region checks at all. We shouldn't have to modify code to get back to original behavior.

Allowing this will help us with mitigating the performance regression risk since then we can just advise to change the configuration settings as needed.

// From gcconfig.h

enum WriteBarrierFlavor
{
    WRITE_BARRIER_DEFAULT = 0,
    WRITE_BARRIER_REGION_BIT = 1,
    WRITE_BARRIER_REGION_BYTE = 2,
    WRITE_BARRIER_SERVER = 3,
};
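
To make the mapping concrete, here is a small sketch that mirrors the enum above and resolves the flavor from an environment dictionary. It is an illustration only, not the runtime's configuration code: `flavor_from_env` is a hypothetical helper, and parsing the value as hexadecimal is an assumption based on how other `DOTNET_` GC settings are read (for the value 3 it makes no difference).

```python
from enum import IntEnum


class WriteBarrierFlavor(IntEnum):
    """Mirror of the WriteBarrierFlavor enum quoted above from gcconfig.h."""
    DEFAULT = 0
    REGION_BIT = 1
    REGION_BYTE = 2
    SERVER = 3


def flavor_from_env(env):
    """Resolve the requested write barrier flavor from an environment mapping."""
    raw = env.get("DOTNET_GCWriteBarrier")
    if raw is None:
        return WriteBarrierFlavor.DEFAULT
    # Assumption: the value is interpreted as hexadecimal.
    return WriteBarrierFlavor(int(raw, 16))


if __name__ == "__main__":
    flavor = flavor_from_env({"DOTNET_GCWriteBarrier": "3"})
    print(flavor.name)
```

What `DEFAULT` resolves to is decided by the runtime and is deliberately left out of the sketch.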

a74nh (Contributor Author) commented Mar 12, 2025

> I wonder if the DOTNET_GCWriteBarrier environment variable works? In principle, if we specify this to 3 which is WRITE_BARRIER_SERVER, we should not be using any region checks at all. We shouldn't have to modify code to get back to original behavior.

Yes, I get the same results when setting DOTNET_GCWriteBarrier to 3. With that, the original version of the PR is still stuck at a big loss due to it being a single large writebarrier function.

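| Method          | Toolchain              | Mean     | Error     | Ratio |
|-----------------|------------------------|---------:|----------:|------:|
| WB_nonephemeral | HEAD                   | 4.394 ns | 0.0163 ns |  1.00 |
| WB_nonephemeral | PR original            | 5.427 ns | 0.0016 ns |  1.24 |
| WB_nonephemeral | PR writebarriermanager | 4.094 ns | 0.0044 ns |  0.93 |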

> Allowing this will help us with mitigating the performance regression risk since then we can just advise to change the configuration settings as needed.

Agreed, that makes sense. I'm guessing that is already suggested for X64 users.

a74nh (Contributor Author) commented Mar 12, 2025

Had to do some reworking to get this working on macOS, as macOS does not allow ldr x12, label1-label2. Had to hardcode some fixed constants. On startup, asserts check that the constants are valid. Tested on macOS and it looks like it works.

Just Windows left to fix up.

Comment on lines 43 to 45
// Check this is going from old to young
if reg_loc_dst >= reg_loc_ref:
    return
Member commented:

Can we double check the comparison direction here? I think the correct direction should be reversed.

Suggested change
- // Check this is going from old to young
- if reg_loc_dst >= reg_loc_ref:
-     return
+ // Return if the new reference is not from old to young
+ if reg_loc_ref >= reg_loc_dst:
+     return

Contributor Author replied:

> Can we double check the comparison direction here? I think the correct direction should be reversed.

Yes, your version is correct. Fixed and updated in assembly files too.

Member replied:

> Yes, your version is correct. Fixed and updated in assembly files too.

Thanks for confirming. I notice the changes you pushed only touch the comments. How about the assembly code?

Contributor Author replied:

The assembly code matches the comments :)
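
For readers following the comparison direction, the corrected check can be modelled as below. This is a sketch under stated assumptions, not the PR's assembly: generations are numbered 0 (youngest) to 2 (oldest), `region_gen` is a hypothetical table from region index to generation, and `REGION_SHR` is a made-up shift standing in for `g_region_shr`. `gen_dst` and `gen_ref` play the role of `reg_loc_dst` and `reg_loc_ref` in the pseudocode. Null references and card-table updates are left out.

```python
REGION_SHR = 22  # hypothetical shift; stands in for g_region_shr


def needs_card_mark(dst_addr, ref_addr, region_gen, region_shr=REGION_SHR):
    """Return True when storing ref_addr at dst_addr creates an old-to-young pointer."""
    gen_dst = region_gen[dst_addr >> region_shr]
    # Storing into a gen 0 region: nothing to do in this case.
    if gen_dst == 0:
        return False
    gen_ref = region_gen[ref_addr >> region_shr]
    # Return if the new reference is not from old to young.
    if gen_ref >= gen_dst:
        return False
    return True


if __name__ == "__main__":
    gens = {0: 0, 1: 1, 2: 2}  # region index -> generation (illustrative)
    old_obj, young_obj = 2 << REGION_SHR, 0
    print(needs_card_mark(old_obj, young_obj, gens))   # gen 2 object pointing at gen 0
    print(needs_card_mark(young_obj, old_obj, gens))   # gen 0 object pointing at gen 2
```

With the original direction (`gen_dst >= gen_ref` returning early), the old-to-young store in the first call would have been skipped, which is the case the barrier exists to record.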

kunalspathak (Member) commented

As discussed offline, will be good to do a comparison of workstation vs. server GC impact in pause time and performance numbers.

a74nh (Contributor Author) commented Mar 14, 2025

> As discussed offline, will be good to do a comparison of workstation vs. server GC impact in pause time and performance numbers.

With DOTNET_gcServer=1:

| Method          | Toolchain              | Mean      | Error     | Ratio | Gen0   |
|-----------------|------------------------|----------:|----------:|------:|-------:|
| WB_nonephemeral | HEAD                   |  4.385 ns | 0.0026 ns |  1.00 |      - |
| WB_nonephemeral | PR original            |  5.673 ns | 0.0085 ns |  1.29 |      - |
| WB_nonephemeral | PR writebarriermanager |  5.124 ns | 0.0325 ns |  1.17 |      - |
| WB_ephemeral    | HEAD                   | 13.846 ns | 0.2754 ns |  1.00 | 0.0011 |
| WB_ephemeral    | PR original            | 14.770 ns | 0.3602 ns |  1.07 | 0.0011 |
| WB_ephemeral    | PR writebarriermanager | 14.590 ns | 0.3060 ns |  1.05 | 0.0011 |

With DOTNET_gcServer=0:

| Method          | Toolchain              | Mean      | Error     | Ratio | Gen0   |
|-----------------|------------------------|----------:|----------:|------:|-------:|
| WB_nonephemeral | HEAD                   |  4.383 ns | 0.0078 ns |  1.00 |      - |
| WB_nonephemeral | PR original            |  5.677 ns | 0.0078 ns |  1.30 |      - |
| WB_nonephemeral | PR writebarriermanager |  5.080 ns | 0.0049 ns |  1.16 |      - |
| WB_ephemeral    | HEAD                   | 13.353 ns | 0.0890 ns |  1.00 | 0.0004 |
| WB_ephemeral    | PR original            | 14.325 ns | 0.1603 ns |  1.07 | 0.0003 |
| WB_ephemeral    | PR writebarriermanager | 14.003 ns | 0.0800 ns |  1.05 | 0.0003 |

With DOTNET_gcServer=1 DOTNET_GCWriteBarrier=3:

| Method          | Toolchain              | Mean      | Error     | Ratio | Gen0   |
|-----------------|------------------------|----------:|----------:|------:|-------:|
| WB_nonephemeral | HEAD                   |  4.375 ns | 0.0041 ns |  1.00 |      - |
| WB_nonephemeral | PR original            |  5.434 ns | 0.0006 ns |  1.24 |      - |
| WB_nonephemeral | PR writebarriermanager |  3.195 ns | 0.0040 ns |  0.73 |      - |
| WB_ephemeral    | HEAD                   | 13.340 ns | 0.2573 ns |  1.00 | 0.0012 |
| WB_ephemeral    | PR original            | 14.192 ns | 0.3018 ns |  1.06 | 0.0011 |
| WB_ephemeral    | PR writebarriermanager | 12.431 ns | 0.1251 ns |  0.93 | 0.0011 |

With DOTNET_gcServer=0 DOTNET_GCWriteBarrier=3:

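| Method          | Toolchain              | Mean      | Error     | Ratio | Gen0   |
|-----------------|------------------------|----------:|----------:|------:|-------:|
| WB_nonephemeral | HEAD                   |  4.394 ns | 0.0163 ns |  1.00 |      - |
| WB_nonephemeral | PR original            |  5.427 ns | 0.0016 ns |  1.24 |      - |
| WB_nonephemeral | PR writebarriermanager |  4.094 ns | 0.0044 ns |  0.93 |      - |
| WB_ephemeral    | HEAD                   | 11.468 ns | 0.0392 ns |  1.00 | 0.0003 |
| WB_ephemeral    | PR original            | 11.654 ns | 0.0572 ns |  1.02 | 0.0003 |
| WB_ephemeral    | PR writebarriermanager | 11.655 ns | 0.0699 ns |  1.02 | 0.0003 |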

As I hoped, server GC is giving similar results to workstation GC. This makes sense, as server and workstation will be using the same writebarrier when GCWriteBarrier=0, and different but short versions when GCWriteBarrier=3.

Labels: arch-arm64, area-VM-coreclr, community-contribution

8 participants