Arm64: Implement region write barriers #111636

a74nh · 2025-01-20T18:14:03Z

(@Maoni0 will merge this PR when all the data is collected)

Extend the Arm64 writebarrier function to support regions and use the WriteBarrierManager, similar to Amd64. This results in 10 different versions of the JIT_WriteBarrier, with the WriteBarrierManager deciding on which version to use.

Pseudo code for the writebarrier is included in GC-write-barriers.md

This is expected to make the writebarrier slower, but improve the performance of the GC. DOTNET_GCWriteBarrier=3 can be used to give the same functionality as before this change.

The behavior of the writebarrier is:

- Before the PR: check ephemeral bounds, update a byte in the card table, mark the card bundle.
- After the PR:
  - DOTNET_GCWriteBarrier=1 (default, bit region write barriers): check ephemeral bounds, check regions, update a bit in the card table, mark the card bundle.
  - DOTNET_GCWriteBarrier=2 (byte region write barriers): check ephemeral bounds, check regions, update a byte in the card table, mark the card bundle.
  - DOTNET_GCWriteBarrier=3 (server write barriers): check ephemeral bounds, update a byte in the card table, mark the card bundle. This is the same as before the PR.
  - DOTNET_gcServer=1: update a byte in the card table, mark the card bundle.
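To make the difference between the four behaviours above concrete, here is a minimal C sketch over a small simulated heap. It is not the runtime's assembly: the enum and function names are hypothetical, every shift and table size is an illustrative assumption, and the exact region test (skip when the destination region is gen 0, or when the stored object is not in a younger generation than the destination) is my assumption about what "check regions" means.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sizes only; none of these are the runtime's real values. */
enum {
    HEAP_SIZE       = 1 << 16, /* simulated heap, addressed as offsets 0..65535 */
    REGION_SHIFT    = 12,      /* one region per 4096 bytes */
    CARD_BYTE_SHIFT = 8,       /* one card-table byte per 256 bytes */
    CARD_BIT_SHIFT  = 5,       /* in bit mode, one bit per 32 bytes */
    BUNDLE_SHIFT    = 12       /* one card-bundle entry per 4096 bytes */
};

/* Hypothetical names for the four behaviours listed above. */
typedef enum {
    BARRIER_REGION_BIT,   /* DOTNET_GCWriteBarrier=1 */
    BARRIER_REGION_BYTE,  /* DOTNET_GCWriteBarrier=2 */
    BARRIER_SERVER_STYLE, /* DOTNET_GCWriteBarrier=3, same as before the PR */
    BARRIER_SERVER_GC     /* DOTNET_gcServer=1 */
} barrier_kind;

typedef struct {
    uintptr_t ephemeral_low, ephemeral_high;          /* ephemeral bounds */
    uint8_t region_gen[HEAP_SIZE >> REGION_SHIFT];    /* generation per region */
    uint8_t card_table[HEAP_SIZE >> CARD_BYTE_SHIFT]; /* card table bytes */
    uint8_t card_bundle[HEAP_SIZE >> BUNDLE_SHIFT];   /* card bundle entries */
} gc_state;

/* Test layout: lower half of the heap is gen 2, upper half is gen 0. */
void gc_state_init(gc_state *gc)
{
    memset(gc, 0, sizeof *gc);
    gc->ephemeral_low = HEAP_SIZE / 2;
    gc->ephemeral_high = HEAP_SIZE;
    memset(gc->region_gen, 2, (HEAP_SIZE / 2) >> REGION_SHIFT);
}

/* Models the store "*dst = src"; returns true when a card was marked. */
bool write_barrier(gc_state *gc, barrier_kind kind, uintptr_t dst, uintptr_t src)
{
    assert(dst < HEAP_SIZE && src < HEAP_SIZE);

    /* Every flavour except server GC first checks the ephemeral bounds. */
    if (kind != BARRIER_SERVER_GC &&
        (src < gc->ephemeral_low || src >= gc->ephemeral_high))
        return false;

    /* Region flavours also compare generations (assumed rule, see lead-in). */
    if (kind == BARRIER_REGION_BIT || kind == BARRIER_REGION_BYTE) {
        uint8_t dst_gen = gc->region_gen[dst >> REGION_SHIFT];
        uint8_t src_gen = gc->region_gen[src >> REGION_SHIFT];
        if (dst_gen == 0 || src_gen >= dst_gen)
            return false;
    }

    size_t card = dst >> CARD_BYTE_SHIFT;
    if (kind == BARRIER_REGION_BIT)
        gc->card_table[card] |= (uint8_t)(1u << ((dst >> CARD_BIT_SHIFT) & 7));
    else
        gc->card_table[card] = 0xFF; /* byte flavours dirty the whole byte */

    gc->card_bundle[dst >> BUNDLE_SHIFT] = 0xFF;
    return true;
}
```

In this model, a store of a gen 0 object into a gen 0 region marks a card with BARRIER_SERVER_STYLE but not with the region flavours, which is the "more work per store, fewer cards set" trade-off the results below measure.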

Test results on an 8 core Cobalt 100.

Ephemeral test (dotnet/performance)

WB_nonephemeral : -20%
WB_ephemeral: -16%

WKS GC is calculating the generation of regions in addition to comparing with g_ephemeral_low/high. So while it might set fewer cards, it is more expensive and it shows.

With DOTNET_GCWriteBarrier=3:
WB_nonephemeral : +15%
WB_ephemeral: +1%

SVR GC WB also became more expensive but it sets way fewer cards (for nonephemeral it should set almost no cards).

GCPerfsim

Flags: -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0

No environment variables set:
Gen0 pause: -21.06%. Gen1 pause -14.25%

DOTNET_GCWriteBarrier=2:
Gen0 pause: -6.7%. Gen1 pause -2.78%

DOTNET_GCWriteBarrier=3 :
Gen0 pause: -1.37%. Gen1 pause -1.26%

DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
Gen0 pause: -7.24%. Gen1 pause -3.49%

The numbers above are from Linux. On Windows, with no environment variables set, the pause improvements are smaller but still quite noticeable, around 8% to 10% for this GCPerfSim configuration.

| | Baseline | 13608 | Diff: 13608 | Diff %: 13608 |
|---|---:|---:|---:|---:|
| Process ID | 19732 | 13608 | | |
| Process Name | corerun | corerun | | |
| Commandline | corerun.exe GCPerfSim.dll -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0 | corerun.exe GCPerfSim.dll -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0 | | |
| Process Duration (Sec) | 35.945 | 32.834 | -3.111 | -8.655 |
| Total Allocated MB | 215,230.37 | 215,263.87 | 33.505 | 0.016 |
| Max Size Peak MB | 4,444.05 | 4,505.40 | 61.357 | 1.381 |
| GC Count | 38,865.00 | 38,728.00 | -137 | -0.353 |
| Heap Count | 1 | 1 | 0 | 0 |
| Gen0 Count | 3,076.00 | 3,646.00 | 570 | 18.531 |
| Gen1 Count | 35,774.00 | 35,067.00 | -707 | -1.976 |
| Ephemeral Count | 38,850.00 | 38,713.00 | -137 | -0.353 |
| Gen2 Blocking Count | 1 | 1 | 0 | 0 |
| BGC Count | 14 | 14 | 0 | 0 |
| Gen0 Total Pause Time MSec | 1,302.02 | 1,386.41 | 84.388 | 6.481 |
| Gen1 Total Pause Time MSec | 16,992.42 | 14,964.89 | -2,027.52 | -11.932 |
| Ephemeral Total Pause Time MSec | 18,294.43 | 16,351.30 | -1,943.14 | -10.621 |
| Blocking Gen2 Total Pause Time MSec | 2.319 | 2.271 | -0.048 | -2.07 |
| BGC Total Pause Time MSec | 4.225 | 4.44 | 0.215 | 5.081 |
| GC Pause Time % | 50.914 | 49.82 | -1.093 | -2.148 |
| Avg. Gen0 Pause Time (ms) | 0.423 | 0.38 | -0.043 | -10.165 |
| Avg. Gen1 Pause Time (ms) | 0.475 | 0.427 | -0.048 | -10.156 |
| Avg. Gen0 Promoted (mb) | 0.862 | 0.8 | -0.061 | -7.119 |
| Avg. Gen1 Promoted (mb) | 0.783 | 0.787 | 0.004 | 0.573 |
| Avg. Gen0 Speed (mb/ms) | 2.036 | 2.105 | 0.069 | 3.391 |
| Avg. Gen1 Speed (mb/ms) | 1.648 | 1.845 | 0.197 | 11.943 |

Looking at the card marking speed, it has clearly improved.

Orchard CMS benchmark

+~2% reqs/sec

src/coreclr/vm/arm64/patchedcode.S

kunalspathak · 2025-01-21T15:40:41Z

FYI - @Maoni0
@mrsharm @cshung - what preliminary tests can we run to validate the performance impact?

a74nh · 2025-01-21T18:18:06Z

I also have a bunch of notes where I rewrote the AMD64 and ARM64 write barrier assembly in pseudo code. I'll tidy them up and add them somewhere in docs/.

src/coreclr/vm/arm64/asmhelpers.S

EgorBo · 2025-01-23T15:00:25Z

@a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)

a74nh · 2025-01-23T15:06:46Z

> @a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64)

I think all the failures are fixed up now. So, yes, this would be a good time. If you've got something to run that'd be great.

I've been using your orchard.sh script that runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in reqs per sec, although not sure if that's a good enough test.

src/coreclr/vm/gcenv.ee.cpp

EgorBo · 2025-01-23T15:15:36Z

> I've been using your orchard.sh script that runs on a single machine, on 4 cores (+1 for wrk). I don't see any improvement in reqs per sec, although not sure if that's a good enough test.

AFAIR it's not bottlenecked on the write barrier. Also, presumably your PR is supposed to decrease the average GC pause rather than improve the write barrier's throughput? So you might want to look at the GC stats. orchard.sh should have a USE_DOTNET_TRACE property that you need to set to 1 to grab traces (and set DOTNET_TRACE_ARGS to listen to GC events specifically).

EgorBo · 2025-01-23T15:44:18Z

@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;

    static object Value = new();

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference
        Dst1 = new object();
    }
}

EgorBo · 2025-01-23T16:48:11Z

I guess it's sort of expected that it's slower throughput-wise in microbenchmarks. The WB_nonephemeral perf is mostly here: https://gist.github.com/EgorBot/a6db6579aba05de6a25f111513cb54b2#file-diff_asm_bcd38073-asm-L30 which is, I guess,

    // Check whether the region we're storing into is gen 0 - nothing to do in this case
    ldrb w12, [x12]
    cbz  w12, LOCAL_LABEL(Exit)

(I guess I should've added an extra benchmark where the object we're storing is gen2)

PS: feel free to call the bot yourself if needed

src/coreclr/vm/gcenv.ee.cpp

mrsharm · 2025-01-24T16:08:19Z

> FYI - @Maoni0 @mrsharm @cshung - what preliminary tests can we run to validate the performance impact?

Sorry for the delay. I would run the microbenchmarks listed below with and without this change on the pertinent hardware, for a sufficient number of iterations (as some of these exhibit a considerable amount of variance). The other consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand. This can be achieved by:

- Not removing the outliers: --outliers DontRemove.
- Setting a fixed number of invocations that'll be high enough to reduce the standard error: --invocationCount {InvocationCount}
- Setting a fixed number of iterations: --iterationCount 20.

- System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536*)
- System.Tests.Perf_GC<Byte>.AllocateArray(length: 1000, *)
- System.Tests.Perf_GC<Char>.AllocateArray(length: 1000, *)
- System.Tests.Perf_GC<Byte>.AllocateArray(length: 10000, *)
- System.Tests.Perf_GC<Char>.AllocateArray(length: 10000, *)
- System.Tests.Perf_GC<Byte>.AllocateUninitializedArray(length: 1000, *)
- System.Tests.Perf_GC<Char>.AllocateUninitializedArray(length: 1000, *)
- System.Tests.Perf_GC<Byte>.AllocateUninitializedArray(length: 10000, *)
- System.Tests.Perf_GC<Char>.AllocateUninitializedArray(length: 10000, *)
- System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 1000)
- System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 10000)
- System.Tests.Perf_GC<Char>.NewOperator_Array(length: 1000)
- System.Tests.Perf_GC<Char>.NewOperator_Array(length: 10000)
- System.IO.Tests.Perf_File.ReadAllBytesAsync(size: 104857600)
- System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536*)
- System.Collections.CtorGivenSize<String>.Array(size: 512)
- ByteMark.BenchBitOps
- System.IO.Tests.Perf_File.ReadAllBytes(size: 104857600)
- System.Linq.Tests.Perf_Enumerable.ToArray*
- System.Collections.Tests.Perf_BitArray.BitArrayByteArrayCtor(size: 512)

Once the microbenchmarks are run, the pertinent metrics would be the % difference in the time of execution of a test + the standard error of tests.

As a note, see the following for the regression that was created when we moved to a More Precise Write Barrier for x64: #73783. It seems that one of the affected microbenchmarks is already in the aforementioned list. I remember StackWalk being extremely volatile, but it is still worth trying.

cshung · 2025-01-24T18:45:02Z

As we run the benchmarks, I would pay attention to ephemeral GC pause time, in particular the time spent on marking cards.

a74nh · 2025-01-27T12:48:23Z

> Sorry for the delay. I would run the microbenchmarks listed below with and without this change on the pertinent hardware, for a sufficient number of iterations (as some of these exhibit a considerable amount of variance). The other consideration while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand. This can be achieved by:

Running most of the tests as suggested, I don't see any differences. Everything seems within error margins:



| Method                     | Job        | Toolchain                                                                          | length | pinned | Mean        | Error     | StdDev    | Median      | Min         | Max        | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Gen1   | Gen2   | Allocated | Alloc Ratio |
|--------------------------- |----------- |----------------------------------------------------------------------------------- |------- |------- |------------:|----------:|----------:|------------:|------------:|-----------:|------:|---------------- |--------:|-------:|-------:|-------:|----------:|------------:|
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | False  |   129.78 ns | 53.253 ns | 61.326 ns |   118.07 ns |   108.50 ns |   388.8 ns |  1.08 | Baseline        |    0.54 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | False  |   137.49 ns | 53.415 ns | 61.512 ns |   125.97 ns |   116.80 ns |   396.9 ns |  1.15 | Same            |    0.54 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | False  |   103.60 ns | 51.462 ns | 59.263 ns |    89.10 ns |    88.63 ns |   354.8 ns |  1.11 | Baseline        |    0.66 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | False  |   103.35 ns | 51.294 ns | 59.070 ns |    88.76 ns |    88.21 ns |   353.4 ns |  1.10 | Same            |    0.65 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | True   |   744.34 ns |  7.498 ns |  8.634 ns |   741.62 ns |   735.19 ns |   764.7 ns |  1.00 | Baseline        |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | True   |   743.07 ns |  9.170 ns | 10.561 ns |   740.52 ns |   732.56 ns |   763.7 ns |  1.00 | Same            |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | True   |   735.06 ns | 10.791 ns | 12.426 ns |   728.98 ns |   720.78 ns |   757.2 ns |  1.00 | Baseline        |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | True   |   748.82 ns |  8.844 ns | 10.185 ns |   743.99 ns |   736.23 ns |   767.8 ns |  1.02 | Same            |    0.02 | 0.6364 | 0.6364 | 0.6364 |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | False  |   626.94 ns | 39.042 ns | 44.961 ns |   618.03 ns |   588.73 ns |   805.0 ns |  1.00 | Baseline        |    0.09 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | False  |   623.92 ns | 74.318 ns | 85.585 ns |   601.31 ns |   589.99 ns |   983.1 ns |  1.00 | Same            |    0.15 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | False  |   142.84 ns | 17.866 ns | 20.575 ns |   138.18 ns |   134.39 ns |   228.9 ns |  1.01 | Baseline        |    0.17 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | False  |   149.25 ns | 16.513 ns | 19.016 ns |   146.35 ns |   137.79 ns |   227.3 ns |  1.06 | Same            |    0.16 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateArray              | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | True   | 2,592.21 ns | 32.371 ns | 37.278 ns | 2,585.44 ns | 2,550.16 ns | 2,707.3 ns |  1.00 | Baseline        |    0.02 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
| AllocateArray              | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | True   | 2,475.21 ns | 76.425 ns | 88.011 ns | 2,436.47 ns | 2,379.59 ns | 2,637.6 ns |  0.96 | Same            |    0.04 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| AllocateUninitializedArray | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | True   | 2,438.40 ns | 43.482 ns | 50.074 ns | 2,444.35 ns | 2,330.27 ns | 2,527.3 ns |  1.00 | Baseline        |    0.03 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
| AllocateUninitializedArray | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | True   | 2,449.01 ns | 35.429 ns | 40.800 ns | 2,448.20 ns | 2,338.34 ns | 2,520.9 ns |  1.00 | Same            |    0.03 | 6.3182 | 6.3182 | 6.3182 |  19.56 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| NewOperator_Array          | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1000   | ?      |    98.53 ns | 49.747 ns | 57.289 ns |    86.26 ns |    74.80 ns |   340.4 ns |  1.11 | Baseline        |    0.67 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
| NewOperator_Array          | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1000   | ?      |    95.01 ns | 48.560 ns | 55.922 ns |    80.60 ns |    79.98 ns |   331.4 ns |  1.07 | Same            |    0.66 | 0.0152 |      - |      - |   1.98 KB |        1.00 |
|                            |            |                                                                                    |        |        |             |           |           |             |             |            |       |                 |         |        |        |        |           |             |
| NewOperator_Array          | Job-CWSSJX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 10000  | ?      |   546.14 ns | 49.634 ns | 57.159 ns |   533.12 ns |   520.12 ns |   784.7 ns |  1.01 | Baseline        |    0.13 | 0.2879 |      - |      - |  19.55 KB |        1.00 |
| NewOperator_Array          | Job-EDJSHX | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 10000  | ?      |   551.71 ns | 52.751 ns | 60.748 ns |   537.58 ns |   528.97 ns |   807.3 ns |  1.02 | Same            |    0.13 | 0.2879 |      - |      - |  19.55 KB |        1.00 |


| Method | Job        | Toolchain                                                                          | arguments        | Mean        | Error      | StdDev     | Median      | Min         | Max         | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------- |----------- |----------------------------------------------------------------------------------- |----------------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1024,1024 bits   |   205.72 ns | 129.897 ns | 149.589 ns |    84.26 ns |    71.82 ns |   404.32 ns |  1.78 | Baseline        |    1.85 |      - |     160 B |        1.00 |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1024,1024 bits   |   203.72 ns | 129.080 ns | 148.649 ns |    83.54 ns |    72.15 ns |   400.73 ns |  1.76 | Same            |    1.84 |      - |     160 B |        1.00 |
|        |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 16,16 bits       |    25.58 ns |   0.439 ns |   0.505 ns |    25.63 ns |    23.68 ns |    26.00 ns |  1.00 | Baseline        |    0.03 |      - |         - |          NA |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 16,16 bits       |    24.67 ns |   1.307 ns |   1.506 ns |    24.99 ns |    21.90 ns |    26.31 ns |  0.97 | Same            |    0.06 |      - |         - |          NA |
|        |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Add    | Job-VIYVLB | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 65536,65536 bits | 3,591.60 ns |  74.221 ns |  85.473 ns | 3,559.69 ns | 3,555.19 ns | 3,919.99 ns |  1.00 | Baseline        |    0.03 | 0.1212 |    8224 B |        1.00 |
| Add    | Job-VRIONI | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 65536,65536 bits | 3,571.79 ns |  69.881 ns |  80.475 ns | 3,551.91 ns | 3,546.31 ns | 3,911.55 ns |  0.99 | Same            |    0.03 | 0.1212 |    8224 B |        1.00 |


| Method   | Job        | Toolchain                                                                          | arguments        | Mean        | Error      | StdDev     | Median      | Min         | Max         | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|--------- |----------- |----------------------------------------------------------------------------------- |----------------- |------------:|-----------:|-----------:|------------:|------------:|------------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 1024,1024 bits   |   145.80 ns | 116.856 ns | 134.571 ns |    72.70 ns |    72.08 ns |   426.39 ns |  1.59 | Baseline        |    1.70 |      - |     152 B |        1.00 |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 1024,1024 bits   |   143.24 ns | 118.524 ns | 136.493 ns |    72.22 ns |    71.90 ns |   431.54 ns |  1.57 | Same            |    1.72 |      - |     152 B |        1.00 |
|          |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 16,16 bits       |    26.41 ns |   0.836 ns |   0.963 ns |    26.88 ns |    24.34 ns |    27.34 ns |  1.00 | Baseline        |    0.05 |      - |         - |          NA |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 16,16 bits       |    26.22 ns |   0.666 ns |   0.767 ns |    26.29 ns |    24.35 ns |    27.18 ns |  0.99 | Same            |    0.05 |      - |         - |          NA |
|          |            |                                                                                    |                  |             |            |            |             |             |             |       |                 |         |        |           |             |
| Subtract | Job-KDZVCP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 65536,65536 bits | 3,483.97 ns |  61.051 ns |  70.306 ns | 3,466.17 ns | 3,458.38 ns | 3,780.31 ns |  1.00 | Baseline        |    0.03 | 0.1212 |    8216 B |        1.00 |
| Subtract | Job-KKPRIL | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 65536,65536 bits | 3,526.84 ns |  71.010 ns |  81.775 ns | 3,504.11 ns | 3,480.61 ns | 3,840.66 ns |  1.01 | Same            |    0.03 | 0.1212 |    8216 B |        1.00 |


| Method | Job        | Toolchain                                                                          | Size | Mean     | Error   | StdDev  | Median   | Min      | Max      | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|------- |----------- |----------------------------------------------------------------------------------- |----- |---------:|--------:|--------:|---------:|---------:|---------:|------:|---------------- |--------:|-------:|----------:|------------:|
| Array  | Job-CZKOLC | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 512  | 152.8 ns | 7.44 ns | 8.56 ns | 149.6 ns | 147.4 ns | 186.8 ns |  1.00 | Baseline        |    0.07 | 0.0606 |   4.02 KB |        1.00 |
| Array  | Job-FQHBTF | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 512  | 155.3 ns | 4.66 ns | 5.36 ns | 154.5 ns | 151.6 ns | 177.2 ns |  1.02 | Same            |    0.06 | 0.0606 |   4.02 KB |        1.00 |


| Method  | Job        | Toolchain                                                                          | input       | Mean      | Error    | StdDev    | Median    | Min       | Max       | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|-------- |----------- |----------------------------------------------------------------------------------- |------------ |----------:|---------:|----------:|----------:|----------:|----------:|------:|---------------- |--------:|-------:|----------:|------------:|
| ToArray | Job-QHOIJP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | ICollection |  41.88 ns | 9.097 ns | 10.476 ns |  37.78 ns |  36.30 ns |  80.16 ns |  1.04 | Baseline        |    0.30 | 0.0061 |     424 B |        1.00 |
| ToArray | Job-GOWGBS | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | ICollection |  43.15 ns | 9.478 ns | 10.915 ns |  36.91 ns |  36.21 ns |  79.58 ns |  1.07 | Same            |    0.31 | 0.0061 |     424 B |        1.00 |
|         |            |                                                                                    |             |           |          |           |           |           |           |       |                 |         |        |           |             |
| ToArray | Job-QHOIJP | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | IEnumerable | 287.98 ns | 5.110 ns |  5.885 ns | 286.38 ns | 285.61 ns | 312.59 ns |  1.00 | Baseline        |    0.03 | 0.0061 |     456 B |        1.00 |
| ToArray | Job-GOWGBS | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | IEnumerable | 289.73 ns | 4.845 ns |  5.580 ns | 287.99 ns | 287.74 ns | 313.07 ns |  1.01 | Same            |    0.03 | 0.0061 |     456 B |        1.00 |


| Method                | Job        | Toolchain                                                                          | Size | Mean      | Error     | StdDev    | Median    | Min       | Max       | Ratio | MannWhitney(2%) | RatioSD | Gen0   | Allocated | Alloc Ratio |
|---------------------- |----------- |----------------------------------------------------------------------------------- |----- |----------:|----------:|----------:|----------:|----------:|----------:|------:|---------------- |--------:|-------:|----------:|------------:|
| BitArrayByteArrayCtor | Job-WNOFTX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 4    |  21.91 ns | 11.631 ns | 13.395 ns |  14.75 ns |  14.65 ns |  57.33 ns |  1.24 | Baseline        |    0.88 |      - |      64 B |        1.00 |
| BitArrayByteArrayCtor | Job-QPXJRV | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 4    |  22.39 ns | 11.757 ns | 13.540 ns |  15.89 ns |  15.70 ns |  60.19 ns |  1.27 | Same            |    0.90 |      - |      64 B |        1.00 |
|                       |            |                                                                                    |      |           |           |           |           |           |           |       |                 |         |        |           |             |
| BitArrayByteArrayCtor | Job-WNOFTX | /runtime_base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | 512  | 142.08 ns |  5.946 ns |  6.848 ns | 140.73 ns | 138.30 ns | 170.18 ns |  1.00 | Baseline        |    0.06 | 0.0076 |     568 B |        1.00 |
| BitArrayByteArrayCtor | Job-QPXJRV | /runtime_table/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | 512  | 139.35 ns |  5.774 ns |  6.650 ns | 137.68 ns | 136.98 ns | 167.37 ns |  0.98 | Same            |    0.06 | 0.0076 |     568 B |        1.00 |
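One rough way to read "within error margins" from the tables above is to check whether the Mean ± Error intervals of the two toolchains overlap. The sketch below assumes the Error column is BenchmarkDotNet's half-width of the confidence interval around the mean; the type and function names are hypothetical, and this is a coarser heuristic than the MannWhitney column the tables already report.

```c
#include <assert.h>
#include <stdbool.h>

/* One benchmark row: the Mean and Error columns, both in the same unit. */
typedef struct {
    double mean;
    double error; /* assumed to be the half-width of the confidence interval */
} measurement;

/* True when [mean - error, mean + error] of a and b share at least one point. */
bool intervals_overlap(measurement a, measurement b)
{
    assert(a.error >= 0.0 && b.error >= 0.0);
    return a.mean - a.error <= b.mean + b.error &&
           b.mean - b.error <= a.mean + a.error;
}
```

For example, the first AllocateArray pair (129.78 ns ± 53.253 ns against 137.49 ns ± 53.415 ns) overlaps heavily, so the 8 ns gap between the means says little on its own.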

a74nh · 2025-03-20T14:29:09Z

@EgorBot -linux_ampere -linux_cobalt100 -windows_cobalt100 -profiler --envvars DOTNET_GCWriteBarrier:3

using BenchmarkDotNet.Attributes;

public class MyBench
{
    object Dst1;
    object Dst2;
    object Dst3;
    object Dst4;

    static object Value = new();

    static MyBench()
    {
        GC.Collect();
        GC.Collect();
    }

    [Benchmark]
    public void WB_nonephemeral()
    {
        // Write non-ephemeral reference
        Dst1 = Value;
        Dst2 = Value;
        Dst3 = Value;
        Dst4 = Value;
    }

    [Benchmark]
    public void WB_ephemeral()
    {
        // Write ephemeral reference
        Dst1 = new object();
    }
}

a74nh · 2025-03-20T16:20:54Z

WriteBarrier 3 results are a little better than expected. With this setting we're using the old writebarrier, except it has one or two fewer checks to do, and it's showing a gain. Oddly, Windows has a 20% gain for nonephemeral!

a74nh · 2025-03-21T15:16:15Z

I noticed that the ShadowUpdate code is never called, as g_GCShadow is always 0. It is only ever set if DOTNET_HeapVerify is set.

Removing the g_GCShadow checks from the writebarrier gives:

DOTNET_GCWriteBarrier=0

| Method | Job | Toolchain | Mean | Error | Ratio | Gen0 |
|--- |--- |--- |---:|---:|---:|---:|
| WB_nonephemeral | Job-HYVKPP | HEAD | 4.407 ns | 0.0572 ns | 1.00 | - |
| WB_nonephemeral | Job-HJOGLM | PR | 4.511 ns | 0.0579 ns | 1.02 | - |
| | | | | | | |
| WB_ephemeral | Job-HYVKPP | HEAD | 12.036 ns | 0.2587 ns | 1.00 | 0.0003 |
| WB_ephemeral | Job-HJOGLM | PR | 12.548 ns | 0.2553 ns | 1.04 | 0.0003 |

DOTNET_GCWriteBarrier=3

| Method | Job | Toolchain | Mean | Error | Ratio | Gen0 |
|--- |--- |--- |---:|---:|---:|---:|
| WB_nonephemeral | Job-BEPCOQ | HEAD | 4.421 ns | 0.1165 ns | 1.00 | - |
| WB_nonephemeral | Job-VWEJCQ | PR | 3.195 ns | 0.0033 ns | 0.72 | - |
| | | | | | | |
| WB_ephemeral | Job-BEPCOQ | HEAD | 11.898 ns | 0.0826 ns | 1.00 | 0.0003 |
| WB_ephemeral | Job-VWEJCQ | PR | 11.674 ns | 0.1842 ns | 0.98 | 0.0003 |

It has removed all the slowdown added by this PR, and given additional perf when writebarrier=3.

Looking at Amd64, when g_GCShadow is set, it uses JIT_WriteBarrier_Debug in jithelpers_slow.S. Annoyingly it's another complete copy of the writebarrier function. I'll look at doing something similar for Arm64, either by doing it the same way or by extending the WriteBarrierManager to switch on shadow too, giving us 16 functions. Either way I want to write the assembly using the macros to avoid copy/paste errors.

(Note I'll be away for 2 weeks, so will implement when I get back)

a74nh · 2025-03-21T15:55:32Z

I did some runs of Orchard CMS based on Egor's script, on Cobalt 100:

HEAD:
Requests/sec: 5171.91
Requests/sec: 5201.02
Requests/sec: 5235.64

PR:
Requests/sec: 5326.45
Requests/sec: 5309.99
Requests/sec: 5298.49

So a couple of percent better overall with the PR.

I also tried with the GCShadow checks removed, but the figures look identical to the PR.
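
For reference, the "couple of percent" figure can be reproduced from the six Requests/sec samples quoted above. The following is a minimal sketch; the helper names (`Mean`, `PercentChange`, `OrchardGainPercent`) are hypothetical and not part of the runtime or of the benchmark script.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of a set of throughput samples (Requests/sec).
// Assumes a non-empty vector; an empty one would divide by zero.
inline double Mean(const std::vector<double>& samples) {
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Relative change of `candidate` over `baseline`, in percent.
inline double PercentChange(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

// Requests/sec figures copied from the Orchard CMS runs in this comment.
const std::vector<double> kHeadRuns = {5171.91, 5201.02, 5235.64};
const std::vector<double> kPrRuns   = {5326.45, 5309.99, 5298.49};

// Hypothetical convenience wrapper: mean PR throughput relative to mean HEAD.
inline double OrchardGainPercent() {
    return PercentChange(Mean(kHeadRuns), Mean(kPrRuns));
}
```

With these samples the means are about 5202.9 (HEAD) and 5311.6 (PR) Requests/sec, a gain of roughly 2.1%. Three runs per side is a small sample, so this is only a rough indication.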

jkotas · 2025-03-23T02:11:14Z

> I tried with the GCShadow checks removed, but figures looks identical to the PR.

GCShadow should be present in debug and checked builds of the runtime only. They should not be present in release builds of the runtime.

I assume that all perf measurements are done on a release build. Is that correct? So it makes sense that removing GCShadow checks has no impact on the results.

a74nh · 2025-03-26T15:01:17Z

> > I tried with the GCShadow checks removed, but figures looks identical to the PR.
>
> GCShadow should be present in debug and checked builds of the runtime only. They should not be present in release builds of the runtime.
>
> I assume that all perf measurements are done on a release build. Is that correct? So it makes sense that removing GCShadow checks has no impact on the results.

Yes, on a release build WRITE_BARRIER_CHECK shouldn't be defined. I'll double check to make sure I've been using release for the micro benchmarks.
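
To illustrate why removing the shadow checks cannot change release numbers, here is a minimal C++ sketch of a barrier whose shadow-heap update is compiled in only when `WRITE_BARRIER_CHECK` is defined. Only the macro name comes from this thread. `ToyHeap`, `ToyWriteBarrier`, the field names, and the sizes are hypothetical, and the real barrier is hand-written assembly that also performs the ephemeral and region checks, which are omitted here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified model; names and sizes are illustrative only.
constexpr std::size_t kHeapSlots = 1024;   // reference slots in the toy heap
constexpr std::size_t kSlotsPerCard = 64;  // slots covered by one card byte

struct ToyHeap {
    std::uintptr_t slots[kHeapSlots] = {};
    std::uint8_t cards[kHeapSlots / kSlotsPerCard] = {};
#ifdef WRITE_BARRIER_CHECK
    // Exists only in builds that define WRITE_BARRIER_CHECK (debug/checked).
    std::uintptr_t shadow[kHeapSlots] = {};
    bool shadowEnabled = false;  // stands in for a runtime switch like g_GCShadow
#endif
};

// Store a reference, optionally mirror it to the shadow heap, then mark the card.
inline void ToyWriteBarrier(ToyHeap& heap, std::size_t index, std::uintptr_t value) {
    heap.slots[index] = value;
#ifdef WRITE_BARRIER_CHECK
    // In a release build this block is not compiled at all.
    if (heap.shadowEnabled) {
        heap.shadow[index] = value;
    }
#endif
    // Byte card marking: write only if the card is not already set.
    std::uint8_t& card = heap.cards[index / kSlotsPerCard];
    if (card != 0xFF) {
        card = 0xFF;
    }
}
```

Built without `-DWRITE_BARRIER_CHECK`, the function contains no shadow code at all, so removing it from the source leaves a release build unchanged. In a build that does define the macro, the runtime flag is still tested on every store, which would be consistent with the difference seen in the Checked-build numbers.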

a74nh · 2025-04-07T11:00:41Z

The Orchard CMS results were using a Release build, so the figures above, with the ~100 Requests/sec improvement, are correct.

However, my Ephemeral tests were using a Checked build. Here are the results using a Release build. These match the EgorBot results better.

Method	Toolchain	Mean	Error	Ratio	Gen0
WB_nonephemeral	HEAD	3.742 ns	0.0046 ns	1.00	-
WB_nonephemeral	PR	4.489 ns	0.0077 ns	1.20	-

WB_ephemeral	HEAD	5.314 ns	0.1318 ns	1.00	0.0004
WB_ephemeral	PR	6.176 ns	0.0531 ns	1.16	0.0003

With DOTNET_GCWriteBarrier=3

Method	Toolchain	Mean	Error	Ratio	Gen0
WB_nonephemeral	HEAD	3.742 ns	0.0077 ns	1.00	-
WB_nonephemeral	PR	3.168 ns	0.0063 ns	0.85	-

WB_ephemeral	HEAD	5.467 ns	0.0790 ns	1.00	0.0004
WB_ephemeral	PR	5.409 ns	0.0538 ns	0.99	0.0003

a74nh · 2025-04-07T11:02:15Z

Is there any additional testing anyone wanted?

Maoni0 · 2025-04-14T22:57:43Z

I'm back from vacation and have asked @a74nh to please edit the original description of this PR to include a summary of the perf results so we'll have an easier time to know the perf behavior (instead of having to read many comments on the PR).

a74nh · 2025-04-17T13:33:52Z

Test Results

This comment will be extended as I gather results. It contains more details for the perf results in the top message. I intend to keep this comment up to date with the latest results.

All run on an 8 core Cobalt 100, Ubuntu 24.04.2

Ephemeral test (dotnet/performance)

Method	Toolchain	Mean	Error	Ratio	Gen0
WB_nonephemeral	HEAD	3.742 ns	0.0046 ns	1.00	-
WB_nonephemeral	PR	4.489 ns	0.0077 ns	1.20	-

WB_ephemeral	HEAD	5.314 ns	0.1318 ns	1.00	0.0004
WB_ephemeral	PR	6.176 ns	0.0531 ns	1.16	0.0003

With DOTNET_GCWriteBarrier=3

Method	Toolchain	Mean	Error	Ratio	Gen0
WB_nonephemeral	HEAD	3.742 ns	0.0077 ns	1.00	-
WB_nonephemeral	PR	3.168 ns	0.0063 ns	0.85	-

WB_ephemeral	HEAD	5.467 ns	0.0790 ns	1.00	0.0004
WB_ephemeral	PR	5.409 ns	0.0538 ns	0.99	0.0003

GCPerfsim

Flags: -tc 2 -tagb 200 -tlgb 2 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0

No environment variables set (bit region write barriers):
AverageGen0PauseTimeDiffPercentage -21.06%
AverageGen1PauseTimeDiffPercentage -14.25%
AverageGen0Count: 2624 -> 2744
AverageGen1Count: 680 -> 673

DOTNET_GCWriteBarrier=2 (byte region write barriers):
AverageGen0PauseTimeDiffPercentage -6.7%
AverageGen1PauseTimeDiffPercentage -2.78%
AverageGen0Count: 3048 -> 3044
AverageGen1Count: 659 -> 659

DOTNET_GCWriteBarrier=3 (server write barriers):
AverageGen0PauseTimeDiffPercentage -1.37%
AverageGen1PauseTimeDiffPercentage -1.26%
AverageGen0Count: 3047 -> 3048
AverageGen1Count: 660 -> 658

DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
AverageGen0PauseTimeDiffPercentage -7.24%
AverageGen1PauseTimeDiffPercentage -3.49%
AverageGen0Count: 239 -> 239
AverageGen1Count: 81 -> 81

Flags: -tc 2 -tagb 200 -tlgb 8 -lohpi 0 -sohsi 50 -ramb 20 -rlmb 0.2 -sohpi 0

No environment variables set (bit region write barriers):
AverageGen0PauseTimeDiffPercentage -13.69%
AverageGen1PauseTimeDiffPercentage -5.7%
AverageGen0Count: 2957 -> 2957
AverageGen1Count: 750 -> 749

DOTNET_GCWriteBarrier=2 (byte region write barriers):
AverageGen0PauseTimeDiffPercentage -5.94%
AverageGen1PauseTimeDiffPercentage -1.19%
AverageGen0Count: 2958 -> 2959
AverageGen1Count: 749 -> 749

DOTNET_GCWriteBarrier=3 (server write barriers):
AverageGen0PauseTimeDiffPercentage +0.07%
AverageGen1PauseTimeDiffPercentage 0.00%
AverageGen0Count: 2960 -> 2957
AverageGen1Count: 748 -> 750

DOTNET_gcServer=1 DOTNET_GCHeapCount=8:
AverageGen0PauseTimeDiffPercentage -7.4%
AverageGen1PauseTimeDiffPercentage -3.04%
AverageGen0Count: 233 -> 233
AverageGen1Count: 81 -> 81

Orchard CMS benchmark

HEAD:
Requests/sec: 5171.91
Requests/sec: 5201.02
Requests/sec: 5235.64

PR:
Requests/sec: 5326.45
Requests/sec: 5309.99
Requests/sec: 5298.49

docs/design/coreclr/jit/GC-write-barriers.md

Maoni0 · 2025-04-23T19:54:20Z

@a74nh and I have been looking at the profiles and we need to do a new run, as the earlier runs were doing mostly gen1 GCs with very few gen0 GCs, which made the comparison not meaningful. We did notice a problem with the runs @a74nh did where the BGC pause times were much higher with the fix build, which I was going to take a look at.

a74nh · 2025-04-25T19:44:16Z

> @a74nh and I have been looking at the profiles and we need to do a new run as the runs from before was doing mostly gen1 GCs and there were very few gen0 GCs which made the comparison not meaningful. we did notice some problem with the runs @a74nh did where the BGC pause times were much higher with the fix build which I was going to take a look at.

The higher pause times were due to issues in the way the results were being gathered, which has now been fixed.

New runs of GCPerfsim have been done with a meaningful number of GCs.

Full results here: #111636 (comment)

The best result is -21.06% Gen0 pause time and -14.25% Gen1 pause time.

Meanwhile, DOTNET_GCWriteBarrier=3 is showing no change from HEAD (as we wanted).

A reduced version is in the top comment.

Maoni0

thanks so much, @a74nh, for your contribution and being patient with the perf data collection, discussion and meetings at odd hours :) this work is greatly appreciated!

Change-Id: Ia4f89dce9cb5aeedeeac16e54b7e35e9f255f68b

Arm64: Implement region write barriers

db6c2cf

ghost added the area-VM-coreclr label Jan 20, 2025

dotnet-policy-service bot added the community-contribution label (indicates that the PR has been added by a community member) Jan 20, 2025

a74nh added 2 commits January 21, 2025 11:31

Fix byte region barriers

5de7e0f

Fix bit region barriers

9315aa1

EgorBo reviewed Jan 21, 2025

src/coreclr/vm/arm64/patchedcode.S

jkotas reviewed Jan 21, 2025

src/coreclr/vm/arm64/patchedcode.S

a74nh added 2 commits January 22, 2025 10:29

test instead of cmp for bitwise write barriers

d0e46f0

use LSE to atomically update bitwise write barriers

cb83f53

jkotas reviewed Jan 22, 2025

src/coreclr/vm/arm64/asmhelpers.S

a74nh added 2 commits January 23, 2025 09:53

move atomics check into gcenv.ee.cpp

c615772

Skip ephemeral checks for regionless server GC, and refactor checks

1c865f1

EgorBo reviewed Jan 23, 2025

src/coreclr/vm/gcenv.ee.cpp

EgorBot mentioned this pull request Jan 23, 2025

EgorBot for EgorBo in #111636 EgorBot/runtime-utils#247

Open

kunalspathak added the arch-arm64 label Jan 23, 2025

jkotas reviewed Jan 23, 2025

src/coreclr/vm/gcenv.ee.cpp

Move ephemeral checks back

15dde1b

EgorBot mentioned this pull request Mar 20, 2025

Benchmarks for #111636 (a74nh) EgorBot/runtime-utils#324

Open

Merge main

1c80ab8

Add WRITE_BARRIER_CHECK and LSE atomics to the doc

aefc2ef

merge main

8e0dbe4

jkotas reviewed Apr 21, 2025

docs/design/coreclr/jit/GC-write-barriers.md

jkotas reviewed Apr 21, 2025

docs/design/coreclr/jit/GC-write-barriers.md

a74nh added 2 commits April 22, 2025 12:02

doc typo fixes

8d164ef

Merge main

8b02ab4

kunalspathak mentioned this pull request May 14, 2025

Improve Arm64 Performance in .NET 10 #109652

Open

17 tasks

Merge main

328024e

a74nh mentioned this pull request May 16, 2025

Write barrier without any RWX pages #114982

Merged

Maoni0 approved these changes May 16, 2025

Maoni0 enabled auto-merge (squash) May 16, 2025 23:20

Merge main

801b9cc

Maoni0 merged commit e2ad5fc into dotnet:main May 17, 2025
96 checks passed

a74nh deleted the precisewritebarriers_github branch May 17, 2025 08:46

LoopedBard3 mentioned this pull request May 22, 2025

[Perf] Linux/arm64: 2 Regressions on 5/19/2025 8:58:33 PM +00:00 #115903

Closed

github-actions bot locked and limited conversation to collaborators Jun 17, 2025
