-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arm64: Implement region write barriers #111636
base: main
Are you sure you want to change the base?
Conversation
Tagging subscribers to this area: @mangod9 |
I also have a bunch of notes where I rewrote the AMD64 and ARM64 write barrier assembly in pseudo code. I'll tidy up and add somewhere in docs/ |
@a74nh I'm just curious, is this ready for benchmarks? (on linux-arm64) |
I think all the failures are fixed up now. So, yes, this would be a good time. If you've got something to run that'd be great. I've been using your |
Afair it's not bottle-necked in Write-Barrier + presumably, your PR is supposed to decrease average GC pause rather than WB's throughput? So you might want to look at the GC stats? the |
@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler using BenchmarkDotNet.Attributes;
public class MyBench
{
object Dst1;
object Dst2;
object Dst3;
object Dst4;
static object Value = new();
[Benchmark]
public void WB_nonephemeral()
{
// Write non-ephemeral reference
Dst1 = Value;
Dst2 = Value;
Dst3 = Value;
Dst4 = Value;
}
[Benchmark]
public void WB_ephemeral()
{
// Write non-ephemeral reference
Dst1 = new object();
}
} |
I guess it's sort of expected that it's slower throughput wise in microbenchmarks. the // Check whether the region we're storing into is gen 0 - nothing to do in this case
ldrb w12, [x12]
cbz w12, LOCAL_LABEL(Exit) (I guess I should've added an extra benchmark where object we're storing is gen2) PS: feel free to call the bot yourself if needed |
Sorry for the delay. I would run the microbenchmarks with and without this change on the pertinent hardware on the following tests given below for a sufficient number of iterations (as some of these exhibit a considerable amount of variance). The other considerations while running these is to ensure that the number of GCs is equivalent between the baseline and the comparand - this can be achieved by:
Once the microbenchmarks are run, the pertinent metrics would be the % difference in the time of execution of a test + the standard error of tests. As a note: the following for the regression that was created because of us moving to a More Precise Write Barrier for x64: #73783 - seems like one of the affected microbenchmarks is already in the aforementioned list. I remember |
As we run the benchmarks, I would pay attention to ephemeral GC pause time, in particular the time spent on marking cards. |
@EgorBot -linux_azure_cobalt100 -linux_azure_ampere -profiler using BenchmarkDotNet.Attributes;
public class MyBench
{
object Dst1;
object Dst2;
object Dst3;
object Dst4;
static object Value = new();
[Benchmark]
public void WB_nonephemeral()
{
// Write non-ephemeral reference
Dst1 = Value;
Dst2 = Value;
Dst3 = Value;
Dst4 = Value;
}
[Benchmark]
public void WB_ephemeral()
{
// Write non-ephemeral reference
Dst1 = new object();
}
} |
Running the tests from earlier. Cobalt 100 with 8 cores, 32Gb. For non-ephermeral, the performance drops by 30% (the same as what EgorBot had seen). But then implementing the writebarrier manager, that drop halves. Ephemeral has a 7% drop. Adding the writebarrier manager doesn't really effect that.
I ran the test again, but this time for the writebarriermanager, I forced it to never used the region barriers (by removing the g_region_shr check in the writebarriermanager). Now the for WB_nonephemeral, the performance is slightly faster when using writebarriermanager. Ephemeral has as an 8% speed up too. Given the drop on the original PR too vs HEAD, I'm happy to discount the ephemeral differences as within noise.
I'm happy with these results as region write barriers will always introduce some additional cost. This is coupled with the decrease in GC pause times (shown by the GC perf test here). Is that good enough for this PR to continue? (I still need to fix windows/OSX builds and do some additional functionality testing.) |
I wonder if the Allowing this will help us with mitigating the performance regression risk since then we can just advise to change the configuration settings as needed. // From gcconfig.h
enum WriteBarrierFlavor
{
WRITE_BARRIER_DEFAULT = 0,
WRITE_BARRIER_REGION_BIT = 1,
WRITE_BARRIER_REGION_BYTE = 2,
WRITE_BARRIER_SERVER = 3,
}; |
Yes, I get the same results when setting
Agreed, that makes sense. I'm guessing that is already suggested for X64 users. |
Had to do some reworking to work on MacOS, as MacOS does not allow Just Windows left to fix up. |
// Check this is going from old to young | ||
if reg_loc_dst >= reg_loc_ref: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we double check the comparsion direction here? I think the correct direction should be reversed.
// Check this is going from old to young | |
if reg_loc_dst >= reg_loc_ref: | |
return | |
// Return if the new reference is not from old to young | |
if reg_loc_ref >= reg_loc_dst: | |
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we double check the comparsion direction here? I think the correct direction should be reversed.
Yes, your version is correct. Fixed and updated in assembly files too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, your version is correct. Fixed and updated in assembly files too.
Thanks for confirming. I notice the changes you pushed is just about the comments. How about the assembly code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assembly code matches the comments :)
As discussed offline, will be good to do a comparison of workstation vs. server GC impact in pause time and performance numbers. |
With
With
With
With
As I hoped, GC server is giving similar results to GC workstation. This makes sense as server and workstation will be using the same writebarrier when GCWriteBarrier=0, and different but short versions when GCWriteBarrier=3 |
Extend the Arm64 writebarrier function to support regions. The assembly is updated similar to that for AMD64.
This is expected to make the writebarrier slower, but improve the performance of the GC.