
Return buffer write barrier trade-off #111127

Closed
@NinoFloris


Methods returning struct types that are not returned in registers do so via a hidden return buffer: the caller passes a reference to the buffer and the callee writes the struct through it. Because this reference is opaque to the callee, the JIT must emit write barriers *if the struct contains references*, in case the buffer points into the heap. As long as the JIT manages to inline the callee (that is, it's not out of budget, blocked by virtual calls or EH, etc.), it is able to elide these barriers when it can see that the return buffer reference definitely points to the stack.
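A minimal C# sketch of the shape being described. The `Message` struct and the `ReturnBufferDemo` methods are hypothetical names made up for illustration. The struct is deliberately three pointer-sized fields so that it is likely to be returned through a return buffer rather than in registers. Whether a barrier is actually emitted depends on the runtime version, the platform ABI, and the JIT's inlining decisions, so treat the comments as expectations, not guarantees.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical struct: two references plus a long, assumed to be too large
// for register return on common ABIs.
public struct Message
{
    public string Topic;
    public string Payload;
    public long Sequence;
}

public static class ReturnBufferDemo
{
    // NoInlining keeps the return buffer opaque to the callee. The stores to
    // Topic and Payload go through that buffer, so the JIT may have to treat
    // them as potential heap writes and use write barriers.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Message CreateOpaque(string topic, string payload, long sequence)
    {
        return new Message { Topic = topic, Payload = payload, Sequence = sequence };
    }

    // AggressiveInlining is only a hint. If the call is inlined, the JIT can
    // see that the destination is a stack local and may elide the barriers.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Message CreateInlined(string topic, string payload, long sequence)
    {
        return new Message { Topic = topic, Payload = payload, Sequence = sequence };
    }

    public static long Use()
    {
        Message a = CreateOpaque("t", "p", 1);   // destination is a stack local
        Message b = CreateInlined("t", "p", 2);  // destination is a stack local
        return a.Sequence + b.Sequence;
    }
}
```

Comparing the disassembly of the two methods (for example with the `DOTNET_JitDisasm` environment variable) is one way to check what a given runtime actually emits.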

My theory is that the current trade-off of treating hidden return buffer references as managed is the wrong one for modern .NET code, where returning structs (including those containing references) has become much more common. It's an essential tool for high performance and low allocation code.

Any sort of 'good practice' code structuring, such as separating concerns for clarity or introducing an abstraction boundary, can easily introduce new stack frames. This adds write barrier costs for each additional frame, assuming some progressive handling of the same result is performed. Network protocols are a good example: it's natural to want high performance, no allocation per message, and progressive handling of concerns (framing, parsing, handling, out-of-band handling, etc.) spread across dependent methods, which causes successive barrier costs without ever touching the heap. Profiling this on arm64 - where write barriers are already more expensive (tracked in #109652) - on messages that are simple to parse and commonly small in size, write barriers accounted for an unfortunate portion of the total time spent.
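The layering described above can be sketched roughly as follows. All names here (`Frame`, `Parsed`, `Pipeline`) and the toy wire format are assumptions made up for illustration, not taken from any real protocol implementation. `NoInlining` stands in for whatever prevents inlining in real code (budget, virtual calls, EH), and the input buffer is assumed to be non-empty.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical result of the framing layer (contains a reference).
public struct Frame
{
    public byte[] Buffer;
    public int Offset;
    public int Length;
    public long StreamId;
}

// Hypothetical result of the parsing layer (contains two references).
public struct Parsed
{
    public string Command;
    public byte[] Body;
    public long StreamId;
}

public static class Pipeline
{
    // Layer 1: framing. Toy format: first byte is the stream id.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Frame ReadFrame(byte[] buffer)
    {
        return new Frame { Buffer = buffer, Offset = 1, Length = buffer.Length - 1, StreamId = buffer[0] };
    }

    // Layer 2: parsing. Consumes a Frame from its own stack, returns a Parsed
    // through another opaque return buffer.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Parsed Parse(byte[] buffer)
    {
        Frame frame = ReadFrame(buffer);
        return new Parsed
        {
            Command = frame.Length > 0 ? "data" : "ping",
            Body = frame.Buffer,
            StreamId = frame.StreamId
        };
    }

    // Layer 3: handling. Every struct in this chain lives in a stack local,
    // yet each non-inlined return writes references through an opaque buffer.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Parsed Handle(byte[] buffer)
    {
        Parsed parsed = Parse(buffer);
        return parsed;
    }
}
```

Nothing in this chain allocates or stores to a heap object, which is what makes any per-frame barrier cost surprising.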

Nested opaque IEnumerables are another good example, where IEnumerator<T>.Current calls can cause successive barrier costs. As I understand it, PGO guarded devirtualization is not a huge help here: it tracks concrete types not per call-site but per type, across all virtual calls involving that type. For megamorphic interfaces like IEnumerable it's extremely unlikely that the most globally common type(s) - which the JIT could then lead down the optimized inlined path - will be selected more than once in a nested enumerator call stack.
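A small sketch of the nested enumerator shape, assuming a hypothetical `Entry` element struct and a hypothetical `PassThrough` wrapper. Each wrapper's `Current` forwards to the inner enumerator through an interface call, so each level returns the struct through its own return buffer unless the JIT manages to devirtualize and inline it.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical element type: two references plus a long.
public readonly struct Entry
{
    public Entry(string key, string value, long stamp)
    {
        Key = key;
        Value = value;
        Stamp = stamp;
    }

    public string Key { get; }
    public string Value { get; }
    public long Stamp { get; }
}

// Hypothetical wrapper that hides the concrete type of the inner sequence.
public sealed class PassThrough : IEnumerable<Entry>
{
    private readonly IEnumerable<Entry> _inner;

    public PassThrough(IEnumerable<Entry> inner) => _inner = inner;

    public IEnumerator<Entry> GetEnumerator() => new Enumerator(_inner.GetEnumerator());

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    private sealed class Enumerator : IEnumerator<Entry>
    {
        private readonly IEnumerator<Entry> _inner;

        public Enumerator(IEnumerator<Entry> inner) => _inner = inner;

        // Interface call on an opaque enumerator: the Entry is returned by
        // the inner Current and then returned again by this one.
        public Entry Current => _inner.Current;

        object IEnumerator.Current => Current;

        public bool MoveNext() => _inner.MoveNext();

        public void Reset() => _inner.Reset();

        public void Dispose() => _inner.Dispose();
    }
}
```

Wrapping a source twice, as in `new PassThrough(new PassThrough(list))`, gives two such forwarding levels per element.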

I've gathered that return buffer references are only managed so that the caller can pass a reference to a field stored on the heap directly as the return buffer. If that optimization were given up (i.e. such an assignment would require a stack to heap copy) in exchange for struct returns no longer introducing write barriers at all, my expectation is that this would lead to an overall performance improvement.
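To make the two sides of the trade-off concrete, here is a sketch of the heap field assignment in question, using hypothetical `Snapshot`, `Holder`, and `HeapAssign` names. The comments describe the behavior as understood in the text above, which is an assumption about the JIT rather than something this code can demonstrate. Writing the temporary explicitly in source does not by itself control what the JIT emits.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical struct containing references.
public struct Snapshot
{
    public string Topic;
    public string Payload;
    public long Sequence;
}

// Hypothetical heap object with a struct field.
public sealed class Holder
{
    public Snapshot Last;
}

public static class HeapAssign
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Snapshot Produce(string topic, long sequence)
    {
        return new Snapshot { Topic = topic, Payload = "payload", Sequence = sequence };
    }

    // Shape A: the result goes straight into a heap field. As understood
    // above, this is the case where the caller could hand the field's
    // address to the callee as the return buffer.
    public static void StoreDirect(Holder holder)
    {
        holder.Last = Produce("direct", 1);
    }

    // Shape B: conceptually what the proposed trade-off would always do.
    // The callee writes to a stack temporary, then the caller copies it to
    // the heap field (the copy is where barriers would be paid).
    public static void StoreViaLocal(Holder holder)
    {
        Snapshot tmp = Produce("local", 2);
        holder.Last = tmp;
    }
}
```

Both shapes leave the same value in `holder.Last`. The difference is only in where the cost of writing references to the heap would be paid.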

If it turns out to be a wash, the trade-off might still be beneficial for predictability. It was surprising to me (and others /cc @neon-sunset #92662) that stack-only code involves write barriers at all. It's a non-obvious performance cost when this kind of code is structured and written in the obvious way. It would surprise me much more than an additional copy for a heap field assignment from a callee return value would, as such code visibly communicates that it operates across the heap and the stack.

Metadata

Labels

area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
tenet-performance: Performance related issue
