
[arm/arm64] Leaf frames, saving LR, and return address hijacking #35274

Open
@BruceForstall


During the original development of the arm32 product, it was decided that the lr register would always be stored to the stack in the prolog to support return address hijacking for GC suspension. There is this comment in the JIT (CodeGen::genPushCalleeSavedRegisters()):

// It may be possible to skip pushing/popping lr for leaf methods. However, such optimization would require
// changes in GC suspension architecture.
//
// We would need to guarantee that a tight loop calling a virtual leaf method can be suspended for GC. Today, we
// generate partially interruptible code for both the method that contains the tight loop with the call and the leaf
// method. GC suspension depends on return address hijacking in this case. Return address hijacking depends
// on the return address to be saved on the stack. If we skipped pushing/popping lr, the return address would never
// be saved on the stack and the GC suspension would time out.
//
// So if we wanted to skip pushing/popping lr for leaf frames, we would also need to do one of
// the following to make GC suspension work in the above scenario:
// - Make return address hijacking work even when lr is not saved on the stack.
// - Generate fully interruptible code for loops that contain calls
// - Generate fully interruptible code for leaf methods
//
// Given the limited benefit from this optimization (<10k for mscorlib NGen image), the extra complexity
// is not worth it.
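
The dependency described in this comment can be illustrated with a toy model. This is only a sketch, not the runtime's implementation: `Frame`, `try_hijack`, and `HIJACK_STUB` are hypothetical names, the stack is modeled as a plain Python list, and the model assumes the suspending thread can only patch memory, not the live `lr` register.

```python
# Toy model of return address hijacking. All names here are hypothetical
# and only illustrate the dependency described in the JIT comment.
from dataclasses import dataclass
from typing import List, Optional

HIJACK_STUB = 0xDEAD0000  # stand-in address of a GC suspension stub


@dataclass
class Frame:
    stack: List[int]         # simulated stack memory
    lr_slot: Optional[int]   # index where lr was saved, or None if never saved
    lr: int                  # live value of the lr register

    def return_target(self) -> int:
        # The epilog reloads lr from the stack if it was saved there;
        # otherwise `ret` uses the register value directly.
        if self.lr_slot is not None:
            return self.stack[self.lr_slot]
        return self.lr


def try_hijack(frame: Frame) -> bool:
    """Redirect the return to the stub by patching the saved stack slot.

    Modeling assumption: only memory can be patched, so a frame that never
    stored lr to the stack cannot be hijacked.
    """
    if frame.lr_slot is None:
        return False
    frame.stack[frame.lr_slot] = HIJACK_STUB
    return True


caller_address = 0x1000
saved = Frame(stack=[0, caller_address], lr_slot=1, lr=caller_address)
frameless = Frame(stack=[0], lr_slot=None, lr=caller_address)

print(try_hijack(saved), hex(saved.return_target()))          # True 0xdead0000
print(try_hijack(frameless), hex(frameless.return_target()))  # False 0x1000
```

In this model the frameless leaf returns straight to its caller, which is the situation the comment says would leave a tight loop of leaf calls unsuspendable unless one of the listed alternatives were implemented.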

This decision was maintained when arm64 support was added.

Should this decision be reconsidered?

For arm64, empty methods have this minimum code size:

stp     fp, lr, [sp,#-16]!
mov     fp, sp
ldp     fp, lr, [sp],#16
ret     lr

Question: suppose function A contains a loop with a lot of expensive computation (say, 1000 divides) and a single call to a trivial function (which is fully interruptible?). The expensive loop is then partially interruptible because of the call, but only one instruction in the call (or maybe two?) will be a GC safe point. Isn't GC starvation likely in this scenario?

Even simple leaf methods require this prolog/epilog, e.g.:

G_M1350_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp

G_M1350_IG02:
        F9401400          ldr     x0, [x0,#40]

G_M1350_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

One argument for keeping the current behavior is that simple leaf methods are likely to be inlined into their callers, so this prolog/epilog overhead isn't encountered in real programs.

The overhead measurement (<10k in the above comment, for mscorlib, which is now System.Private.CoreLib) could be recomputed for arm64 and the current codebase, and could include measurements of other libraries as well.
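
As a rough illustration of what such a recomputation would add up, here is a back-of-the-envelope sketch based on the listings above. It assumes that all three frame instructions (`stp`, `mov`, `ldp`) could be dropped from a leaf method and that alignment padding is ignored; the method count used in the example is a placeholder, not a measurement, and the function names are hypothetical.

```python
# Back-of-the-envelope estimate of prolog/epilog overhead in leaf methods.
# The method count passed in below is a placeholder, not measured data.

ARM64_INSTRUCTION_BYTES = 4  # A64 instructions are fixed-width (4 bytes)

# Frame instructions from the listings above that a frameless leaf would omit.
FRAME_INSTRUCTIONS = (
    "stp fp, lr, [sp,#-16]!",
    "mov fp, sp",
    "ldp fp, lr, [sp],#16",
)


def frame_overhead_bytes() -> int:
    """Bytes spent per method on saving lr/fp and establishing the frame."""
    return len(FRAME_INSTRUCTIONS) * ARM64_INSTRUCTION_BYTES


def estimated_savings_bytes(leaf_method_count: int) -> int:
    """Total bytes saved if every counted leaf method dropped its frame."""
    return leaf_method_count * frame_overhead_bytes()


print(frame_overhead_bytes())         # 12
print(estimated_savings_bytes(1000))  # 12000
```

A real measurement would need the actual number of non-inlined leaf methods in each library, which this sketch does not attempt to supply.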

Not saving and establishing fp might also have implications for debugging, stack walking, etc.

Comments? @jkotas @AndyAyersMS @kunalspathak

category:question
theme:prolog-epilog
skill-level:expert
cost:large

Labels: JitUntriaged, arch-arm64, area-CodeGen-coreclr, optimization