[arm/arm64] Leaf frames, saving LR, and return address hijacking

During the original development of the arm32 product, it was decided that the `lr` register would always be stored to the stack in the prolog to support return address hijacking for GC suspension. There is this comment in the JIT (`CodeGen::genPushCalleeSavedRegisters()`):
```
// It may be possible to skip pushing/popping lr for leaf methods. However, such optimization would require
// changes in GC suspension architecture.
//
// We would need to guarantee that a tight loop calling a virtual leaf method can be suspended for GC. Today, we
// generate partially interruptible code for both the method that contains the tight loop with the call and the leaf
// method. GC suspension depends on return address hijacking in this case. Return address hijacking depends
// on the return address to be saved on the stack. If we skipped pushing/popping lr, the return address would never
// be saved on the stack and the GC suspension would time out.
//
// So if we wanted to skip pushing/popping lr for leaf frames, we would also need to do one of
// the following to make GC suspension work in the above scenario:
// - Make return address hijacking work even when lr is not saved on the stack.
// - Generate fully interruptible code for loops that contain calls
// - Generate fully interruptible code for leaf methods
//
// Given the limited benefit from this optimization (<10k for mscorlib NGen image), the extra complexity
// is not worth it.
```
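To make the constraint in that comment concrete, here is a toy model of hijacking. It is only a sketch under stated assumptions: `ToyThread`, `ToyFrame`, and `TryHijack` are hypothetical names, not runtime APIs, and the real suspension code is far more involved. The point it illustrates is that hijacking patches a return address slot in memory, so a leaf that keeps its return address only in `lr` leaves nothing to patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy model only: every name below is hypothetical, not part of the runtime.
using Address = std::uint64_t;

struct ToyFrame {
    // Index of the stack slot where the prolog stored the return address,
    // or empty when a leaf method left the return address only in lr.
    std::optional<std::size_t> returnAddressSlot;
};

struct ToyThread {
    std::vector<Address> stack;  // simulated stack memory
    Address lr = 0;              // simulated link register
};

// Try to redirect the frame's return to hijackStub by patching the saved
// return address. Returns the original return address on success (so it
// could be restored later), or nullopt when there is no stack slot to patch.
std::optional<Address> TryHijack(ToyThread& thread, const ToyFrame& frame,
                                 Address hijackStub) {
    if (!frame.returnAddressSlot) {
        return std::nullopt;  // return address lives only in lr
    }
    assert(*frame.returnAddressSlot < thread.stack.size());
    Address original = thread.stack[*frame.returnAddressSlot];
    thread.stack[*frame.returnAddressSlot] = hijackStub;
    return original;
}
```

In this model, the first option listed in the comment ("make return address hijacking work even when lr is not saved") would amount to adding a second path that rewrites `thread.lr` instead of a stack slot.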

This decision was maintained when arm64 support was added.

Should this decision be reconsidered?

For arm64, empty methods have this minimum code size:
```
stp     fp, lr, [sp,#-16]!
mov     fp, sp
ldp     fp, lr, [sp],#16
ret     lr
```

**Question**: suppose function `A` contains a loop that does a lot of expensive computation (say, 1000 divides) and makes a single call to a trivial function (which is fully interruptible?). The expensive loop is then partially interruptible because of the call, but only one instruction in the call (or maybe two?) is a GC safe point. Isn't GC starvation likely in this scenario?

Even simple leaf methods require this prolog/epilog, e.g.:
```
G_M1350_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp

G_M1350_IG02:
        F9401400          ldr     x0, [x0,#40]

G_M1350_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
```

One argument for keeping this behavior is that simple leaf methods are likely to be inlined into their callers, so this prolog/epilog overhead isn't encountered in real programs.

The overhead measurement (<10k in the above comment, measured for mscorlib, which is now System.Private.CoreLib) could be recomputed for arm64 and the current codebase, and extended to other libraries as well.
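As a rough sketch of the arithmetic behind such a measurement: arm64 instructions are a fixed 4 bytes, and the leaf example above would lose three instructions (`stp`, `mov`, `ldp`) if neither `fp` nor `lr` had to be saved. The helper name `EstimatedSavingsBytes` is hypothetical, and the leaf-method count has to come from a real measurement. This also ignores method alignment padding, which could shrink the actual savings.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// A64 instructions are fixed-width.
constexpr std::uint64_t kArm64InstructionBytes = 4;
// Assumption: a frameless leaf drops stp, mov, and ldp but keeps ret.
constexpr std::uint64_t kFrameInstructionsPerLeaf = 3;

// Upper-bound estimate of code bytes saved for a given number of leaf
// methods that would no longer need the frame setup and teardown.
std::uint64_t EstimatedSavingsBytes(std::uint64_t leafMethodCount) {
    constexpr std::uint64_t perMethod =
        kFrameInstructionsPerLeaf * kArm64InstructionBytes;  // 12 bytes
    // Guard against overflow in the multiplication below.
    assert(leafMethodCount <= std::numeric_limits<std::uint64_t>::max() / perMethod);
    return leafMethodCount * perMethod;
}
```

Under these assumptions each eligible leaf method saves 12 bytes, so the total depends almost entirely on how many non-inlined leaf methods a library actually contains.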

Not saving and establishing `fp` might also have implications for debugging, stack walking, etc.

Comments? @jkotas @AndyAyersMS @kunalspathak 

category:question
theme:prolog-epilog
skill-level:expert
cost:large
