Description
During the original development of the arm32 product, it was decided that the lr register would always be stored to the stack in the prolog to support return address hijacking for GC suspension. There is this comment in the JIT (CodeGen::genPushCalleeSavedRegisters()):
// It may be possible to skip pushing/popping lr for leaf methods. However, such optimization would require
// changes in GC suspension architecture.
//
// We would need to guarantee that a tight loop calling a virtual leaf method can be suspended for GC. Today, we
// generate partially interruptible code for both the method that contains the tight loop with the call and the leaf
// method. GC suspension depends on return address hijacking in this case. Return address hijacking depends
// on the return address to be saved on the stack. If we skipped pushing/popping lr, the return address would never
// be saved on the stack and the GC suspension would time out.
//
// So if we wanted to skip pushing/popping lr for leaf frames, we would also need to do one of
// the following to make GC suspension work in the above scenario:
// - Make return address hijacking work even when lr is not saved on the stack.
// - Generate fully interruptible code for loops that contains calls
// - Generate fully interruptible code for leaf methods
//
// Given the limited benefit from this optimization (<10k for mscorlib NGen image), the extra complexity
// is not worth it.
This decision was maintained when arm64 support was added.
Should this decision be reconsidered?
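To make the dependency in the JIT comment concrete, here is a toy model of return address hijacking. It is only a sketch: the `Frame` class, `try_hijack`, and the `gc_suspend_stub` name are hypothetical and do not correspond to runtime APIs. It assumes hijacking works solely by overwriting a return address slot on the stack, so it deliberately ignores the first alternative listed in the comment (making hijacking work when lr is not saved).

```python
HIJACK_STUB = "gc_suspend_stub"  # hypothetical name for the hijack target


class Frame:
    """Toy frame: the return address is in lr and, optionally, on the stack."""

    def __init__(self, name, return_address, lr_saved_on_stack):
        self.name = name
        self.lr = return_address
        # Stack slot holding the return address, or None if it lives only in lr.
        self.stack_slot = return_address if lr_saved_on_stack else None

    def return_target(self):
        # The epilog reloads lr from the stack slot when one exists.
        return self.stack_slot if self.stack_slot is not None else self.lr


def try_hijack(frame):
    """Overwrite the saved return address; fail when there is no stack slot."""
    if frame.stack_slot is None:
        return False
    frame.stack_slot = HIJACK_STUB
    return True


framed_leaf = Frame("Leaf", "caller+0x10", lr_saved_on_stack=True)
frameless_leaf = Frame("Leaf", "caller+0x10", lr_saved_on_stack=False)

print(try_hijack(framed_leaf), framed_leaf.return_target())
print(try_hijack(frameless_leaf), frameless_leaf.return_target())
```

In this model the framed leaf returns into the stub, while the frameless leaf returns straight to its caller and the suspension attempt has nothing to patch.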
For arm64, empty methods have this minimum code size:
stp fp, lr, [sp,#-16]!
mov fp, sp
ldp fp, lr, [sp],#16
ret lr
Question: if function A were a loop with a lot of expensive computation (say, 1000 divides) and a single call to a trivial function (that is fully interruptible?), then the expensive loop would be partially interruptible due to the call. But there will be only one instruction in the call that is a GC safe point (or maybe two?). Isn't GC starvation likely in this scenario?
Even simple leaf methods require this prolog/epilog, e.g.:
G_M1350_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
G_M1350_IG02:
F9401400 ldr x0, [x0,#40]
G_M1350_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
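As a rough illustration of the per-method cost, the arithmetic for the listing above can be sketched as follows. The assumption (not confirmed anywhere in this issue) is that a frameless version of this leaf would drop exactly the `stp`, `mov fp, sp`, and `ldp` instructions and keep the rest unchanged.

```python
# Instruction listing copied from the leaf method above: (encoding, text).
LEAF_METHOD = [
    ("A9BF7BFD", "stp fp, lr, [sp,#-16]!"),
    ("910003FD", "mov fp, sp"),
    ("F9401400", "ldr x0, [x0,#40]"),
    ("A8C17BFD", "ldp fp, lr, [sp],#16"),
    ("D65F03C0", "ret lr"),
]

# Assumption: these are the only instructions a frameless leaf would omit.
FRAME_INSTRUCTIONS = {
    "stp fp, lr, [sp,#-16]!",
    "mov fp, sp",
    "ldp fp, lr, [sp],#16",
}


def code_size(instructions):
    """Size in bytes; each 8-hex-digit encoding is one 4-byte instruction."""
    return sum(len(encoding) // 2 for encoding, _ in instructions)


frame_only = [i for i in LEAF_METHOD if i[1] in FRAME_INSTRUCTIONS]
total_bytes = code_size(LEAF_METHOD)
frame_bytes = code_size(frame_only)

print(f"total: {total_bytes} bytes, frame setup/teardown: {frame_bytes} bytes")
# prints "total: 20 bytes, frame setup/teardown: 12 bytes"
```

Under that assumption, 12 of the 20 bytes in this method are frame setup and teardown.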
One argument for keeping the current behavior is that simple leaf methods are likely to be inlined into their callers, so this prolog/epilog overhead is rarely encountered in real programs.
The overhead measurement (<10k for the mscorlib NGen image in the comment above; mscorlib is now System.Private.CoreLib) could be recomputed for arm64 and the current codebase, and extended to other libraries as well.
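One possible way to approximate that measurement is to scan a disassembly dump and sum the frame instructions in methods that make no calls. The sketch below is hypothetical: it assumes the dump looks like the listing earlier in this issue (`G_M<id>_IG<n>:` labels followed by `<encoding> <instruction>` lines), treats a method as a leaf when it contains no `bl` or `blr`, and assumes only the three frame instructions shown would be saved. Real dump formats and real prologs vary, so the result would be an estimate at best.

```python
import re
from collections import defaultdict

LABEL = re.compile(r"^G_M(\d+)_IG\d+:$")  # assumed label shape, as in the listing
FRAME_INSTRUCTIONS = {
    "stp fp, lr, [sp,#-16]!",
    "mov fp, sp",
    "ldp fp, lr, [sp],#16",
}
CALL_MNEMONICS = {"bl", "blr"}


def frame_overhead_bytes(listing):
    """Sum the bytes of frame instructions across methods that make no calls."""
    methods = defaultdict(list)
    current = None
    for raw in listing.splitlines():
        line = raw.strip()
        match = LABEL.match(line)
        if match:
            current = match.group(1)  # instruction groups share a method id
            continue
        if current is None or not line:
            continue
        _encoding, _, text = line.partition(" ")
        text = text.strip()
        if text:
            methods[current].append(text)

    overhead = 0
    for instructions in methods.values():
        is_leaf = not any(t.split()[0] in CALL_MNEMONICS for t in instructions)
        if is_leaf:
            overhead += 4 * sum(t in FRAME_INSTRUCTIONS for t in instructions)
    return overhead


SAMPLE = """G_M1350_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
G_M1350_IG02:
F9401400 ldr x0, [x0,#40]
G_M1350_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
"""

print(frame_overhead_bytes(SAMPLE))  # prints 12
```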
There might also be implications of not saving and establishing fp for debugging, stack walking, etc.
Comments? @jkotas @AndyAyersMS @kunalspathak
category:question
theme:prolog-epilog
skill-level:expert
cost:large