Description
As a reminder: when compiling the interpreter using computed gotos, we emit a separate copy of the "dispatch" jump (currently `goto *opcode_targets[opcode]`) for each opcode implementation. The goal here -- as opposed to dispatching once at the top of the loop and jumping back to the top of the loop after each instruction -- is to expose more information to the hardware branch predictor (specifically, I believe, the branch target predictor, which tries to guess the destination of indirect branches).
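For illustration, here is a minimal, self-contained sketch of that pattern (the opcode names, `targets` table, and `run` function are made up for this example, not CPython's actual dispatch macros); each handler ends with its own copy of the indirect jump:

```c
/* Minimal sketch of the computed-goto pattern (illustrative only, not
 * CPython's actual code).  Uses the GCC/Clang "labels as values"
 * extension, as CPython does when computed gotos are enabled. */
#include <stdio.h>

static int run(const unsigned char *code) {
    /* One label per opcode; &&label takes the label's address. */
    static void *targets[] = { &&op_incr, &&op_print, &&op_halt };
    int acc = 0;
    const unsigned char *ip = code;

/* Each expansion of DISPATCH() is a separate indirect branch, so the
 * branch target predictor gets per-opcode history instead of one shared
 * (and therefore noisier) dispatch point at the top of a loop. */
#define DISPATCH() goto *targets[*ip++]

    DISPATCH();

op_incr:
    acc++;
    DISPATCH();
op_print:
    printf("%d\n", acc);
    DISPATCH();
op_halt:
    return acc;
}

int main(void) {
    const unsigned char program[] = { 0, 0, 1, 2 };  /* incr, incr, print, halt */
    return run(program) == 2 ? 0 : 1;
}
```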
However, the C compiler doesn't know that! It does, however, see that we've given it a C function containing a few hundred instances of identical `DISPATCH` code … and may choose to merge them together, replacing one or more instances with a jump into the tail end of a different opcode, thus undoing our careful work!
I suspect we can find a way to prevent the compiler from merging these branches, and thus restore branch-target-prediction behavior similar to that of the tail-call interpreter.
The branch above attempts to prevent this merging by adding an empty `__asm__ volatile` "optimization barrier" to each call site, in the hopes that the compiler treats this as opaque and refuses to merge them. I'm far from certain this is the best approach, but in my testing it seems to be a speedup -- I see 1.03x on pyperformance on my machine (`goto-mine` is the linked branch with the optimization barrier).
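As a sketch of what such a barrier might look like (illustrative only: it assumes a `DISPATCH()`-style macro and the existing `opcode_targets` table, and the linked branch may differ in detail):

```c
/* Illustrative sketch: an empty volatile asm statement at each dispatch
 * site.  Volatile asm has unspecified side effects from the compiler's
 * point of view, so passes such as cross-jumping / tail merging are
 * expected to leave each copy in place rather than collapsing them into
 * one shared indirect jump. */
#define DISPATCH()                            \
    do {                                      \
        __asm__ volatile("");  /* barrier */  \
        goto *opcode_targets[opcode];         \
    } while (0)
```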
We can also observe the merging a bit more directly by counting the number of `DISPATCH` sites that remain after optimization, according to debug symbols:
On `main`:

```
$ objdump -S --disassemble=_PyEval_EvalFrameDefault Python/ceval.o | grep -cF 'DISPATCH()'
47
```
On `nelhage/computed-goto-nomerge`:

```
$ objdump -S --disassemble=_PyEval_EvalFrameDefault Python/ceval.o | grep -cF 'DISPATCH()'
306
```
For comparison, `generated_cases.c.h` contains 227 instances on the same commit.