gh-128563: A new tail-calling interpreter #128718
Conversation
Co-Authored-By: Garrett Gu <[email protected]>
This reverts commit b9bedb1.
This reverts commit 982c51d.
No, but the benchmarking infra can't benchmark this anyway, because it's opt-in. @mdboom perhaps we could add the configure option now?
@markshannon we can't do the |
Great work on this, @Fidget-Spinner!
Co-authored-by: Garrett Gu <[email protected]> Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com> Co-authored-by: Hugo van Kemenade <[email protected]>
Do these performance improvements affect only eval() or any execution of pyc code?

The pyc code uses the interpreter too. So any pyc code.
Will it be available for Windows users in the coming Python 3.14.0a5? Alphas are for breaking things.
I will add a Windows build option in a follow-up PR, but not in time for a5, I believe. Also, you'd need the clang-cl backend in MSBuild instead of MSVC to get this working. I am currently trying to persuade CPython to move over to clang-cl, as it seems there should be no ABI breakage and better performance. faster-cpython/ideas#690 (comment)
When compiling the computed-goto interpreter, every opcode implementation ends with an identical chunk of code, generated by the `DISPATCH()` macro. In some cases, the compiler is able to notice this, and replaces the code in one or more opcodes with a jump into the tail portion of a different opcode.

However, we specifically **don't** want that to happen; the entire premise of using computed gotos is to lift more information into the instruction pointer in order to give the hardware branch-target predictor more information to work with!

In my preliminary tests, this tail-merging of opcode implementations explains most of the performance improvement of the new tail-call interpreter (python#128718) -- compilers are much less willing to merge code across functions, and so the tail-call interpreter preserves all (or at least more) of the individual `DISPATCH` sites.

This change attempts to prevent the merging of `DISPATCH` calls by adding an (empty) `__asm__ volatile`, which acts as an opaque barrier to the optimizer, preventing it from considering all of these sequences as identical.
Posting here for visibility: I've been continuing to chase down the LLVM regression I identified in #129987. I've run benchmarks on both Intel Raptor Lake and Apple M1 hardware, comparing clang18, clang19, clang19+tailcalls, and clang19 with the regression worked around ("clang19.taildup" -- I'm using a

On my environment, I find that the primary benefit of the tail-call interpreter comes from reversing the LLVM 19 regression; contrary to my earlier results, I find the regression ends up costing around 10% performance on both platforms(!). I do see a 1-2% win, which is still impressive, although there are a number of sources of potential noise. Here are my headline results:
All builds use LTO and PGO. I've posted my benchmarking setup, including

I want to be clear that even if this is right, I still think this is great work, and expect that the tail-call interpreter is in many ways a more robust approach, with additional headroom for optimization. I just happened to stumble on something that didn't quite make sense to me, and doggedly did my best to run it down…
@nelhage thanks for all the investigation. I'm surprised that on modern hardware, computed gotos make a 10% difference. I was under the impression that modern literature suggests more of a 2-3% range. Perhaps the LLVM 19 bug is doing more than just tail CSE? In any case, I will try disabling computed gotos altogether and run it on the Faster CPython benchmarking machine. I plan to put up a notice anyway on the What's New saying that the perf numbers are inaccurate due to the LLVM bug.
@nelhage since our benchmarking infrastructure isn't as flexible as yours, would applying your original `asm volatile` patch be equivalent to fixing the tail duplicator on LLVM 19? I could apply it in our regex engine too, as it's the only other place computed gotos are used. I plan to bench it.
I'm also surprised! I would love someone to reproduce independently, because I am concerned my setup has somehow made a systematic error I'm not seeing. I did so many benchmarks in part to try to ensure they all paint a consistent picture. I think the clearest evidence I have is the "clang19.taildup" numbers. Those are generated by configuring using:

```
./configure [other flags] \
    "OPT=-g -O3 -Wall -mllvm -tail-dup-pred-size=5000" \
    "LDFLAGS=-fuse-ld=lld -Wl,-mllvm -Wl,-tail-dup-pred-size=5000"
```

I see better speedup numbers for that flag than I do for my
@nelhage I think the closest comparison we have is the results on the Faster CPython M1 machine, which uses Apple Clang (which should be LLVM 17) and computed gotos, versus Clang 19 with tail calls: https://github.com/faster-cpython/benchmarking-public (look at the graph labelled "Effect of build with latest clang and tailcall vs Tier 1"). Before our PGO bug that artificially boosted perf again, the perf gain for tail calling was only 5%, versus the 15% reported against Clang 19 base. So I believe the real speedup is in the 5% range, which corresponds roughly to your results. I will advocate to the team that we update the benchmarking results with GCC and Xcode Clang 17 as the baseline, which means a 3-5% speedup, not a 10% speedup. I will also edit all posts/comments/issues that I've made to warn users to take the numbers with a grain of salt, due to the LLVM bug. I will ask for consensus from the team first.
Yep, my numbers seem broadly consistent with a 3-5% improvement. I'm totally happy to let you and the team decide how much to update the messaging, and where. I've got a draft blog post I hope to release within a week or so, just because I find this interesting (and an interesting case study in how tricky benchmarking is!); I'll try to shoot you a draft before go-live to make sure it feels fair and accurate.
@nelhage could I trouble you to rerun the benchmarks with clang 20, with the computed-goto patches applied? It was just released yesterday, and I'm wondering if the tail-call interpreter performs better with clang 20.
FWIW, here again is the clang-cl data for the PGO Windows builds I did during #129907, from https://gist.github.com/chris-eibl/114a42f22563956fdb5cd0335b28c7ae, but this time compared against 18.1.8. 64-bit pyperformance results on my Windows 10 PC (i5-4570 CPU), run with
I think this fits your findings so far:
Big table

Only clang.pgo.cg.18.1.8.9db1a297d9 vs clang.pgo.tc.20.1.0.rc2.9db1a297d9 using
I tested with a pre-release clang 20 rc and saw comparable results to clang 19 (for both computed goto and tail-call). I'll see if I can test the 20.1 release once it lands in nixpkgs. llvm/llvm-project#114990 is the PR that fixes the LLVM regression, and it hasn't landed yet. I did test clang19 + that patch, and saw comparable performance to the
@chris-eibl Amazing, thanks for reposting/analyzing those. I agree those seem consistent with my results. Interesting that the bug makes computed gotos slower than the
Okay, this is pretty entertaining to me. It appears -- at least in my environment -- that clang18 is also able to tail-duplicate the dispatch even without
I wonder how long that's been true; it certainly complicates experiments that try to demonstrate the performance advantage of duplicating the dispatch, if the compiler will sometimes do it for you anyway. Clang 19 fails to tail-duplicate, just like it does with computed goto:
That does leave me confused about why @chris-eibl saw much better performance with clang19 without computed gotos. I'm running that benchmark in my own environment to replicate, but at a glance I notice that clang19 without computed gotos manages to tail-duplicate more of the dispatch logic, and maybe that helps the pipeline out somehow. E.g., merged tail on clang19, with computed gotos:
Merged tail on clang19, without computed gotos:
It replicates on my machine, more-or-less:
So clang19 must be doing something pathological to the computed gotos, beyond just the failure to tail-merge.
Features:
Preliminary benchmark results here https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250107-3.14.0a3+-f1d3190-CLANG
TLDR (all results are pyperformance, clang-19, with PGO + ThinLTO unless stated otherwise):
More recent benchmark results:
https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250116-3.14.0a4+-df5d01c-CLANG
This initial implementation focuses on correctness. There's still room to improve performance even further. I've detailed performance plans in the original issue.
CORRECTION NOTICE: We've since found a compiler bug in LLVM 19 that artificially boosted the new interpreter's numbers. The numbers are closer to a 3-5% geometric-mean speedup. I apologize for reporting incorrect figures previously due to the compiler bug.
Changeset:

- `configure.ac` to auto-detect when we can do this.
- `opcode` in the call arguments because it might be modified by instrumented instructions.

Credits also to Brandt and Savannah for the JIT workflow file.