gh-128563: A new tail-calling interpreter #128718

Merged: 127 commits merged into python:main on Feb 6, 2025

Fidget-Spinner
Member

@Fidget-Spinner Fidget-Spinner commented Jan 10, 2025

Features:

  • Significantly better performance on all 64-bit platforms that we care about.
  • Better debugging experience: locals are no longer optimized out in tail-call handlers. If you want to see a trace of instructions, you can simply disable tail calls.

Preliminary benchmark results here https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250107-3.14.0a3+-f1d3190-CLANG

TLDR (all results are pyperformance, clang-19, with PGO + ThinLTO unless stated otherwise):

  • 9.2% geomean faster AArch64 Ubuntu 22.04 ARM Neoverse N1. Up to 26% faster on Python workloads.
  • 7.4% geomean faster x86_64 Ubuntu 20.04 with Xeon W-2255. Up to 24% faster on Python workloads.
  • 10.8% geomean faster on x86_64 Windows 64-bit with i9-12900. Up to 37% faster on Python workloads. [No PGO]
  • 14.4% geomean slower on x86 Windows 32-bit with i9-12900 (same machine). Will turn off this option.
  • 14.7% geomean faster on AArch64 with macOS M1. Up to 45% faster on Python workloads.

More recent benchmark results:
https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250116-3.14.0a4+-df5d01c-CLANG

  • 8.5% geomean faster AArch64 with Ubuntu 22.04 ARM Neoverse N1. Up to 24% faster on Python workloads.
  • 9.3% geomean faster x86_64 Ubuntu 22.04 with i9-12900. Up to 24% faster on Python workloads.
  • 11.7% geomean faster on AArch64 with macOS M1. Up to 49% faster on Python workloads.
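For readers unfamiliar with the "geomean" figures, here is a minimal sketch of how a geometric-mean speedup can be computed from per-benchmark timings. The timings and the helper name `geomean_speedup` are hypothetical and are not taken from the benchmark runs linked above; the real numbers come from pyperformance's own tooling.

```python
import math

def geomean_speedup(base_times, new_times):
    """Geometric mean of per-benchmark speedups (base time / new time)."""
    if len(base_times) != len(new_times):
        raise ValueError("both builds must cover the same benchmarks")
    ratios = [b / n for b, n in zip(base_times, new_times)]
    # Averaging in log space keeps one outlier benchmark from dominating.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical timings in seconds for three benchmarks (not from the PR).
base = [1.00, 2.00, 4.00]
new = [0.80, 2.00, 3.20]

speedup = geomean_speedup(base, new)
print(f"{(speedup - 1) * 100:.1f}% geomean faster")
```

A geometric mean is used rather than an arithmetic mean because the inputs are ratios: a benchmark that doubles in speed and one that halves cancel out exactly.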

This initial implementation focuses on correctness. There's still room to improve performance even further. I've detailed performance plans in the original issue.

CORRECTION NOTICE: We've since found a compiler bug in LLVM 19 that artificially boosted the new interpreter's numbers. The numbers are closer to geomean 3-5% speedup. I apologize for reporting incorrect figures previously due to the compiler bug.

Changeset:

  • Added to configure.ac to auto-detect when we can do this.
  • We need opcode in the call arguments because it might be modified by instrumented instructions.
  • Autogenerated opcode tailcall handlers using existing bytecode DSL generator.
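To illustrate the shape of the design (one handler function per opcode, each handing the interpreter state and the opcode on to the next handler), here is a toy sketch. It is only an analogy: the real handlers are generated C that relies on compiler-guaranteed tail calls, whereas Python has no tail-call elimination, so a trampoline loop stands in for it here. The `Frame` class, the opcode names, and the handler names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """Hypothetical interpreter state passed from handler to handler."""
    code: list                      # flat list of (opcode, oparg) pairs
    pc: int = 0
    stack: list = field(default_factory=list)

def dispatch(frame):
    """Fetch the next instruction and describe the handler call to make."""
    opcode, oparg = frame.code[frame.pc]
    frame.pc += 1
    # The opcode travels with the call, mirroring the changeset note that
    # the opcode is passed in the call arguments.
    return HANDLERS[opcode], opcode, oparg

def op_push(frame, opcode, oparg):
    frame.stack.append(oparg)
    return dispatch(frame)          # in C this would be a guaranteed tail call

def op_add(frame, opcode, oparg):
    right = frame.stack.pop()
    left = frame.stack.pop()
    frame.stack.append(left + right)
    return dispatch(frame)

def op_return(frame, opcode, oparg):
    return None                     # no next handler: execution stops

HANDLERS = {"PUSH": op_push, "ADD": op_add, "RETURN": op_return}

def run(code):
    frame = Frame(code)
    step = dispatch(frame)
    while step is not None:         # trampoline standing in for tail calls
        handler, opcode, oparg = step
        step = handler(frame, opcode, oparg)
    return frame.stack.pop()

result = run([("PUSH", 2), ("PUSH", 40), ("ADD", None), ("RETURN", None)])
```

Because every handler is a separate function, each one ends with its own dispatch rather than returning to a shared loop, which is the structural difference from a switch-based interpreter.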

Credits also to Brandt and Savannah for the JIT workflow file.

@Fidget-Spinner
Member Author

Fidget-Spinner commented Feb 6, 2025

Do we have up to date performance numbers? No need to do that before merging, but it would be good to have the performance impact on record.

No, but the benchmarking infra can't benchmark this anyway, because it's opt-in. @mdboom perhaps we could add the configure option now?

@Fidget-Spinner
Member Author

@markshannon we can't do the int opcode plan because it breaks platforms without computed gotos it seems (like MSVC). So I'll just put an ifdef around it.

@Fidget-Spinner Fidget-Spinner merged commit cb640b6 into python:main Feb 6, 2025
70 checks passed
@Fidget-Spinner Fidget-Spinner deleted the tail-call branch February 6, 2025 15:22
@brandtbucher
Member

Great work on this, @Fidget-Spinner!

srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Feb 7, 2025
Co-authored-by: Garrett Gu <[email protected]>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Hugo van Kemenade <[email protected]>
cmaloney pushed a commit to cmaloney/cpython that referenced this pull request Feb 8, 2025
Co-authored-by: Garrett Gu <[email protected]>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Hugo van Kemenade <[email protected]>
@iperov

iperov commented Feb 9, 2025

do these performance improvements affect only eval() or any execution of pyc code?

@Fidget-Spinner
Member Author

do these performance improvements affect only eval() or any execution of pyc code?

The pyc code uses the interpreter too. So any pyc code.

@stonebig

stonebig commented Feb 9, 2025

will it be available for Windows users in coming Python-3.14.0a5 ? alphas are for breaking things

@Fidget-Spinner
Member Author

will it be available for Windows users in coming Python-3.14.0a5 ? alphas are for breaking things

I will add a Windows build option in a follow-up PR, but not in time for a5 I believe. Also, you'd need the clang-cl backend in MSBuild instead of MSVC to get this working. I am currently trying to persuade CPython to move over to clang-cl as it seems there should be no ABI breakage and better performance. faster-cpython/ideas#690 (comment)

nelhage added a commit to nelhage/cpython that referenced this pull request Feb 11, 2025
When compiling the computed-goto interpreter, every opcode
implementation ends with an identical chunk of code, generated by the
`DISPATCH()` macro. In some cases, the compiler is able to notice
this, and replaces the code in one or more opcodes with a jump into
the tail portion of a different opcode.

However, we specifically **don't** want that to happen; the entire
premise of using computed gotos is to lift more information into the
instruction pointer in order to give the hardware branch-target-
predictor more information to work with! In my preliminary tests, this
tail-merging of opcode implementations explains most of the
performance improvement of the new tail-call interpreter (python#128718) --
compilers are much less willing to merge code across functions, and so
the tail-call interpreter preserves all (or at least more) of the
individual `DISPATCH` sites.

This change attempts to prevent the merging of `DISPATCH` calls, by
adding an (empty) `__asm__ volatile`, which acts as an opaque barrier
to the optimizer, preventing it from considering all of these
sequences as identical.
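
The branch-prediction argument in this commit message can be illustrated with a deliberately simplified model. The sketch below assumes a predictor that remembers only the last target taken from each indirect-jump site; real hardware predictors also use branch history and are far more capable, so this overstates the effect. The opcode names and the trace are hypothetical.

```python
def mispredictions(trace, shared_dispatch):
    """Count wrong guesses of a last-target predictor over an opcode trace.

    The predictor remembers, per indirect-jump site, the target taken last
    time. With shared_dispatch=True every opcode jumps from one common site
    (the merged-tail case); otherwise each opcode has its own site.
    """
    last_target = {}           # jump site -> most recent target opcode
    misses = 0
    for current, nxt in zip(trace, trace[1:]):
        site = "shared" if shared_dispatch else current
        if last_target.get(site) != nxt:
            misses += 1
        last_target[site] = nxt
    return misses

# Hypothetical loop body repeated 100 times (opcode names are illustrative).
trace = ["LOAD", "ADD", "STORE", "JUMP"] * 100

merged = mispredictions(trace, shared_dispatch=True)
separate = mispredictions(trace, shared_dispatch=False)
print(f"shared site: {merged} misses, per-opcode sites: {separate} misses")
```

In this model the shared site mispredicts on every transition because consecutive targets always differ, while per-opcode sites miss only during warm-up.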
@nelhage

nelhage commented Mar 6, 2025

Posting here for visibility:

I've been continuing to chase down the LLVM regression I identified in #129987. I've run benchmarks on both Intel Raptor Lake and Apple M1 hardware, comparing clang18, clang19, clang19+tailcalls, and clang19 with the regression worked around ("clang19.taildup" -- I'm using a -mllvm tunable).

In my environment, I find that the primary benefit of the tail-call interpreter comes from reversing the LLVM 19 regression; contrary to my earlier results, I find the regression ends up costing around 10% performance on both platforms(!).

I do see a 1-2% win, which is still impressive, although there are a number of sources of potential noise.

Here are my headline results:

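+----------------------+---------+--------------+-----------------+--------------+--------------+
| Platform             | clang18 | clang19      | clang19.taildup | clang19.tc   | gcc          |
+======================+=========+==============+=================+==============+==============+
| Raptor Lake i5-13500 | (ref)   | 1.09x slower | 1.01x faster    | 1.03x faster | 1.02x faster |
+----------------------+---------+--------------+-----------------+--------------+--------------+
| Apple M1 Macbook Air | (ref)   | 1.12x slower | 1.02x slower    | 1.00x slower | N/A          |
+----------------------+---------+--------------+-----------------+--------------+--------------+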

All builds use LTO and PGO.
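
Since every column is expressed relative to clang18, comparing two non-reference columns takes a little arithmetic. The sketch below assumes the labels follow the pyperf convention ("Nx slower" means the run took N times the reference time, "Nx faster" means the reference took N times as long); the helper names are hypothetical, and the inputs are the rounded Raptor Lake figures from the table above, so results are approximate.

```python
import re

def relative_time(label):
    """Convert a label such as '1.09x slower' to time relative to the reference."""
    if label.strip() in {"(ref)", "not significant"}:
        return 1.0
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)x (faster|slower)\s*", label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    factor = float(match.group(1))
    return factor if match.group(2) == "slower" else 1.0 / factor

def speedup(candidate, baseline):
    """How many times faster `candidate` is than `baseline` (>1 means faster)."""
    return relative_time(baseline) / relative_time(candidate)

# Raptor Lake row from the table above.
row = {"clang18": "(ref)", "clang19": "1.09x slower",
       "clang19.taildup": "1.01x faster", "clang19.tc": "1.03x faster"}

tc_vs_taildup = speedup(row["clang19.tc"], row["clang19.taildup"])
taildup_vs_clang19 = speedup(row["clang19.taildup"], row["clang19"])
print(f"tail calls vs. fixed computed gotos: {tc_vs_taildup:.3f}x")
print(f"fixed computed gotos vs. plain clang19: {taildup_vs_clang19:.3f}x")
```

Under these assumptions the tail-call build comes out roughly 2% ahead of the worked-around computed-goto build, while the workaround alone recovers about 10% over plain clang19.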

I've posted my benchmarking setup, including nix code that defines all of the Python builds I'm using, and some additional details. Benchmarking is very subtle and prone to many sources of confusion, so I'm very open to my results being misleading in some way; I'd love to see others reproduce the headline "clang18 vs clang19+goto vs clang19+tc" result!

I want to be clear that even if this is right, I still think this is great work, and expect that the tail-call interpreter is in many ways a more robust approach, and has additional headroom for optimization. I just happened to stumble on something that didn't quite make sense to me, and doggedly do my best to run it down…

@Fidget-Spinner
Member Author

Fidget-Spinner commented Mar 6, 2025

@nelhage thanks for all the investigation. I'm surprised that on modern hardware, computed gotos make a 10% difference. I was under the impression that modern literature suggests more of a 2-3% range. Perhaps the LLVM 19 bug is doing more than just tail cse?

In any case, I will try disabling computed gotos altogether and run it on the Faster CPython benchmarking machine. I also plan to put a notice in the What's New saying that the perf numbers are inaccurate due to the LLVM bug.

@Fidget-Spinner
Member Author

Fidget-Spinner commented Mar 6, 2025

@nelhage since our benchmarking infrastructure isn't as flexible as yours, would applying your original asm volatile patch be equivalent to fixing the tail duplicator on LLVM 19?

I could apply it in our regex engine too, as it's the only other place computed gotos are used.

I plan to bench it.

@nelhage

nelhage commented Mar 6, 2025

I'm also surprised! I would love someone to reproduce independently because I am concerned my setup has somehow made a systemic error I'm not seeing. I did so many benchmarks in part to try to ensure they all paint a consistent picture.

I think the clearest evidence I have is the "clang19.taildup" numbers. Those are generated by configuring using

./configure [other flags] \
  "OPT=-g -O3 -Wall -mllvm -tail-dup-pred-size=5000" \
  "LDFLAGS=-fuse-ld=lld -Wl,-mllvm -Wl,-tail-dup-pred-size=5000"

(nix config: https://github.com/nelhage/cpython-interp-perf/blob/fd51b8014c2bd2933ac46fbe8c30c906e86effd0/python.nix#L70-L73)

The -tail-dup-pred-size is a tunable introduced in LLVM 19 that sets a threshold for tail duplication. Thus, I'm pretty sure the only change between "clang19" and "clang19.taildup" is enabling tail duplication. However, it's possible that enabling tail duplication somehow enables later LLVM passes to do additional optimizations or transforms; I haven't compared the resulting assembly yet. (As an aside, note that since I build with LTO, it's the linker that actually does codegen, which is why I need the -Wl,-mllvm to make sure the linker sees the flag, not just the compiler. Passing it only in CFLAGS/OPT has no effect with LTO, as far as I'm aware.)

I see better speedup numbers for that flag than I do for my asm volatile patch; I'm not certain if that's because the -mllvm flag also speeds up the regex engine, because the asm volatile inhibits optimizations since Clang currently interprets it as a memory clobber, or some other reason. I think the -mllvm flag is a better experiment for a number of reasons, so I've mostly dropped the asm volatile hack in my current studies.

@Fidget-Spinner
Member Author

Fidget-Spinner commented Mar 6, 2025

@nelhage I think the closest comparison we have is the results on the Faster CPython M1 machine, which compare Apple Clang (which should be LLVM 17) with computed gotos against Clang-19 with tailcalls:

https://github.com/faster-cpython/benchmarking-public

(Look at the graph labelled "Effect of build with latest clang and tailcall vs Tier 1".)

Before our PGO bug that artificially boosted perf again, the perf gain for tailcalling was only 5%, versus the 15% reported in Clang 19 base.

So I believe the real speedup is in the 5% range, which corresponds roughly to your results. I will advocate to the team for updating the benchmarking results to use GCC and Xcode Clang 17 as the baseline, which means a 3-5% speedup, not a 10% speedup.

I will also edit all posts/comments/issues that I've made to warn users to take the numbers with a grain of salt, due to the LLVM bug.

I will ask for consensus from the team first.

@nelhage

nelhage commented Mar 6, 2025

Yep, my numbers seem broadly consistent with a 3-5% improvement.

I'm totally happy to let you and the team decide how much to update the messaging, and where. I've got a draft blog post I hope to release within a week or so, just because I find this interesting (and an interesting case study about how tricky benchmarking is!); I'll try to shoot you a draft before go-live to make sure it feels fair and accurate.

@Fidget-Spinner
Member Author

@nelhage could I trouble you to rerun the benchmarks with clang 20, with the patches for computed gotos applied, please? It was just released yesterday, and I'm wondering if the tail-call interpreter performs better with clang 20.

@chris-eibl
Contributor

FWIW, here again is the clang-cl data for the PGO Windows builds I did during #129907 from here https://gist.github.com/chris-eibl/114a42f22563956fdb5cd0335b28c7ae, but this time compared against 18.1.8.

64bit pyperformance results on my Windows 10 PC (i5-4570 CPU) run with --fast --affinity 0 for commit 9db1a29 with

  • Microsoft Visual Studio 2022 17.13.0 Preview 5.0 and different versions of clang
  • All builds are LTO + PGO
  • cg: computed gotos
  • tc: tail call
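+----------------+--------+--------------+--------------+--------------+--------------+---------------+
| Benchmark      | 18.1.8 | 18.1.8 cg    | 19.1.1       | 19.1.1 cg    | 19.1.1 tc    | 20.1.0.rc2 tc |
+================+========+==============+==============+==============+==============+===============+
| Geometric mean | (ref)  | 1.00x slower | 1.03x slower | 1.08x slower | 1.02x faster | 1.05x faster  |
+----------------+--------+--------------+--------------+--------------+--------------+---------------+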

I think this fits your findings so far:

  • For 18.1.8 cg is neutral
  • 19.1.1 without cg is already slightly slower than 18.1.8
  • cg results in clear performance loss for 19.1.1
  • tc vs cg on 19.1.1 shines so much because cg suffers there, but tc is still 2% faster than 18.1.8 cg
  • tc 20.1.0.rc2 is 5% faster than 18.1.8 cg

Big table

+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| Benchmark                        | clang.pgo.18.1.8.9db1a297d9 | clang.pgo.cg.18.1.8.9db1a297d9 | clang.pgo.9db1a297d9   | clang.pgo.cg.9db1a297d9 | clang.pgo.tc.9db1a297d9 | clang.pgo.tc.20.1.0.rc2.9db1a297d9 |
+==================================+=============================+================================+========================+=========================+=========================+====================================+
| 2to3                             | 424 ms                      | not significant                | not significant        | 444 ms: 1.05x slower    | 409 ms: 1.04x faster    | 398 ms: 1.07x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_generators                 | 501 ms                      | not significant                | 514 ms: 1.03x slower   | 524 ms: 1.05x slower    | 507 ms: 1.01x slower    | 469 ms: 1.07x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_none                  | 347 ms                      | not significant                | not significant        | 383 ms: 1.11x slower    | not significant         | 330 ms: 1.05x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_cpu_io_mixed          | 682 ms                      | not significant                | 697 ms: 1.02x slower   | 722 ms: 1.06x slower    | not significant         | 652 ms: 1.04x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_cpu_io_mixed_tg       | 640 ms                      | 657 ms: 1.03x slower           | 665 ms: 1.04x slower   | 692 ms: 1.08x slower    | 653 ms: 1.02x slower    | 630 ms: 1.01x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_eager                 | 128 ms                      | not significant                | 133 ms: 1.04x slower   | 153 ms: 1.20x slower    | not significant         | 121 ms: 1.05x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_eager_cpu_io_mixed    | 513 ms                      | not significant                | 535 ms: 1.04x slower   | 566 ms: 1.10x slower    | not significant         | 492 ms: 1.04x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_eager_cpu_io_mixed_tg | 620 ms                      | not significant                | 646 ms: 1.04x slower   | 675 ms: 1.09x slower    | not significant         | 596 ms: 1.04x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_eager_io              | 783 ms                      | 810 ms: 1.03x slower           | 817 ms: 1.04x slower   | 887 ms: 1.13x slower    | not significant         | 743 ms: 1.05x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_eager_io_tg           | 812 ms                      | not significant                | not significant        | 888 ms: 1.09x slower    | not significant         | 782 ms: 1.04x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_eager_memoization     | 264 ms                      | not significant                | 281 ms: 1.06x slower   | 312 ms: 1.18x slower    | not significant         | 246 ms: 1.07x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_eager_memoization_tg  | 384 ms                      | not significant                | 397 ms: 1.03x slower   | 431 ms: 1.12x slower    | not significant         | 363 ms: 1.06x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_eager_tg              | 285 ms                      | not significant                | 297 ms: 1.04x slower   | 321 ms: 1.13x slower    | not significant         | 275 ms: 1.03x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_io                    | 805 ms                      | not significant                | 824 ms: 1.02x slower   | 870 ms: 1.08x slower    | not significant         | 766 ms: 1.05x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_io_tg                 | 794 ms                      | not significant                | 807 ms: 1.02x slower   | 872 ms: 1.10x slower    | not significant         | 752 ms: 1.06x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_memoization           | 449 ms                      | not significant                | 458 ms: 1.02x slower   | 496 ms: 1.10x slower    | not significant         | 423 ms: 1.06x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_memoization_tg        | 417 ms                      | not significant                | 425 ms: 1.02x slower   | 459 ms: 1.10x slower    | not significant         | 396 ms: 1.05x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| async_tree_none_tg               | 341 ms                      | not significant                | 352 ms: 1.03x slower   | 378 ms: 1.11x slower    | not significant         | 329 ms: 1.04x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| asyncio_tcp                      | 1.53 sec                    | 1.61 sec: 1.05x slower         | 1.61 sec: 1.05x slower | 1.62 sec: 1.06x slower  | not significant         | 1.40 sec: 1.09x faster             |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| asyncio_tcp_ssl                  | 4.28 sec                    | not significant                | not significant        | not significant         | not significant         | 3.91 sec: 1.10x faster             |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| asyncio_websockets               | 718 ms                      | not significant                | 740 ms: 1.03x slower   | not significant         | not significant         | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| chaos                            | 66.8 ms                     | 69.3 ms: 1.04x slower          | 74.3 ms: 1.11x slower  | 76.2 ms: 1.14x slower   | 68.4 ms: 1.02x slower   | 65.4 ms: 1.02x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| comprehensions                   | 18.2 us                     | not significant                | 19.2 us: 1.05x slower  | 19.9 us: 1.09x slower   | not significant         | 17.6 us: 1.04x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| bench_mp_pool                    | 186 ms                      | not significant                | not significant        | not significant         | 174 ms: 1.07x faster    | 171 ms: 1.09x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| bench_thread_pool                | 1.65 ms                     | not significant                | not significant        | 1.68 ms: 1.02x slower   | 1.61 ms: 1.02x faster   | 1.57 ms: 1.05x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| coroutines                       | 25.3 ms                     | 26.4 ms: 1.04x slower          | 26.9 ms: 1.06x slower  | 29.0 ms: 1.14x slower   | 24.9 ms: 1.02x faster   | 24.9 ms: 1.02x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| coverage                         | 102 ms                      | 97.6 ms: 1.05x faster          | not significant        | 126 ms: 1.23x slower    | 96.5 ms: 1.06x faster   | 93.3 ms: 1.09x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| crypto_pyaes                     | 82.5 ms                     | 81.3 ms: 1.01x faster          | 86.3 ms: 1.05x slower  | 90.0 ms: 1.09x slower   | not significant         | 78.0 ms: 1.06x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| deepcopy                         | 306 us                      | 292 us: 1.05x faster           | not significant        | 329 us: 1.07x slower    | 295 us: 1.04x faster    | 289 us: 1.06x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| deepcopy_reduce                  | 3.34 us                     | 3.16 us: 1.06x faster          | 3.23 us: 1.04x faster  | not significant         | 3.10 us: 1.08x faster   | 3.08 us: 1.09x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| deepcopy_memo                    | 34.1 us                     | 32.0 us: 1.07x faster          | 34.8 us: 1.02x slower  | 37.2 us: 1.09x slower   | 33.3 us: 1.02x faster   | 33.2 us: 1.03x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| deltablue                        | 3.41 ms                     | not significant                | 3.80 ms: 1.11x slower  | 4.15 ms: 1.22x slower   | not significant         | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| django_template                  | 44.5 ms                     | 42.6 ms: 1.04x faster          | 42.1 ms: 1.06x faster  | 45.3 ms: 1.02x slower   | 40.3 ms: 1.10x faster   | 39.0 ms: 1.14x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| docutils                         | 3.22 sec                    | not significant                | 3.31 sec: 1.03x slower | 3.44 sec: 1.07x slower  | 3.16 sec: 1.02x faster  | 3.12 sec: 1.03x faster             |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| dulwich_log                      | 126 ms                      | not significant                | 131 ms: 1.04x slower   | 132 ms: 1.04x slower    | not significant         | 123 ms: 1.03x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| fannkuch                         | 482 ms                      | not significant                | 516 ms: 1.07x slower   | 527 ms: 1.09x slower    | 472 ms: 1.02x faster    | 456 ms: 1.06x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| float                            | 95.7 ms                     | 93.2 ms: 1.03x faster          | not significant        | 98.5 ms: 1.03x slower   | not significant         | 87.6 ms: 1.09x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| gc_traversal                     | 5.43 ms                     | not significant                | 5.71 ms: 1.05x slower  | not significant         | not significant         | 5.28 ms: 1.03x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| generators                       | 34.2 ms                     | 36.0 ms: 1.05x slower          | 36.0 ms: 1.05x slower  | 41.7 ms: 1.22x slower   | 33.3 ms: 1.03x faster   | 33.0 ms: 1.04x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| genshi_text                      | 25.0 ms                     | 25.3 ms: 1.01x slower          | 26.3 ms: 1.05x slower  | 28.3 ms: 1.13x slower   | not significant         | 24.2 ms: 1.03x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| genshi_xml                       | 59.6 ms                     | not significant                | 63.1 ms: 1.06x slower  | 68.3 ms: 1.15x slower   | not significant         | 57.2 ms: 1.04x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| go                               | 125 ms                      | not significant                | 132 ms: 1.06x slower   | 144 ms: 1.16x slower    | not significant         | 122 ms: 1.02x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| hexiom                           | 6.57 ms                     | not significant                | 7.11 ms: 1.08x slower  | 7.68 ms: 1.17x slower   | not significant         | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| html5lib                         | 69.2 ms                     | 72.1 ms: 1.04x slower          | 74.5 ms: 1.08x slower  | 76.9 ms: 1.11x slower   | 67.8 ms: 1.02x faster   | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| json_dumps                       | 12.4 ms                     | 13.3 ms: 1.07x slower          | 12.9 ms: 1.04x slower  | 13.7 ms: 1.11x slower   | not significant         | 11.5 ms: 1.08x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| json_loads                       | 32.2 us                     | not significant                | not significant        | not significant         | not significant         | 30.5 us: 1.05x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| logging_format                   | 13.3 us                     | not significant                | 13.6 us: 1.02x slower  | 14.3 us: 1.08x slower   | 12.0 us: 1.11x faster   | 11.6 us: 1.15x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| logging_silent                   | 103 ns                      | not significant                | 109 ns: 1.06x slower   | 118 ns: 1.14x slower    | 106 ns: 1.03x slower    | 101 ns: 1.02x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| logging_simple                   | 11.8 us                     | not significant                | 12.2 us: 1.04x slower  | 12.8 us: 1.09x slower   | 10.9 us: 1.08x faster   | 10.4 us: 1.13x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| mako                             | 14.3 ms                     | not significant                | not significant        | 15.5 ms: 1.08x slower   | not significant         | 13.2 ms: 1.08x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| mdp                              | 3.28 sec                    | not significant                | 3.37 sec: 1.03x slower | 3.40 sec: 1.04x slower  | not significant         | 2.91 sec: 1.13x faster             |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| meteor_contest                   | 113 ms                      | 115 ms: 1.02x slower           | 124 ms: 1.10x slower   | 126 ms: 1.12x slower    | 120 ms: 1.06x slower    | 117 ms: 1.04x slower               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| nbody                            | 119 ms                      | not significant                | 128 ms: 1.08x slower   | 139 ms: 1.18x slower    | not significant         | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| nqueens                          | 98.6 ms                     | not significant                | 103 ms: 1.05x slower   | 107 ms: 1.08x slower    | not significant         | 92.0 ms: 1.07x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pathlib                          | 252 ms                      | not significant                | 262 ms: 1.04x slower   | 255 ms: 1.01x slower    | 245 ms: 1.03x faster    | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pickle                           | 15.1 us                     | 15.6 us: 1.03x slower          | not significant        | 14.3 us: 1.06x faster   | 14.3 us: 1.05x faster   | 14.3 us: 1.06x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pickle_dict                      | 29.8 us                     | not significant                | 27.6 us: 1.08x faster  | 27.3 us: 1.09x faster   | 26.7 us: 1.12x faster   | 27.9 us: 1.07x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pickle_list                      | 5.17 us                     | not significant                | 5.05 us: 1.03x faster  | 5.01 us: 1.03x faster   | 4.96 us: 1.04x faster   | 4.94 us: 1.05x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pickle_pure_python               | 359 us                      | 373 us: 1.04x slower           | 378 us: 1.05x slower   | 412 us: 1.15x slower    | not significant         | 350 us: 1.03x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pidigits                         | 228 ms                      | not significant                | 240 ms: 1.05x slower   | 236 ms: 1.04x slower    | 234 ms: 1.03x slower    | 233 ms: 1.02x slower               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pprint_safe_repr                 | 892 ms                      | 877 ms: 1.02x faster           | 934 ms: 1.05x slower   | 986 ms: 1.11x slower    | not significant         | 812 ms: 1.10x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pprint_pformat                   | 1.81 sec                    | 1.79 sec: 1.01x faster         | 1.91 sec: 1.05x slower | 2.02 sec: 1.12x slower  | not significant         | 1.66 sec: 1.09x faster             |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| pyflate                          | 531 ms                      | 519 ms: 1.02x faster           | not significant        | 574 ms: 1.08x slower    | 522 ms: 1.02x faster    | 506 ms: 1.05x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| python_startup                   | 44.2 ms                     | not significant                | not significant        | not significant         | 43.0 ms: 1.03x faster   | 42.6 ms: 1.04x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| python_startup_no_site           | 36.5 ms                     | not significant                | not significant        | not significant         | not significant         | 35.5 ms: 1.03x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| raytrace                         | 303 ms                      | 307 ms: 1.01x slower           | 321 ms: 1.06x slower   | 330 ms: 1.09x slower    | 307 ms: 1.01x slower    | 294 ms: 1.03x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| regex_compile                    | 148 ms                      | 145 ms: 1.02x faster           | 157 ms: 1.06x slower   | 162 ms: 1.10x slower    | 141 ms: 1.05x faster    | 137 ms: 1.08x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| regex_dna                        | 220 ms                      | 209 ms: 1.05x faster           | 211 ms: 1.04x faster   | 207 ms: 1.06x faster    | 205 ms: 1.07x faster    | 207 ms: 1.06x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| regex_effbot                     | 3.36 ms                     | 3.61 ms: 1.07x slower          | not significant        | 3.52 ms: 1.05x slower   | 3.30 ms: 1.02x faster   | 3.20 ms: 1.05x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| regex_v8                         | 29.2 ms                     | 28.2 ms: 1.04x faster          | 29.8 ms: 1.02x slower  | not significant         | 28.7 ms: 1.02x faster   | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| richards                         | 46.9 ms                     | not significant                | 49.7 ms: 1.06x slower  | 56.8 ms: 1.21x slower   | 45.2 ms: 1.04x faster   | 45.9 ms: 1.02x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| richards_super                   | 54.6 ms                     | not significant                | 56.2 ms: 1.03x slower  | 62.3 ms: 1.14x slower   | 53.0 ms: 1.03x faster   | 52.2 ms: 1.05x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| scimark_fft                      | 347 ms                      | 330 ms: 1.05x faster           | 358 ms: 1.03x slower   | 383 ms: 1.10x slower    | 337 ms: 1.03x faster    | 315 ms: 1.10x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| scimark_lu                       | 128 ms                      | 125 ms: 1.02x faster           | 132 ms: 1.04x slower   | 134 ms: 1.05x slower    | 121 ms: 1.05x faster    | 119 ms: 1.07x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| scimark_monte_carlo              | 71.9 ms                     | 69.4 ms: 1.04x faster          | 74.6 ms: 1.04x slower  | 81.7 ms: 1.14x slower   | 69.6 ms: 1.03x faster   | 69.6 ms: 1.03x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| scimark_sor                      | 129 ms                      | not significant                | 151 ms: 1.17x slower   | 154 ms: 1.19x slower    | not significant         | 126 ms: 1.03x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| scimark_sparse_mat_mult          | 4.85 ms                     | 4.63 ms: 1.05x faster          | 5.01 ms: 1.03x slower  | 5.02 ms: 1.04x slower   | not significant         | 4.64 ms: 1.04x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| spectral_norm                    | 101 ms                      | 103 ms: 1.02x slower           | 110 ms: 1.08x slower   | 114 ms: 1.13x slower    | not significant         | 99.4 ms: 1.02x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sqlglot_normalize                | 132 ms                      | 128 ms: 1.03x faster           | not significant        | 135 ms: 1.02x slower    | 126 ms: 1.05x faster    | 119 ms: 1.11x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sqlglot_optimize                 | 65.8 ms                     | 64.5 ms: 1.02x faster          | not significant        | 68.5 ms: 1.04x slower   | 63.4 ms: 1.04x faster   | 60.0 ms: 1.10x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sqlglot_parse                    | 1.43 ms                     | 1.41 ms: 1.02x faster          | 1.51 ms: 1.06x slower  | 1.60 ms: 1.12x slower   | 1.38 ms: 1.04x faster   | 1.35 ms: 1.06x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sqlglot_transpile                | 1.77 ms                     | not significant                | 1.85 ms: 1.05x slower  | 1.95 ms: 1.10x slower   | 1.71 ms: 1.04x faster   | 1.70 ms: 1.04x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sqlite_synth                     | 3.76 us                     | 3.55 us: 1.06x faster          | 3.44 us: 1.09x faster  | 3.39 us: 1.11x faster   | 3.31 us: 1.14x faster   | 3.26 us: 1.15x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sympy_expand                     | 559 ms                      | 568 ms: 1.01x slower           | 578 ms: 1.03x slower   | 598 ms: 1.07x slower    | 542 ms: 1.03x faster    | 518 ms: 1.08x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sympy_integrate                  | 23.9 ms                     | not significant                | not significant        | 25.9 ms: 1.08x slower   | 23.2 ms: 1.03x faster   | 22.4 ms: 1.06x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sympy_sum                        | 190 ms                      | not significant                | 199 ms: 1.05x slower   | 205 ms: 1.08x slower    | 184 ms: 1.03x faster    | 181 ms: 1.05x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| sympy_str                        | 334 ms                      | not significant                | 344 ms: 1.03x slower   | 356 ms: 1.07x slower    | 323 ms: 1.03x faster    | 315 ms: 1.06x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| telco                            | 9.17 ms                     | 9.50 ms: 1.04x slower          | 9.37 ms: 1.02x slower  | 9.43 ms: 1.03x slower   | not significant         | 8.70 ms: 1.05x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| tomli_loads                      | 2.19 sec                    | not significant                | 2.38 sec: 1.09x slower | 2.53 sec: 1.16x slower  | not significant         | 2.13 sec: 1.03x faster             |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| typing_runtime_protocols         | 188 us                      | 193 us: 1.02x slower           | 193 us: 1.02x slower   | 201 us: 1.07x slower    | not significant         | 175 us: 1.07x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| unpack_sequence                  | 56.8 ns                     | not significant                | 59.3 ns: 1.04x slower  | 64.5 ns: 1.14x slower   | not significant         | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| unpickle                         | 17.6 us                     | 17.9 us: 1.02x slower          | 17.9 us: 1.02x slower  | 17.0 us: 1.04x faster   | 17.3 us: 1.02x faster   | 16.9 us: 1.04x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| unpickle_list                    | 5.17 us                     | 5.39 us: 1.04x slower          | 5.38 us: 1.04x slower  | not significant         | not significant         | not significant                    |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| unpickle_pure_python             | 247 us                      | 252 us: 1.02x slower           | 257 us: 1.04x slower   | 279 us: 1.13x slower    | 236 us: 1.05x faster    | 235 us: 1.05x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| xml_etree_parse                  | 211 ms                      | not significant                | not significant        | 219 ms: 1.04x slower    | not significant         | 202 ms: 1.04x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| xml_etree_generate               | 120 ms                      | not significant                | not significant        | not significant         | 113 ms: 1.06x faster    | 109 ms: 1.10x faster               |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| xml_etree_process                | 81.3 ms                     | 82.2 ms: 1.01x slower          | not significant        | 84.5 ms: 1.04x slower   | 76.8 ms: 1.06x faster   | 73.6 ms: 1.10x faster              |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
| Geometric mean                   | (ref)                       | 1.00x slower                   | 1.03x slower           | 1.08x slower            | 1.02x faster            | 1.05x faster                       |
+----------------------------------+-----------------------------+--------------------------------+------------------------+-------------------------+-------------------------+------------------------------------+
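
Each non-reference cell in the table above pairs a mean time with its ratio to the first column. The sketch below shows one way such a label can be derived from two means. The helper name `describe` is hypothetical and is not part of pyperf. The inputs are the rounded values displayed in the `generators` row, so a recomputed ratio can differ in the last digit from what the tool prints from unrounded means.

```python
def describe(ref_mean: float, changed_mean: float) -> str:
    """Label changed_mean relative to ref_mean (both in the same unit)."""
    if ref_mean <= 0 or changed_mean <= 0:
        raise ValueError("means must be positive")
    if changed_mean < ref_mean:
        # Smaller time than the reference: report how many times faster.
        return f"{ref_mean / changed_mean:.2f}x faster"
    # Larger (or equal) time: report how many times slower.
    return f"{changed_mean / ref_mean:.2f}x slower"


# 'generators' row: reference 34.2 ms, second column 36.0 ms, last column 33.0 ms.
print(describe(34.2, 36.0))  # 1.05x slower
print(describe(34.2, 33.0))  # 1.04x faster
```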

Comparing only clang.pgo.cg.18.1.8.9db1a297d9 against clang.pgo.tc.20.1.0.rc2.9db1a297d9, using --group-by-speed:


+----------------------------------+--------------------------------+------------------------------------+
| Benchmark                        | clang.pgo.cg.18.1.8.9db1a297d9 | clang.pgo.tc.20.1.0.rc2.9db1a297d9 |
+==================================+================================+====================================+
| async_tree_eager_io              | 810 ms                         | 743 ms: 1.09x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_memoization     | 268 ms                         | 246 ms: 1.09x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_memoization           | 455 ms                         | 423 ms: 1.08x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_io_tg           | 837 ms                         | 782 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager                 | 130 ms                         | 121 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_io_tg                 | 803 ms                         | 752 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_memoization_tg  | 387 ms                         | 363 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_io                    | 815 ms                         | 766 ms: 1.06x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_none                  | 347 ms                         | 330 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_cpu_io_mixed_tg | 627 ms                         | 596 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_memoization_tg        | 416 ms                         | 396 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_cpu_io_mixed    | 516 ms                         | 492 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_cpu_io_mixed_tg       | 657 ms                         | 630 ms: 1.04x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_none_tg               | 343 ms                         | 329 ms: 1.04x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_cpu_io_mixed          | 677 ms                         | 652 ms: 1.04x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_tg              | 285 ms                         | 275 ms: 1.03x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| Geometric mean                   | (ref)                          | 1.06x faster                       |
+----------------------------------+--------------------------------+------------------------------------+

Benchmarks with tag 'math':
===========================

+----------------+--------------------------------+------------------------------------+
| Benchmark      | clang.pgo.cg.18.1.8.9db1a297d9 | clang.pgo.tc.20.1.0.rc2.9db1a297d9 |
+================+================================+====================================+
| float          | 93.2 ms                        | 87.6 ms: 1.06x faster              |
+----------------+--------------------------------+------------------------------------+
| pidigits       | 228 ms                         | 233 ms: 1.02x slower               |
+----------------+--------------------------------+------------------------------------+
| Geometric mean | (ref)                          | 1.01x faster                       |
+----------------+--------------------------------+------------------------------------+

Benchmark hidden because not significant (1): nbody
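
A note on the 'math' geometric mean: the two visible rows alone give roughly 1.02x, not the 1.01x printed. One reading, which is an assumption and not something this output confirms, is that the hidden `nbody` row still counts toward the mean with a ratio close to 1.0. The sketch below checks that arithmetic using the rounded values shown in the table; the `1.0` for `nbody` is a placeholder, not a measured value.

```python
from statistics import geometric_mean

# Speedup = reference time / new time, from the displayed (rounded) means.
float_speedup = 93.2 / 87.6      # float: 1.06x faster
pidigits_speedup = 228 / 233     # pidigits: 1.02x slower, i.e. speedup < 1

# Geometric mean over the two visible rows only.
visible_only = geometric_mean([float_speedup, pidigits_speedup])

# ASSUMPTION: nbody (hidden as not significant) contributes a ratio of ~1.0.
with_hidden = geometric_mean([float_speedup, pidigits_speedup, 1.0])

print(f"visible rows only: {visible_only:.2f}x")
print(f"with hidden nbody at 1.0: {with_hidden:.2f}x")
```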

Benchmarks with tag 'regex':
============================

+----------------+--------------------------------+------------------------------------+
| Benchmark      | clang.pgo.cg.18.1.8.9db1a297d9 | clang.pgo.tc.20.1.0.rc2.9db1a297d9 |
+================+================================+====================================+
| regex_effbot   | 3.61 ms                        | 3.20 ms: 1.13x faster              |
+----------------+--------------------------------+------------------------------------+
| regex_compile  | 145 ms                         | 137 ms: 1.06x faster               |
+----------------+--------------------------------+------------------------------------+
| regex_v8       | 28.2 ms                        | 28.8 ms: 1.02x slower              |
+----------------+--------------------------------+------------------------------------+
| Geometric mean | (ref)                          | 1.04x faster                       |
+----------------+--------------------------------+------------------------------------+

Benchmark hidden because not significant (1): regex_dna

Benchmarks with tag 'serialize':
================================

+----------------------+--------------------------------+------------------------------------+
| Benchmark            | clang.pgo.cg.18.1.8.9db1a297d9 | clang.pgo.tc.20.1.0.rc2.9db1a297d9 |
+======================+================================+====================================+
| json_dumps           | 13.3 ms                        | 11.5 ms: 1.15x faster              |
+----------------------+--------------------------------+------------------------------------+
| xml_etree_process    | 82.2 ms                        | 73.6 ms: 1.12x faster              |
+----------------------+--------------------------------+------------------------------------+
| xml_etree_generate   | 121 ms                         | 109 ms: 1.12x faster               |
+----------------------+--------------------------------+------------------------------------+
| pickle               | 15.6 us                        | 14.3 us: 1.09x faster              |
+----------------------+--------------------------------+------------------------------------+
| unpickle_pure_python | 252 us                         | 235 us: 1.07x faster               |
+----------------------+--------------------------------+------------------------------------+
| pickle_list          | 5.28 us                        | 4.94 us: 1.07x faster              |
+----------------------+--------------------------------+------------------------------------+
| pickle_dict          | 29.8 us                        | 27.9 us: 1.07x faster              |
+----------------------+--------------------------------+------------------------------------+
| pickle_pure_python   | 373 us                         | 350 us: 1.06x faster               |
+----------------------+--------------------------------+------------------------------------+
| unpickle_list        | 5.39 us                        | 5.08 us: 1.06x faster              |
+----------------------+--------------------------------+------------------------------------+
| unpickle             | 17.9 us                        | 16.9 us: 1.06x faster              |
+----------------------+--------------------------------+------------------------------------+
| json_loads           | 31.9 us                        | 30.5 us: 1.04x faster              |
+----------------------+--------------------------------+------------------------------------+
| xml_etree_parse      | 209 ms                         | 202 ms: 1.03x faster               |
+----------------------+--------------------------------+------------------------------------+
| tomli_loads          | 2.18 sec                       | 2.13 sec: 1.03x faster             |
+----------------------+--------------------------------+------------------------------------+
| Geometric mean       | (ref)                          | 1.07x faster                       |
+----------------------+--------------------------------+------------------------------------+

Benchmark hidden because not significant (1): xml_etree_iterparse

Benchmarks with tag 'startup':
==============================

+------------------------+--------------------------------+------------------------------------+
| Benchmark              | clang.pgo.cg.18.1.8.9db1a297d9 | clang.pgo.tc.20.1.0.rc2.9db1a297d9 |
+========================+================================+====================================+
| python_startup         | 43.9 ms                        | 42.6 ms: 1.03x faster              |
+------------------------+--------------------------------+------------------------------------+
| python_startup_no_site | 36.4 ms                        | 35.5 ms: 1.03x faster              |
+------------------------+--------------------------------+------------------------------------+
| Geometric mean         | (ref)                          | 1.03x faster                       |
+------------------------+--------------------------------+------------------------------------+

Benchmarks with tag 'template':
===============================

+-----------------+--------------------------------+------------------------------------+
| Benchmark       | clang.pgo.cg.18.1.8.9db1a297d9 | clang.pgo.tc.20.1.0.rc2.9db1a297d9 |
+=================+================================+====================================+
| django_template | 42.6 ms                        | 39.0 ms: 1.09x faster              |
+-----------------+--------------------------------+------------------------------------+
| mako            | 14.3 ms                        | 13.2 ms: 1.08x faster              |
+-----------------+--------------------------------+------------------------------------+
| genshi_xml      | 60.4 ms                        | 57.2 ms: 1.06x faster              |
+-----------------+--------------------------------+------------------------------------+
| genshi_text     | 25.3 ms                        | 24.2 ms: 1.04x faster              |
+-----------------+--------------------------------+------------------------------------+
| Geometric mean  | (ref)                          | 1.07x faster                       |
+-----------------+--------------------------------+------------------------------------+
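
As a sanity check on the "Geometric mean" row, the figure can be roughly reproduced from the rounded timings in the 'template' table above. This is a minimal sketch, not how pyperf itself computes the value: pyperf works from unrounded measurements, so the last digit may differ in other tables. The names `template_timings_ms` and `geometric_mean_speedup` are made up for illustration.

```python
from statistics import geometric_mean

# (baseline, tail-call) timings in ms, copied from the 'template' table above.
# These are the rounded values as printed, not pyperf's raw data.
template_timings_ms = {
    "django_template": (42.6, 39.0),
    "mako": (14.3, 13.2),
    "genshi_xml": (60.4, 57.2),
    "genshi_text": (25.3, 24.2),
}


def geometric_mean_speedup(timings):
    """Geometric mean of the per-benchmark baseline/new time ratios."""
    ratios = [baseline / new for baseline, new in timings.values()]
    return geometric_mean(ratios)


speedup = geometric_mean_speedup(template_timings_ms)
print(f"{speedup:.2f}x faster")
```

The geometric mean is used rather than the arithmetic mean because the inputs are ratios: a 2x speedup and a 2x slowdown then cancel out to 1.00x.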

All benchmarks:
===============

+----------------------------------+--------------------------------+------------------------------------+
| Benchmark                        | clang.pgo.cg.18.1.8.9db1a297d9 | clang.pgo.tc.20.1.0.rc2.9db1a297d9 |
+==================================+================================+====================================+
| json_dumps                       | 13.3 ms                        | 11.5 ms: 1.15x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| asyncio_tcp                      | 1.61 sec                       | 1.40 sec: 1.15x faster             |
+----------------------------------+--------------------------------+------------------------------------+
| logging_format                   | 13.1 us                        | 11.6 us: 1.14x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| regex_effbot                     | 3.61 ms                        | 3.20 ms: 1.13x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| mdp                              | 3.28 sec                       | 2.91 sec: 1.13x faster             |
+----------------------------------+--------------------------------+------------------------------------+
| xml_etree_process                | 82.2 ms                        | 73.6 ms: 1.12x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| xml_etree_generate               | 121 ms                         | 109 ms: 1.12x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| logging_simple                   | 11.5 us                        | 10.4 us: 1.10x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| typing_runtime_protocols         | 193 us                         | 175 us: 1.10x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| asyncio_tcp_ssl                  | 4.30 sec                       | 3.91 sec: 1.10x faster             |
+----------------------------------+--------------------------------+------------------------------------+
| sympy_expand                     | 568 ms                         | 518 ms: 1.10x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| telco                            | 9.50 ms                        | 8.70 ms: 1.09x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| django_template                  | 42.6 ms                        | 39.0 ms: 1.09x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_io              | 810 ms                         | 743 ms: 1.09x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| generators                       | 36.0 ms                        | 33.0 ms: 1.09x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| pickle                           | 15.6 us                        | 14.3 us: 1.09x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_memoization     | 268 ms                         | 246 ms: 1.09x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| sqlite_synth                     | 3.55 us                        | 3.26 us: 1.09x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| mako                             | 14.3 ms                        | 13.2 ms: 1.08x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| pprint_safe_repr                 | 877 ms                         | 812 ms: 1.08x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_generators                 | 506 ms                         | 469 ms: 1.08x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| pprint_pformat                   | 1.79 sec                       | 1.66 sec: 1.08x faster             |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_memoization           | 455 ms                         | 423 ms: 1.08x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| sympy_integrate                  | 24.1 ms                        | 22.4 ms: 1.08x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| sqlglot_optimize                 | 64.5 ms                        | 60.0 ms: 1.07x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_io_tg           | 837 ms                         | 782 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| sqlglot_normalize                | 128 ms                         | 119 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| unpickle_pure_python             | 252 us                         | 235 us: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| pickle_list                      | 5.28 us                        | 4.94 us: 1.07x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager                 | 130 ms                         | 121 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| sympy_str                        | 336 ms                         | 315 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_io_tg                 | 803 ms                         | 752 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| nqueens                          | 98.2 ms                        | 92.0 ms: 1.07x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| pickle_dict                      | 29.8 us                        | 27.9 us: 1.07x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_memoization_tg  | 387 ms                         | 363 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| bench_mp_pool                    | 182 ms                         | 171 ms: 1.07x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| pickle_pure_python               | 373 us                         | 350 us: 1.06x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| float                            | 93.2 ms                        | 87.6 ms: 1.06x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_io                    | 815 ms                         | 766 ms: 1.06x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| regex_compile                    | 145 ms                         | 137 ms: 1.06x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| unpickle_list                    | 5.39 us                        | 5.08 us: 1.06x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| coroutines                       | 26.4 ms                        | 24.9 ms: 1.06x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| chaos                            | 69.3 ms                        | 65.4 ms: 1.06x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| 2to3                             | 422 ms                         | 398 ms: 1.06x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| unpickle                         | 17.9 us                        | 16.9 us: 1.06x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| sympy_sum                        | 192 ms                         | 181 ms: 1.06x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| bench_thread_pool                | 1.66 ms                        | 1.57 ms: 1.06x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| genshi_xml                       | 60.4 ms                        | 57.2 ms: 1.06x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| fannkuch                         | 481 ms                         | 456 ms: 1.06x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| scimark_lu                       | 125 ms                         | 119 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_none                  | 347 ms                         | 330 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_cpu_io_mixed_tg | 627 ms                         | 596 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| html5lib                         | 72.1 ms                        | 68.6 ms: 1.05x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| richards_super                   | 54.9 ms                        | 52.2 ms: 1.05x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_memoization_tg        | 416 ms                         | 396 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_cpu_io_mixed    | 516 ms                         | 492 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| scimark_fft                      | 330 ms                         | 315 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| coverage                         | 97.6 ms                        | 93.3 ms: 1.05x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| raytrace                         | 307 ms                         | 294 ms: 1.05x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| genshi_text                      | 25.3 ms                        | 24.2 ms: 1.04x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| json_loads                       | 31.9 us                        | 30.5 us: 1.04x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_cpu_io_mixed_tg       | 657 ms                         | 630 ms: 1.04x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| crypto_pyaes                     | 81.3 ms                        | 78.0 ms: 1.04x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_none_tg               | 343 ms                         | 329 ms: 1.04x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| sqlglot_parse                    | 1.41 ms                        | 1.35 ms: 1.04x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_cpu_io_mixed          | 677 ms                         | 652 ms: 1.04x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| spectral_norm                    | 103 ms                         | 99.4 ms: 1.04x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| docutils                         | 3.22 sec                       | 3.12 sec: 1.03x faster             |
+----------------------------------+--------------------------------+------------------------------------+
| xml_etree_parse                  | 209 ms                         | 202 ms: 1.03x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| dulwich_log                      | 127 ms                         | 123 ms: 1.03x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| richards                         | 47.5 ms                        | 45.9 ms: 1.03x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| async_tree_eager_tg              | 285 ms                         | 275 ms: 1.03x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| sqlglot_transpile                | 1.75 ms                        | 1.70 ms: 1.03x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| python_startup                   | 43.9 ms                        | 42.6 ms: 1.03x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| comprehensions                   | 18.1 us                        | 17.6 us: 1.03x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| tomli_loads                      | 2.18 sec                       | 2.13 sec: 1.03x faster             |
+----------------------------------+--------------------------------+------------------------------------+
| python_startup_no_site           | 36.4 ms                        | 35.5 ms: 1.03x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| deepcopy_reduce                  | 3.16 us                        | 3.08 us: 1.03x faster              |
+----------------------------------+--------------------------------+------------------------------------+
| pyflate                          | 519 ms                         | 506 ms: 1.02x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| scimark_sor                      | 129 ms                         | 126 ms: 1.02x faster               |
+----------------------------------+--------------------------------+------------------------------------+
| meteor_contest                   | 115 ms                         | 117 ms: 1.01x slower               |
+----------------------------------+--------------------------------+------------------------------------+
| regex_v8                         | 28.2 ms                        | 28.8 ms: 1.02x slower              |
+----------------------------------+--------------------------------+------------------------------------+
| pidigits                         | 228 ms                         | 233 ms: 1.02x slower               |
+----------------------------------+--------------------------------+------------------------------------+
| deepcopy_memo                    | 32.0 us                        | 33.2 us: 1.04x slower              |
+----------------------------------+--------------------------------+------------------------------------+
| Geometric mean                   | (ref)                          | 1.05x faster                       |
+----------------------------------+--------------------------------+------------------------------------+

nelhage commented Mar 6, 2025

> could I trouble you to rerun the benchmarks with clang 20 please and with the patches for computed gotos applied please

I tested with a pre-release clang 20 rc and saw comparable results to clang 19 (for both computed goto and tail-call). I'll see if I can test the 20.1 release once it lands in nixpkgs.

llvm/llvm-project#114990 is the PR that fixes the LLVM regression and it hasn't landed yet. I did test clang19 + that patch, and saw comparable performance to the -mllvm tunable.

nelhage commented Mar 6, 2025

@chris-eibl Amazing, thanks for reposting/analyzing those. I agree those seem consistent with my results.

Interesting that the bug makes computed gotos slower than the switch-based dispatch. I wonder if some component is making codegen or regalloc decisions which are a good idea iff tail-duplication happens. I'll run some benchmarks of my own without computed goto, for comparison.

nelhage commented Mar 7, 2025

Okay, this is pretty entertaining to me. It appears -- at least in my environment -- that clang18 is also able to tail-duplicate the dispatch even without --with-computed-gotos, and thus generates code with near-identical performance with and without the option.

$ objdump -S --disassemble=_PyEval_EvalFrameDefault ${clang18nocg}/bin/python3.14 | egrep -c 'jmp\s*\*'
306

I wonder how long that's been true; it certainly complicates experiments to try to demonstrate the performance advantage of duplicating the dispatch, if the compiler will sometimes do it for you anyway.

clang 19 fails to tail-duplicate, just like it does with computed goto:

$ objdump -S --disassemble=_PyEval_EvalFrameDefault ${clang19nocg}/bin/python3.14 | egrep -c 'jmp\s*\*'
3
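
For anyone who wants to script this check, the `egrep -c 'jmp\s*\*'` count can be sketched in Python as below. It counts disassembly lines that contain an indirect jump, which is what the two numbers above (306 and 3) measure. The function name `count_indirect_jumps` is made up for illustration, and the third sample line (a direct jump) is hypothetical, added only to show that it is not counted.

```python
import re

# Same pattern as the egrep invocation: "jmp", optional whitespace, then "*",
# which is how AT&T syntax marks an indirect jump target.
INDIRECT_JMP = re.compile(r"jmp\s*\*")


def count_indirect_jumps(disassembly: str) -> int:
    """Count lines containing an indirect jmp, like `egrep -c`."""
    return sum(
        1 for line in disassembly.splitlines() if INDIRECT_JMP.search(line)
    )


# First two lines are taken from the listings in this thread; the last one
# is a hypothetical direct jump that must not be counted.
sample = """\
  21ac8f:       42 ff 24 f0             jmp    *(%rax,%r14,8)
  211bea:       ff e0                   jmp    *%rax
  211bf0:       eb 0e                   jmp    211c00
"""

print(count_indirect_jumps(sample))
```

A count in the hundreds suggests the dispatch jump was duplicated into most opcode handlers; a count of a handful suggests the handlers share one merged dispatch block.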

That does leave me confused why @chris-eibl saw much better performance with clang19 without computed gotos.

I'm running that benchmark on my own environment to replicate, but at a glance I notice that clang19 without computed gotos manages to tail-duplicate more of the dispatch logic, and maybe that helps the pipeline out somehow. e.g.:

Merged tail on clang19, with computed gotos:

  21ac69:       0f b7 03                movzwl (%rbx),%eax
  21ac6c:       44 0f b6 f0             movzbl %al,%r14d
  21ac70:       41 89 c3                mov    %eax,%r11d
  21ac73:       41 c1 eb 08             shr    $0x8,%r11d
  21ac77:       49 89 df                mov    %rbx,%r15
  21ac7a:       48 8b 9d c0 7e ff ff    mov    -0x8140(%rbp),%rbx
  21ac81:       48 89 9d c0 7e ff ff    mov    %rbx,-0x8140(%rbp)
  21ac88:       48 8d 05 51 6d 3e 00    lea    0x3e6d51(%rip),%rax        # 6019e0 <_PyEval_EvalFrameDefault.opcode_targets>
  21ac8f:       42 ff 24 f0             jmp    *(%rax,%r14,8)

Merged tail on clang19, without computed gotos:

  211be0:       0f b6 c3                movzbl %bl,%eax
  211be3:       49 63 04 82             movslq (%r10,%rax,4),%rax
  211be7:       4c 01 d0                add    %r10,%rax
  211bea:       ff e0                   jmp    *%rax
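
To make the difference between the two listings concrete, here is a rough Python model of how each one turns an opcode into a jump target. It is a sketch of the addressing arithmetic only: the computed-goto version indexes a table of 8-byte absolute addresses (`opcode_targets`), while the switch version indexes a table of 4-byte signed offsets and adds the table base. The function names and the table contents are invented for illustration.

```python
import struct


def computed_goto_target(targets: bytes, opcode: int) -> int:
    """Model of: lea opcode_targets(%rip),%rax ; jmp *(%rax,%r14,8)."""
    index = opcode & 0xFF  # movzbl %al,%r14d
    # Each table entry is an 8-byte absolute handler address.
    (address,) = struct.unpack_from("<Q", targets, index * 8)
    return address


def switch_target(table_base: int, table: bytes, opcode: int) -> int:
    """Model of: movslq (%r10,%rax,4),%rax ; add %r10,%rax ; jmp *%rax."""
    index = opcode & 0xFF  # movzbl %bl,%eax
    # Each table entry is a 4-byte signed offset relative to the table base.
    (offset,) = struct.unpack_from("<i", table, index * 4)
    return table_base + offset


# Hypothetical two-entry tables, just to exercise both lookups.
absolute_table = struct.pack("<QQ", 0x401000, 0x401040)
relative_table = struct.pack("<ii", -16, 32)

print(hex(computed_goto_target(absolute_table, 1)))
print(hex(switch_target(0x1000, relative_table, 0)))
```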

nelhage commented Mar 7, 2025

> I'm running that benchmark on my own environment to replicate

It replicates on my machine, more-or-less:

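+----------------+---------+--------------+--------------+--------------+
| Benchmark      | clang18 | clang19      | clang19.nocg | clang19.tc   |
+================+=========+==============+==============+==============+
| Geometric mean | (ref)   | 1.09x slower | 1.02x slower | 1.03x faster |
+----------------+---------+--------------+--------------+--------------+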

So clang19 must be doing something pathological to the computed gotos, beyond just the failure to tail-duplicate.
