Fix cache and prediction misses in Target::qargs #16476

Merged
mtreinish merged 1 commit into Qiskit:main from jakelishman:vf2/target-regression
Jun 23, 2026
Conversation

@jakelishman jakelishman commented Jun 23, 2026

Member

Fix cache and prediction misses in Target::qargs

In 2.5.0rc1 we noticed a significant slowdown in VF2-dominated all-to-all connectivity transpilation benchmarks, which this fixes.

Background

We recently changed the internal hash-map data structures in the Target to consolidate various properties, and avoid IndexSet tracking overhead in the qargs tracking[^1]; it wasn't generally needed for determinism. However, randomising the iteration order of the Qargs means that graphs constructed from them (like Target::coupling_graph or VF2's custom graph build) add their edges in random orders. Edge and neighbour search/iteration methods on graphs involve following a linked-list-like edge list, which is highly susceptible to cache and branch-prediction problems; it's far faster if these accesses are predictable. For our purposes here with all-to-all targets, it's the cache properties that matter, and the branch-prediction is about the same.
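
To make the locality argument concrete, here is a minimal Python sketch of a linked-list-like edge list. It is a simplified toy model written for illustration, not the graph implementation that Qiskit or VF2 actually uses, and `build_edge_list` and `mean_jump` are hypothetical helper names. It measures how far apart, in one flat edge array, the consecutive edges of a node's chain sit when edges are added in a structured order versus a shuffled one.

```python
import random
from itertools import permutations


def build_edge_list(num_nodes, edges):
    """Store edges in one flat array; each node heads a singly linked chain through it."""
    head = [-1] * num_nodes   # index of the most recently added out-edge of each node
    next_edge = []            # next_edge[i]: index of the next edge in the same chain
    for source, _target in edges:
        next_edge.append(head[source])
        head[source] = len(next_edge) - 1
    return head, next_edge


def mean_jump(head, next_edge):
    """Average distance in the flat array between consecutive edges of a chain."""
    total = count = 0
    for start in head:
        current = start
        while current != -1 and next_edge[current] != -1:
            total += abs(current - next_edge[current])
            count += 1
            current = next_edge[current]
    return total / count


num_nodes = 50
# All-to-all directed edges in a structured order: (0, 1), (0, 2), ..., (49, 48).
ordered = list(permutations(range(num_nodes), 2))
# The same edges in an arbitrary order, standing in for a randomised hash map.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

ordered_jump = mean_jump(*build_edge_list(num_nodes, ordered))
shuffled_jump = mean_jump(*build_edge_list(num_nodes, shuffled))
print(f"ordered:  {ordered_jump:.1f}")
print(f"shuffled: {shuffled_jump:.1f}")
```

In this model, structured insertion keeps each node's edges adjacent in the array (a jump of 1), while shuffled insertion scatters them across the whole array, which is the kind of access pattern that tends to defeat the cache.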

Swapping the qargs_gate_map back to IndexMap does not itself enforce structure in the edge list, but in practice, a Target will be constructed programmatically, and there will be some logical structure in the construction. IndexMap preserves this, whereas randomisation is almost guaranteed to be worse. We could attempt to optimise the edge list, but sorting arbitrary lists would likely have worse overhead and not significantly improve on most normal constructions.
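
As a rough Python analogy (an assumption for illustration only; the real data structures are Rust types), a `dict` behaves like an insertion-ordered map such as IndexMap, while a `set` behaves like a plain hash container whose iteration order is unrelated to the order of insertion:

```python
# Pairs added in a structured order, as a programmatic Target construction might add them.
pairs = [(a, b) for a in range(4) for b in range(4) if a != b]

ordered = list(dict.fromkeys(pairs))   # dict iterates in insertion order
unordered = list(set(pairs))           # set order depends on hashing, not on insertion

assert ordered == pairs                # the construction's structure is preserved
assert sorted(unordered) == pairs      # same contents, but no ordering guarantee
```

The ordered container adds no structure of its own; it only keeps whatever structure the caller's construction already had.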

Timing

Using the wstate_n380.qasm file from QASMBench[^2], we have the following minimised benchmark reproducing the problem:

from qiskit.circuit import QuantumCircuit
from qiskit.transpiler import (
    generate_preset_pass_manager,
    CouplingMap,
    passes,
)
from qiskit.providers.fake_provider import GenericBackendV2

cmap = CouplingMap.from_full(380)
backend = GenericBackendV2(
    cmap.size(),
    coupling_map=cmap,
    basis_gates=["id", "sx", "x", "rz", "cz"],
    seed=42,
)
dag = QuantumCircuit.from_qasm_file("wstate_n380.qasm").to_dag()
pm = generate_preset_pass_manager(backend, seed_transpiler=42)
pass_ = passes.VF2Layout(
    coupling_map=cmap,
    seed=-1,
    call_limit=(5_000_000, 10_000),
    target=backend.target,
)

We time pass_.run(dag). On 2.4.2, this takes about 250 ms on my machine. On 2.5.0rc1, it is about 550 ms. This commit restores the timing to durations statistically compatible with 2.4.2.
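
The description does not say how the timing was taken. One possible harness, assuming plain wall-clock timing with `time.perf_counter`, is sketched below; `time_runs` is a hypothetical helper, and a stand-in workload is used so that the sketch runs without Qiskit or the QASM file.

```python
import statistics
import time


def time_runs(func, repeats=10):
    """Call func() `repeats` times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append((time.perf_counter() - start) * 1000)
    return durations


# Stand-in workload so the sketch runs on its own. With the benchmark above
# loaded, the call would instead be: time_runs(lambda: pass_.run(dag))
durations = time_runs(lambda: sum(range(100_000)))
print(
    f"median {statistics.median(durations):.2f} ms, "
    f"stdev {statistics.stdev(durations):.2f} ms"
)
```

Reporting a spread alongside the median makes it easier to judge whether two versions are statistically compatible rather than comparing single runs.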

Looking at

perf stat -e cache-misses -- python bench.py

the baseline is ~70M cache misses, then with 10 loops of pass_.run at the end, we find

  • 150M with 2.4.2
  • 340M with 2.5.0rc1
  • 140M with this patch

Correcting for the baseline, this means 2.5.0rc1 has 3-4x the cache-miss rate on this benchmark, and this patch restores the previous rate.
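
The baseline correction can be checked with a few lines of Python, using only the rounded counts quoted above and assuming the baseline is the script run without the `pass_.run` loops:

```python
baseline = 70_000_000  # ~70M misses before any pass_.run loop is added
totals = {
    "2.4.2": 150_000_000,
    "2.5.0rc1": 340_000_000,
    "this patch": 140_000_000,
}
# Misses attributable to the 10 loops of pass_.run alone.
net = {version: total - baseline for version, total in totals.items()}

ratio_vs_2_4_2 = net["2.5.0rc1"] / net["2.4.2"]        # 270M / 80M
ratio_vs_patch = net["2.5.0rc1"] / net["this patch"]   # 270M / 70M
print(f"{ratio_vs_2_4_2:.2f}x vs 2.4.2, {ratio_vs_patch:.2f}x vs this patch")
```

Both ratios fall between 3 and 4, which matches the 3-4x figure.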

AI/LLM disclosure

  • I didn't use LLM tooling, or only used it privately.
  • I used the following tool to help write this PR description:
  • I used the following tool to generate or modify code:

Footnotes

  1. 61e3ca0: Consolidate Target mappings (#15349)

  2. https://github.com/pnnl/QASMBench/blob/357b942396d5c2b7cbc1c229c585a6ef5ccaebac/large/wstate_n380/wstate_n380.qasm

@jakelishman jakelishman added this to the 2.5.0 milestone Jun 23, 2026
@jakelishman jakelishman requested a review from a team as a code owner June 23, 2026 10:33
@jakelishman jakelishman requested a review from gadial June 23, 2026 10:33
@jakelishman jakelishman added mod: transpiler Issues and PRs related to Transpiler Changelog: Performance Performance improvements without API and semantic changes. labels Jun 23, 2026
@qiskit-bot

Collaborator

One or more of the following people are relevant to this code:

  • @Qiskit/terra-core

---
performance:
- |
Fixed a performance regression only in v2.5.0rc1 when running :class:`.VF2Layout` and

Contributor

Do we need a reno if it's not released?

Member Author

I don't mind much either way here - happy to go with whichever people prefer.

Member

I'd argue probably not since we don't publish release notes for 2.5.0rc1 anywhere. The release notes get aggregated as part of the single 2.5.0 entry when published. It will look a bit odd sitting there next to all the performance improvements for 2.5.0 and then we say we fixed a regression in a pre-release with no mention of it anywhere else.

That being said I don't really care enough to block over this, we can always just delete it in #16454 if that's what we decide to do.

@Cryoris Cryoris left a comment

Contributor

The change looks good, though I would wait for Matt to re-run his benchmarks (or I can do it too but ofc with a less beefy machine 😛 )

@jakelishman jakelishman force-pushed the vf2/target-regression branch from d1617c3 to 83c55d8 Compare June 23, 2026 14:34
@jakelishman

Member Author

Force pushed just to include some more information in the commit message about the reason for the performance changes: it's the cache misses, with numbers.

@mtreinish mtreinish left a comment

Member

This LGTM, thanks for digging into this. I've confirmed the regression is fixed, and the hardware counters for cache hit rate show similar data: with this PR, the access patterns have better locality and fewer cache misses.

@mtreinish mtreinish enabled auto-merge June 23, 2026 14:41
@jakelishman jakelishman added the stable backport potential Make Mergify open a backport PR to the most recent stable branch on merge. label Jun 23, 2026
@mtreinish mtreinish added this pull request to the merge queue Jun 23, 2026
@mergify

mergify Bot commented Jun 23, 2026

Contributor

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

Merged via the queue into Qiskit:main with commit 9da141a Jun 23, 2026
28 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in Qiskit 2.5 Jun 23, 2026
@jakelishman jakelishman deleted the vf2/target-regression branch June 23, 2026 15:23
Labels

  • Changelog: Performance (Performance improvements without API and semantic changes.)
  • mod: transpiler (Issues and PRs related to Transpiler)
  • stable backport potential (Make Mergify open a backport PR to the most recent stable branch on merge.)

Projects

Status: Done

4 participants