Fix cache and prediction misses in Target::qargs#16476
Conversation
|
One or more of the following people are relevant to this code:
|
| --- | ||
| performance: | ||
| - | | ||
| Fixed a performance regression only in v2.5.0rc1 when running :class:`.VF2Layout` and |
There was a problem hiding this comment.
Do we need a reno if it's not released?
There was a problem hiding this comment.
I don't mind much either way here - happy to go with whichever people prefer.
There was a problem hiding this comment.
I'd argue probably not since we don't publish release notes for 2.5.0rc1 anywhere. The release notes get aggregated as part of the single 2.5.0 entry when published. It will look a bit odd sitting there next to all the performance improvements for 2.5.0 and then we say we fixed a regression in a pre-release with no mention of it anywhere else.
That being said I don't really care enough to block over this, we can always just delete it in #16454 if that's what we decide to do.
Cryoris
left a comment
There was a problem hiding this comment.
The change looks good, though I would wait for Matt to re-run his benchmarks (or I can do it too but ofc with a less beefy machine 😛 )
In 2.5.0rc1 we noticed a significant slowdown in VF2-dominated
all-to-all connectivity transpilation benchmarks, which this fixes.
Background
----------
We recently changed the internal hash-map data structures in the
`Target` to consolidate various properties, and avoid `IndexSet`
tracking overhead in the qargs tracking[^1]; it wasn't generally needed
for determinism. However, randomising the iteration order of the
`Qargs` means that graphs constructed from them (like
`Target::coupling_graph` or VF2's custom graph build) add their edges in
random orders. Edge and neighbour search/iteration methods on graphs
involve following a linked-list-like edge list, which is highly
susceptible to cache and branch-prediction problems; it's far faster if
these accesses are predictable. For our purposes here with all-to-all
targets, it's the cache properties that matter, and the
branch-prediction is about the same.
Swapping the `qargs_gate_map` back to `IndexSet` does not _itself_
enforce structure in the edge list, but in practice, a `Target` will be
constructed programmatically, and there will be some logical structure
in the construction. `IndexMap` preserves this, whereas randomisation
is almost guaranteed to be worse. We could attempt to optimise the edge
list, but sorting arbitrary lists would likely have worse overhead and
not significantly improve on most normal constructions.
Timing
------
Using the `wstate_n380.qasm` file from QASMBench[^2], we have the
following minimised benchmark reproducing the problem:
```python
from qiskit.circuit import QuantumCircuit
from qiskit.transpiler import (
generate_preset_pass_manager,
CouplingMap,
passes,
)
from qiskit.providers.fake_provider import GenericBackendV2
cmap = CouplingMap.from_full(380)
backend = GenericBackendV2(
cmap.size(),
coupling_map=cmap,
basis_gates=["id", "sx", "x", "rz", "cz"],
seed=42,
)
dag = QuantumCircuit.from_qasm_file("wstate_n380.qasm").to_dag()
pm = generate_preset_pass_manager(backend, seed_transpiler=42)
pass_ = passes.VF2Layout(
coupling_map=cmap,
seed=-1,
call_limit=(5_000_000, 10_000),
target=backend.target,
)
```
We time `pass_.run(dag)`. On 2.4.2, this takes about 250ms on my
machine. On 2.5.0rc1, it is about 550ms. This commit reverts the
timing back to durations statistically compatible with 2.4.2.
Looking at
```
perf -e cache-misses -- python bench.py
```
the baseline is ~70M cache misses, then with 10 loops of `pass_.run` at
the end, we find
- 150M with 2.4.2
- 340M with 2.5.0rc1
- 140M with this patch
Correcting for the baseline, this means 2.5.0rc1 has 3-4x the cache-miss
rate on this benchmark, and this patch restores the previous rate.
[^1]: 61e3ca0: Consolidate `Target` mappings (Qiskit#15349)
[^2]: https://github.com/pnnl/QASMBench/blob/357b942396d5c2b7cbc1c229c585a6ef5ccaebac/large/wstate_n380/wstate_n380.qasm
d1617c3 to
83c55d8
Compare
|
Force pushed just to include some more information in the commit message about the reason for the performance changes: it's the cache misses, with numbers. |
mtreinish
left a comment
There was a problem hiding this comment.
This LGTM, thanks for digging into this. I've confirmed the regression is fixed and looking at hardware counters on cache hit rate are similar showing data that with this PR it's fixing the access patterns for better locality and less cache missing.
|
Tick the box to add this pull request to the merge queue (same as
|
Fix cache and prediction misses in
Target::qargsIn 2.5.0rc1 we noticed a significant slowdown in VF2-dominated all-to-all connectivity transpilation benchmarks, which this fixes.
Background
We recently changed the internal hash-map data structures in the
Targetto consolidate various properties, and avoidIndexSettracking overhead in the qargs tracking1; it wasn't generally needed for determinism. However, randomising the iteration order of theQargsmeans that graphs constructed from them (likeTarget::coupling_graphor VF2's custom graph build) add their edges in random orders. Edge and neighbour search/iteration methods on graphs involve following a linked-list-like edge list, which is highly susceptible to cache and branch-prediction problems; it's far faster if these accesses are predictable. For our purposes here with all-to-all targets, it's the cache properties that matter, and the branch-prediction is about the same.Swapping the
qargs_gate_mapback toIndexSetdoes not itself enforce structure in the edge list, but in practice, aTargetwill be constructed programmatically, and there will be some logical structure in the construction.IndexMappreserves this, whereas randomisation is almost guaranteed to be worse. We could attempt to optimise the edge list, but sorting arbitrary lists would likely have worse overhead and not significantly improve on most normal constructions.Timing
Using the
wstate_n380.qasmfile from QASMBench2, we have the following minimised benchmark reproducing the problem:We time
pass_.run(dag). On 2.4.2, this takes about 250ms on my machine. On 2.5.0rc1, it is about 550ms. This commit reverts the timing back to durations statistically compatible with 2.4.2.Looking at
the baseline is ~70M cache misses, then with 10 loops of
pass_.runat the end, we findCorrecting for the baseline, this means 2.5.0rc1 has 3-4x the cache-miss rate on this benchmark, and this patch restores the previous rate.
AI/LLM disclosure
Footnotes
61e3ca0: Consolidate
Targetmappings (ConsolidateTargetmappings #15349) ↩https://github.com/pnnl/QASMBench/blob/357b942396d5c2b7cbc1c229c585a6ef5ccaebac/large/wstate_n380/wstate_n380.qasm ↩