For comparison, sequential allocation (qubits starting from 0) produced fidelities of 0.32–0.56, and the full error-aware partitioning approach took 183 seconds per run.

The lightweight approach achieves 83–98% of free-transpiler fidelity at negligible computational cost (~1–2 seconds). The key insight is that reading 2-qubit gate error rates from the backend target is essentially free, and even a simple scoring pass over this data dramatically improves partition quality. Scoring also considers subgraph diameter (lower = shorter SWAP paths) and internal edge count (more edges = better routing options), which helps avoid chain-shaped partitions on heavy-hex topologies.

For wider or deeper circuits where the transpiler may route through qubits outside the assigned partition, the system automatically falls back to pre-transpilation onto restricted coupling maps, trading some fidelity for guaranteed execution. See `doc/_design/parallel_partition_mapping_tech_note.md` for full technical details.
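The scoring pass described above (mean 2-qubit gate error, subgraph diameter, internal edge count) can be illustrated with a minimal, standard-library sketch. The function names, the weights, and the way the three terms are combined are assumptions made for illustration, not the project's actual scoring formula; the error rates are assumed to have already been read from the backend target into a plain dict keyed by edge.

```python
from collections import deque


def induced_edges(partition, coupling_edges):
    """Coupling-graph edges with both endpoints inside the partition."""
    members = set(partition)
    return [(a, b) for a, b in coupling_edges if a in members and b in members]


def diameter(partition, edges):
    """Longest shortest path inside the partition (inf if disconnected)."""
    adjacency = {q: set() for q in partition}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    worst = 0
    for source in adjacency:
        # Breadth-first search from each qubit gives hop distances.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        if len(dist) < len(adjacency):
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst


def score_partition(partition, coupling_edges, gate_error,
                    w_error=1.0, w_diameter=0.01, w_edges=0.01):
    """Lower is better. The weights are illustrative, not tuned values."""
    edges = induced_edges(partition, coupling_edges)
    if not edges:
        return float("inf")
    mean_error = sum(gate_error[e] for e in edges) / len(edges)
    return (w_error * mean_error
            + w_diameter * diameter(partition, edges)
            - w_edges * len(edges))


# Toy coupling graph: a 4-qubit ring (0-1-2-3) with a chain tail (3-4-5-6).
coupling_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4), (4, 5), (5, 6)]
gate_error = {edge: 0.01 for edge in coupling_edges}  # hypothetical values

ring = [0, 1, 2, 3]
chain = [3, 4, 5, 6]
best = min([ring, chain],
           key=lambda p: score_partition(p, coupling_edges, gate_error))
print("preferred partition:", best)
```

With equal gate errors, the ring scores better than the chain because it has a smaller diameter and one more internal edge, which is the behavior the text attributes to the scoring pass.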
### Array Batching — Handling More Circuits Than Partitions
When the number of circuits exceeds the number of available partitions, the system uses **array batching**: circuits are distributed round-robin across partitions, and Qiskit's `ParallelExperiment` composes one wide circuit per "round", with all rounds submitted as a single job.

For example, 30 circuits of width 8 on ibm_fez (11 partitions at gap=0):
- Each partition receives 2–3 circuits
- `ParallelExperiment` creates 3 composite circuits (one per round)
- All 3 composites are submitted as **one job**: one queue wait, one initialization
- Results are automatically decomposed back to 30 individual circuit results

This is significantly more efficient than submitting 30 individual jobs or even 3 separate parallel batches. In testing, Hamlib TFIM with 30 circuits of 8 qubits achieved fidelity comparable to sequential execution while using approximately **3x less billed execution time** (3 seconds vs. 10 seconds on IBM Quantum).
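The round-robin bookkeeping behind array batching can be sketched in plain Python. This is only an illustration of how circuits map to partitions and rounds; `round_robin_batches` is a hypothetical helper, not part of `ParallelExperiment` or of this project's API, and it works on circuit indices rather than real circuits.

```python
def round_robin_batches(num_circuits, num_partitions):
    """Assign circuit indices to partitions round-robin, then group by round.

    Returns (per_partition, rounds), where rounds[r] is a list of
    (partition_index, circuit_index) pairs that would be composed into
    one wide circuit.
    """
    if num_partitions < 1:
        raise ValueError("at least one partition is required")
    per_partition = [[] for _ in range(num_partitions)]
    for circuit in range(num_circuits):
        per_partition[circuit % num_partitions].append(circuit)
    num_rounds = max(len(queue) for queue in per_partition)
    rounds = [
        [(p, queue[r]) for p, queue in enumerate(per_partition) if len(queue) > r]
        for r in range(num_rounds)
    ]
    return per_partition, rounds


# The example from the text: 30 circuits on 11 partitions.
per_partition, rounds = round_robin_batches(30, 11)
print("circuits per partition:", [len(q) for q in per_partition])
print("circuits per round:", [len(r) for r in rounds])
```

For the 30-circuit case this yields three rounds, with the last round only partially filled, which matches the "2–3 circuits per partition" and "3 composite circuits" figures above.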
## Distributed Statevector Execution — Run Larger Circuits
Distributed statevector execution partitions and distributes a single circuit's statevector across multiple GPUs, enabling simulation of circuits that are too large for any one device.
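To see why a single device runs out of room, the arithmetic can be sketched as follows. An n-qubit statevector holds 2^n complex amplitudes; the sketch assumes double-precision amplitudes (16 bytes each), a power-of-two GPU count, and a layout in which the high-order bits of the amplitude index select the owning GPU. That layout is a common scheme but an assumption here, and the helper names are hypothetical; the project's actual partitioning may differ.

```python
BYTES_PER_AMPLITUDE = 16  # complex128: two 8-byte floats


def statevector_bytes(num_qubits):
    """Memory needed to hold the full statevector on one device."""
    return (1 << num_qubits) * BYTES_PER_AMPLITUDE


def shard_layout(num_qubits, num_gpus):
    """Split the statevector evenly across a power-of-two number of GPUs."""
    if num_gpus < 1 or num_gpus & (num_gpus - 1):
        raise ValueError("num_gpus must be a power of two")
    global_bits = num_gpus.bit_length() - 1  # index bits that pick the GPU
    if global_bits > num_qubits:
        raise ValueError("more GPUs than amplitudes")
    local_qubits = num_qubits - global_bits
    return {
        "local_qubits": local_qubits,
        "amplitudes_per_gpu": 1 << local_qubits,
        "bytes_per_gpu": (1 << local_qubits) * BYTES_PER_AMPLITUDE,
    }


def owner_of(index, num_qubits, num_gpus):
    """Return (gpu_rank, local_offset) for a global amplitude index."""
    local_qubits = shard_layout(num_qubits, num_gpus)["local_qubits"]
    return index >> local_qubits, index & ((1 << local_qubits) - 1)


GIB = 1 << 30
layout = shard_layout(34, 8)
print("34 qubits, one device:", statevector_bytes(34) // GIB, "GiB")
print("34 qubits, 8 GPUs:", layout["bytes_per_gpu"] // GIB, "GiB per GPU")
```

Under these assumptions, each added qubit doubles the total memory, while each doubling of the GPU count halves the per-device share; gates acting on the "global" index bits are the ones that require communication between devices.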