Commit 0d823e2

Add mixed-width parallel execution (Phase 3) with simulator max qubits setting
- _find_multi_width_partitions(): round-robin across widths, largest first
- _group_circuits_by_width(): sort circuits into width groups
- _pad_circuit(): add idle qubits for smaller circuits in larger partitions
- parallel_simulator_max_qubits (default 16): caps simulator qubit budget
- Simulator uses spacing=0 (no crosstalk on simulators)
- Single code path: same-width degenerates to Phase 2 naturally
- Tested on simulator: same-width 8q, mixed 6/7/8q with padding
- Added flow diagram: doc/_design/parallel_execution_flow_diagram.md
1 parent 8134c30 commit 0d823e2

5 files changed

Lines changed: 1379 additions & 54 deletions

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
# Parallel and Multi-Device Execution: Design Notes

## Two User Goals

Users reach for multi-device execution for one of two distinct reasons:

### Goal 1: Speed ("Run my circuits faster")

The user has many circuits to execute and wants to reduce total wall-clock time. The solution is to run multiple circuits simultaneously on separate execution targets.

- **Qiskit**: Map multiple circuits onto disjoint qubit regions of a single large QPU. A 156-qubit device running 20-qubit circuits can execute ~6 simultaneously (the raw bound is floor(156 / 20) = 7, before any spacing or connectivity constraints).
- **CUDA-Q**: Distribute circuits across multiple GPUs via MPI. Each GPU runs a different circuit independently.

In both cases, the user just wants "go parallel." The implementation details (qubit mapping vs. GPU distribution) are handled by the execution engine.

If the system can't actually parallelize (Qiskit device too small, CUDA-Q without MPI or with only 1 GPU), execution falls back to sequential automatically. This is not an error; the user sees only an informational message.
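
The fallback rule can be sketched as a small pure function. This is only an illustration of the behavior described above: `cudaq_can_parallelize`, `choose_execution_mode`, and the log wording are hypothetical names chosen for this sketch, not the project's actual code, and the CUDA-Q check simply encodes "more than one MPI rank and more than one GPU".

```python
import logging

logger = logging.getLogger("parallel_fallback_sketch")


def cudaq_can_parallelize(mpi_ranks: int, num_gpus: int) -> bool:
    """Hypothetical check: circuit distribution needs MPI and at least 2 GPUs."""
    return mpi_ranks > 1 and num_gpus > 1


def choose_execution_mode(requested_parallel: bool,
                          can_parallelize: bool,
                          reason: str = "") -> str:
    """Return "parallel" or "sequential"; never raise when parallel is unavailable."""
    if not requested_parallel:
        return "sequential"
    if can_parallelize:
        return "parallel"
    # Falling back is informational only, not an error.
    logger.info("Parallel execution unavailable (%s); running sequentially.",
                reason or "no reason given")
    return "sequential"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    ok = cudaq_can_parallelize(mpi_ranks=1, num_gpus=4)
    print(choose_execution_mode(True, ok, reason="no MPI"))
```

Keeping the capability check separate from the mode decision means the same fallback path could serve both backends, with only the check differing.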

### Goal 2: Scale ("Run larger circuits")

The user has circuits that are too wide for a single device and needs more qubits (or more GPU memory). The solution is to distribute the statevector across multiple devices.

- **Qiskit**: Not currently applicable (QPUs have fixed qubit counts; circuit cutting is a different approach).
- **CUDA-Q**: The statevector is partitioned and distributed across multiple GPUs (NVIDIA's mgpu backend). Four GPUs with 32 GB each provide 128 GB in total. Because each additional qubit doubles the statevector size, 4x the memory adds about 2 qubits of capacity.

This is NOT parallelization: only one circuit runs at a time. The statevector is distributed across GPUs to fit a problem too large for any single device.
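
The "4x memory, about 2 more qubits" arithmetic can be checked with a short calculation. The sketch below assumes double-precision complex amplitudes (16 bytes each), binary gigabytes, and no simulator workspace overhead; `max_statevector_qubits` is a hypothetical helper, and real capacity will be somewhat lower.

```python
GIB = 1024 ** 3  # binary gigabyte, an assumption for this sketch


def max_statevector_qubits(total_memory_bytes: int,
                           bytes_per_amplitude: int = 16) -> int:
    """Largest n such that 2**n amplitudes fit in the given memory."""
    amplitudes = total_memory_bytes // bytes_per_amplitude
    if amplitudes < 1:
        raise ValueError("not enough memory for a single amplitude")
    # bit_length() - 1 is floor(log2(amplitudes)) for positive integers.
    return amplitudes.bit_length() - 1


if __name__ == "__main__":
    one_gpu = max_statevector_qubits(32 * GIB)
    four_gpus = max_statevector_qubits(4 * 32 * GIB)
    print(one_gpu, four_gpus, four_gpus - one_gpu)
```

Under these assumptions, doubling the GPU count adds one qubit, which is why distributing the statevector extends the reachable width only logarithmically in the number of devices.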

## Why "Parallel" Is Confusing

The CUDA-Q multi-GPU statevector distribution is sometimes called "parallel" because the GPUs work together simultaneously. But from the user's perspective, it's the opposite of parallel execution:

| | Goal 1: Speed | Goal 2: Scale |
|---|---|---|
| Circuits running simultaneously | Multiple | One |
| Devices per circuit | One | Multiple |
| User wants | Faster completion | Bigger problems |
| CUDA-Q mechanism | MPI circuit distribution (mqpu) | MPI statevector distribution (mgpu) |
| Qiskit mechanism | Qubit mapping | N/A |

We use "parallel" exclusively for Goal 1 (speed). For Goal 2, we use "distributed statevector", aligning with NVIDIA's terminology for mgpu mode ("partition and distribute the state vector").

## CLI Interface (Design Discussion)

### For Speed (Goal 1): `--parallel` / `-p`

A simple flag that enables parallel circuit execution:

```bash
# Qiskit: maps circuits onto disjoint qubit regions
python benchmark.py -a qiskit -p

# CUDA-Q: distributes circuits across GPUs (requires MPI)
mpirun -np 4 python -m mpi4py benchmark.py -a cudaq -p
```

When `-p` is set:
- Qiskit: `execute.parallel_execution = True` → routes to qubit-mapped execution
- CUDA-Q: `execute.parallel_execution = True` → equivalent to `gpus_per_circuit=1`, distributes circuits across MPI ranks

If parallelization isn't possible (insufficient qubits, no MPI, single GPU), execution proceeds sequentially with an informational message.

### For Scale (Goal 2): `--gpus_per_circuit` / `-gpc`

Controls how many GPUs participate in distributing the statevector per circuit (CUDA-Q only):

```bash
# All 4 GPUs distribute the statevector (maximum capacity)
mpirun -np 4 python -m mpi4py benchmark.py -a cudaq -gpc 4

# 2 GPUs per statevector (2 circuits can run in parallel on 4 GPUs)
mpirun -np 4 python -m mpi4py benchmark.py -a cudaq -gpc 2
```

Note: `-gpc 1` is equivalent to `-p` for CUDA-Q.

### Interaction Between Flags

| Flags | Behavior |
|---|---|
| (none) | Sequential. CUDA-Q with MPI defaults to distributed statevector (all GPUs per circuit). |
| `-p` | Parallel circuit execution. Each device runs one circuit independently. |
| `-gpc N` | N GPUs distribute the statevector per circuit. `N=1` is parallel; `N=total` is full distribution; anything in between is hybrid. |
| `-p -gpc N` | `-p` is redundant when `-gpc` is specified, because `gpc` controls the mode precisely. |

### Default MPI Behavior (CUDA-Q)

When running under MPI without any flags, CUDA-Q defaults to distributed statevector mode (all GPUs contribute to one circuit). This is the "scale" mode, not the "speed" mode. Users who want speed must explicitly request `-p` or `-gpc 1`.

This default exists because statevector distribution is the safer choice: it works for any circuit that fits in the combined GPU memory. Parallel distribution requires that each circuit fits on a single GPU, which may not be true for large simulations.
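
The flag rules above can be summarized in a short sketch. It uses the standard `argparse` module to declare only the two flags under discussion (other options such as `-a` are omitted), and a hypothetical `resolve_mode` helper that returns the mode, the GPUs used per circuit, and how many circuits run at once. Using integer division for the hybrid case is an assumption; these notes do not say how a non-divisible GPU count is handled.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="flag sketch (illustrative only)")
    parser.add_argument("-p", "--parallel", action="store_true")
    parser.add_argument("-gpc", "--gpus_per_circuit", type=int, default=None)
    return parser


def resolve_mode(parallel: bool, gpus_per_circuit, total_gpus: int):
    """Return (mode, gpus_per_circuit, concurrent_circuits) for CUDA-Q."""
    if total_gpus < 1:
        raise ValueError("total_gpus must be at least 1")
    if gpus_per_circuit is not None and not 1 <= gpus_per_circuit <= total_gpus:
        raise ValueError("gpus_per_circuit must be between 1 and total_gpus")
    if total_gpus == 1:
        return ("sequential", 1, 1)           # nothing to distribute
    if gpus_per_circuit is None:
        if parallel:
            return ("parallel", 1, total_gpus)
        return ("distributed", total_gpus, 1)  # default under MPI
    # When -gpc is given it decides the mode; -p adds nothing.
    if gpus_per_circuit == 1:
        return ("parallel", 1, total_gpus)
    if gpus_per_circuit == total_gpus:
        return ("distributed", total_gpus, 1)
    return ("hybrid", gpus_per_circuit, total_gpus // gpus_per_circuit)


if __name__ == "__main__":
    args = build_parser().parse_args(["-gpc", "2"])
    print(resolve_mode(args.parallel, args.gpus_per_circuit, total_gpus=4))
```

Treating `-gpc` as the single source of truth whenever it is present keeps the `-p -gpc N` row of the table free of conflicts.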

## Programmatic Interface

```python
import execute as ex

# `circuits` and `circuit_groups` are assumed to be built by the caller.

# Goal 1: Speed (parallel execution)
ex.parallel_execution = True
job_id, result = ex.execute_circuits(circuits, num_shots=1000)

# Goal 1: Speed (parallel groups, different shot counts per group)
ex.parallel_execution = True
job_id, group_results = ex.execute_circuit_groups(
    circuit_groups, num_shots_list=[1000, 500, 200])

# Goal 2: Scale (distributed statevector; CUDA-Q only, via CLI or gpus_per_circuit)
job_id, result = ex.execute_circuits(
    circuits, num_shots=1000, gpus_per_circuit=4)
```

## Group-Level Execution

`execute_circuit_groups()` executes groups of circuits where each group can have a different shot count. This is essential for workflows like Hamiltonian observable estimation, where groups of commuting Pauli terms are measured with shots weighted by coefficient magnitude.

```python
# 3 groups, different shot counts
circuit_groups = [
    [circuit_a1, circuit_a2],              # high-weight terms
    [circuit_b1],                          # medium-weight terms
    [circuit_c1, circuit_c2, circuit_c3],  # low-weight terms
]
num_shots_list = [1000, 500, 200]

job_id, group_results = ex.execute_circuit_groups(
    circuit_groups, num_shots_list=num_shots_list)
```
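
One way a caller might derive such a `num_shots_list` is to split a total shot budget in proportion to each group's weight. The sketch below is an assumption for illustration: these notes do not specify the weighting formula, and `allocate_shots` and the per-group weights are hypothetical. It uses largest-remainder rounding so the counts always sum to the budget.

```python
def allocate_shots(group_weights, total_shots: int):
    """Split total_shots across groups in proportion to |weight|."""
    magnitudes = [abs(w) for w in group_weights]
    total_weight = sum(magnitudes)
    if total_weight == 0:
        raise ValueError("at least one group weight must be non-zero")
    exact = [total_shots * m / total_weight for m in magnitudes]
    shots = [int(x) for x in exact]  # round down first
    leftover = total_shots - sum(shots)
    # Hand the remaining shots to the groups with the largest fractional part.
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - shots[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        shots[i] += 1
    return shots


if __name__ == "__main__":
    # Hypothetical weights, e.g. summed |coefficient| per commuting group.
    print(allocate_shots([10, 5, 2], total_shots=1700))
```

With weights in the ratio 10 : 5 : 2 and a budget of 1700 shots, this reproduces the 1000 / 500 / 200 split used in the example above.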

When `parallel_execution` is True:
- **CUDA-Q**: Groups are distributed across GPUs. Each GPU processes its assigned groups sequentially, but multiple groups run simultaneously across GPUs.
- **Qiskit**: Circuits from different groups (with the same shot count) are composed onto disjoint qubit regions for simultaneous execution.

When `parallel_execution` is False (or parallelization is not available):
- Groups execute sequentially, with each group's circuits passed to `execute_circuits()`.
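
The two parallel strategies can be sketched without any quantum library. Both helpers below are hypothetical: round-robin assignment of groups to GPUs is an assumption (these notes do not say how groups are assigned), and `bucket_by_shots` only illustrates collecting circuits that share a shot count so they could be composed into one job.

```python
from collections import defaultdict


def assign_groups_to_gpus(num_groups: int, num_gpus: int):
    """CUDA-Q side: group i goes to GPU i % num_gpus (assumed round-robin)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    assignment = {gpu: [] for gpu in range(num_gpus)}
    for group_index in range(num_groups):
        assignment[group_index % num_gpus].append(group_index)
    return assignment


def bucket_by_shots(circuit_groups, num_shots_list):
    """Qiskit side: collect circuits from all groups that share a shot count."""
    if len(circuit_groups) != len(num_shots_list):
        raise ValueError("need one shot count per group")
    buckets = defaultdict(list)
    for group_index, (group, shots) in enumerate(zip(circuit_groups, num_shots_list)):
        for circuit_index, circuit in enumerate(group):
            # Keep the indices so results can be routed back to their group.
            buckets[shots].append((group_index, circuit_index, circuit))
    return dict(buckets)


if __name__ == "__main__":
    print(assign_groups_to_gpus(num_groups=5, num_gpus=2))
    print(bucket_by_shots([["a1", "a2"], ["b1"], ["c1"]], [1000, 500, 1000]))
```

Carrying the group and circuit indices alongside each circuit is what lets per-group results be reassembled after a combined run.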

## Qubit Width Considerations (Qiskit)

For qubit-mapped parallel execution, the device must have enough qubits to hold multiple circuits simultaneously. Comparing the device's qubit count with the widest circuit gives three cases:

- **>= 2x max circuit width**: Can parallelize (map 2+ circuits)
- **>= 1x but < 2x**: Cannot parallelize; sequential fallback with an informational message
- **< 1x max circuit width**: Error, because the circuits are too wide for the device

Within a group, the widest circuit determines the qubit allocation for that group. Narrower circuits use a subset of the allocated region.
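
The three width cases, plus a rough count of how many regions fit, can be sketched as follows. Both functions are hypothetical. `max_regions` treats the device as a simple line of qubits with an optional gap between regions (the commit message mentions a `spacing` setting), and ignores real connectivity, so it is an upper bound rather than a placement algorithm.

```python
def classify_device_fit(device_qubits: int, circuit_widths) -> str:
    """Apply the 2x / 1x thresholds using the widest circuit."""
    max_width = max(circuit_widths)
    if device_qubits < max_width:
        raise ValueError(
            f"circuit width {max_width} exceeds device size {device_qubits}")
    if device_qubits >= 2 * max_width:
        return "parallel"
    return "sequential"  # fits once, but not twice


def max_regions(device_qubits: int, region_width: int, spacing: int = 0) -> int:
    """Upper bound on disjoint regions: n*width + (n-1)*spacing <= device_qubits."""
    if region_width < 1 or spacing < 0:
        raise ValueError("region_width must be >= 1 and spacing >= 0")
    return (device_qubits + spacing) // (region_width + spacing)


if __name__ == "__main__":
    print(classify_device_fit(156, [18, 20, 12]))
    print(max_regions(156, 20), max_regions(156, 20, spacing=3))
```

In this simplified model, a 156-qubit device holds 7 regions of 20 qubits with no spacing and 6 with a 3-qubit gap, consistent with the rough "~6" estimate given earlier.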

## Current Implementation Status

| Feature | Qiskit | CUDA-Q |
|---|---|---|
| Circuit-level parallel (`-p`) | Stub (sequential fallback) | Working via MPI |
| Group-level parallel | Stub (sequential fallback) | Stub (circuit-level works within groups) |
| Distributed statevector (`-gpc N`) | N/A | Working (default MPI behavior) |
| `parallel_execution` flag | Yes | Yes |
| `execute_circuit_groups()` | Yes | Yes |

Qiskit parallel implementation (qubit mapping via ParallelExperiment) is in development.
