| 1 | +# Parallel and Multi-Device Execution — Design Notes |
| 2 | + |
| 3 | +## Two User Goals |
| 4 | + |
| 5 | +Users reach for multi-device execution for one of two distinct reasons: |
| 6 | + |
| 7 | +### Goal 1: Speed — "Run my circuits faster" |
| 8 | + |
| 9 | +The user has many circuits to execute and wants to reduce total wall-clock time. The solution is to run multiple circuits simultaneously on separate execution targets. |
| 10 | + |
| 11 | +- **Qiskit**: Map multiple circuits onto disjoint qubit regions of a single large QPU. A 156-qubit device running 20-qubit circuits can execute ~6 simultaneously (at most 7 by qubit count alone, since 156 / 20 = 7.8). |
| 12 | +- **CUDA-Q**: Distribute circuits across multiple GPUs via MPI. Each GPU runs a different circuit independently. |
| 13 | + |
| 14 | +In both cases, the user just wants to "go parallel." The implementation details (qubit mapping vs. GPU distribution) are handled by the execution engine. |
| 15 | + |
| 16 | +If the system cannot actually parallelize (Qiskit device too small, CUDA-Q without MPI or with only one GPU), execution automatically falls back to sequential mode. This is not an error; the system only emits an informational message. |
| 17 | + |
| 18 | +### Goal 2: Scale — "Run larger circuits" |
| 19 | + |
| 20 | +The user has circuits that are too wide for a single device and needs more qubits (or more GPU memory). The solution is to distribute the statevector across multiple devices. |
| 21 | + |
| 22 | +- **Qiskit**: Not currently applicable (QPUs have fixed qubit counts; circuit cutting is a different approach). |
| 23 | +- **CUDA-Q**: The statevector is partitioned and distributed across multiple GPUs (NVIDIA's mgpu backend). Four GPUs with 32 GB each provide 128 GB of combined memory; because each additional qubit doubles the statevector size, 4x the memory adds ~2 qubits of capacity. |
| 24 | + |
| 25 | +This is NOT parallelization — only one circuit runs at a time. The statevector is distributed across GPUs to fit a problem too large for any single device. |
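
The capacity arithmetic for Goal 2 can be sketched as follows. This is a rough estimate only: the helper name `max_statevector_qubits` is hypothetical, "32 GB" is treated as 32 GiB, each amplitude is assumed to take 8 bytes (single-precision complex), and simulator workspace overhead is ignored, so real limits will be somewhat lower.

```python
GIB = 1024 ** 3


def max_statevector_qubits(total_memory_bytes: int, bytes_per_amplitude: int = 8) -> int:
    """Largest n such that 2**n amplitudes fit in the given memory.

    Assumption: the whole memory budget is available for amplitudes.
    """
    amplitudes = total_memory_bytes // bytes_per_amplitude
    if amplitudes < 1:
        raise ValueError("memory budget is smaller than one amplitude")
    # floor(log2(amplitudes)) computed exactly with integer arithmetic
    return amplitudes.bit_length() - 1


single_gpu = max_statevector_qubits(32 * GIB)        # one 32 GB GPU
four_gpus = max_statevector_qubits(4 * 32 * GIB)     # four GPUs pooled
print(single_gpu, four_gpus, four_gpus - single_gpu)
```

Under these assumptions, quadrupling the memory raises the limit by two qubits regardless of the amplitude size, because each added qubit doubles the number of amplitudes.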
| 26 | + |
| 27 | +## Why "Parallel" Is Confusing |
| 28 | + |
| 29 | +The CUDA-Q multi-GPU statevector distribution is sometimes called "parallel" because the GPUs work together simultaneously. But from the user's perspective, it's the opposite of parallel execution: |
| 30 | + |
| 31 | +| | Goal 1: Speed | Goal 2: Scale | |
| 32 | +|---|---|---| |
| 33 | +| Circuits running simultaneously | Multiple | One | |
| 34 | +| Devices per circuit | One | Multiple | |
| 35 | +| User wants | Faster completion | Bigger problems | |
| 36 | +| CUDA-Q mechanism | MPI circuit distribution (mqpu) | MPI statevector distribution (mgpu) | |
| 37 | +| Qiskit mechanism | Qubit mapping | N/A | |
| 38 | + |
| 39 | +We use "parallel" exclusively for Goal 1 (speed). For Goal 2, we use "distributed statevector" — aligning with NVIDIA's terminology for mgpu mode ("partition and distribute the state vector"). |
| 40 | + |
| 41 | +## CLI Interface (Design Discussion) |
| 42 | + |
| 43 | +### For Speed (Goal 1): `--parallel` / `-p` |
| 44 | + |
| 45 | +A simple flag that enables parallel circuit execution: |
| 46 | + |
| 47 | +```bash |
| 48 | +# Qiskit — maps circuits onto disjoint qubit regions |
| 49 | +python benchmark.py -a qiskit -p |
| 50 | + |
| 51 | +# CUDA-Q — distributes circuits across GPUs (requires MPI) |
| 52 | +mpirun -np 4 python -m mpi4py benchmark.py -a cudaq -p |
| 53 | +``` |
| 54 | + |
| 55 | +When `-p` is set: |
| 56 | +- Qiskit: `execute.parallel_execution = True` → routes to qubit-mapped execution |
| 57 | +- CUDA-Q: `execute.parallel_execution = True` → equivalent to `gpus_per_circuit=1`, distributes circuits across MPI ranks |
| 58 | + |
| 59 | +If parallelization isn't possible (insufficient qubits, no MPI, single GPU), execution proceeds sequentially with an informational message. |
| 60 | + |
| 61 | +### For Scale (Goal 2): `--gpus_per_circuit` / `-gpc` |
| 62 | + |
| 63 | +Controls how many GPUs participate in distributing the statevector per circuit (CUDA-Q only): |
| 64 | + |
| 65 | +```bash |
| 66 | +# All 4 GPUs distribute the statevector (maximum capacity) |
| 67 | +mpirun -np 4 python -m mpi4py benchmark.py -a cudaq -gpc 4 |
| 68 | + |
| 69 | +# 2 GPUs per statevector (2 circuits can run in parallel on 4 GPUs) |
| 70 | +mpirun -np 4 python -m mpi4py benchmark.py -a cudaq -gpc 2 |
| 71 | +``` |
| 72 | + |
| 73 | +Note: `-gpc 1` is equivalent to `-p` for CUDA-Q. |
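
One way to picture what `-gpc N` implies is to split the MPI ranks into contiguous blocks of N. The sketch below is an illustration, not the benchmark's actual code: the function name `rank_layout` is hypothetical, and it assumes one MPI rank per GPU and that the rank count is divisible by N.

```python
def rank_layout(world_size: int, gpus_per_circuit: int) -> list[list[int]]:
    """Partition ranks 0..world_size-1 into blocks of gpus_per_circuit.

    Each block would cooperate on one circuit's statevector; the number
    of blocks is the number of circuits that can run at the same time.
    """
    if gpus_per_circuit < 1 or gpus_per_circuit > world_size:
        raise ValueError("gpus_per_circuit must be between 1 and world_size")
    if world_size % gpus_per_circuit != 0:
        raise ValueError("world_size must be divisible by gpus_per_circuit")
    return [
        list(range(start, start + gpus_per_circuit))
        for start in range(0, world_size, gpus_per_circuit)
    ]


print(rank_layout(4, 4))  # one block: full statevector distribution
print(rank_layout(4, 2))  # two blocks: hybrid
print(rank_layout(4, 1))  # four blocks: same layout as -p
```

In an MPI program, the block index `rank // gpus_per_circuit` is the kind of value that could serve as the `color` argument to mpi4py's `Comm.Split`, although whether the execution engine builds sub-communicators this way is not stated here.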
| 74 | + |
| 75 | +### Interaction Between Flags |
| 76 | + |
| 77 | +| Flags | Behavior | |
| 78 | +|---|---| |
| 79 | +| (none) | Sequential. CUDA-Q with MPI defaults to distributed statevector (all GPUs per circuit). | |
| 80 | +| `-p` | Parallel circuit execution. Each device runs one circuit independently. | |
| 81 | +| `-gpc N` | N GPUs distribute the statevector per circuit. `N=1` is parallel; `N=total` is full distribution; values in between are hybrid (several circuits run at once, each spread over N GPUs). |
| 82 | +| `-p -gpc N` | `-p` is redundant when `-gpc` is specified — `gpc` controls the mode precisely. | |
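
The precedence rules in the table can be written down as a small resolver. This is a sketch of the rules as described, with hypothetical function names (`resolve_gpus_per_circuit`, `describe_mode`); it is not taken from the benchmark source.

```python
from typing import Optional


def resolve_gpus_per_circuit(
    parallel: bool, gpus_per_circuit: Optional[int], total_gpus: int
) -> int:
    """Apply the flag precedence: -gpc wins, then -p, then the default."""
    if gpus_per_circuit is not None:
        if not 1 <= gpus_per_circuit <= total_gpus:
            raise ValueError("gpus_per_circuit must be between 1 and total_gpus")
        return gpus_per_circuit  # -p is redundant when -gpc is given
    if parallel:
        return 1                 # -p alone behaves like -gpc 1
    return total_gpus            # no flags: all GPUs share one circuit


def describe_mode(gpus_per_circuit: int, total_gpus: int) -> str:
    """Name the execution mode implied by the resolved value."""
    if total_gpus <= 1:
        return "sequential"
    if gpus_per_circuit == 1:
        return "parallel"
    if gpus_per_circuit == total_gpus:
        return "distributed statevector"
    return "hybrid"


for flags in [(False, None), (True, None), (False, 2), (True, 2)]:
    gpc = resolve_gpus_per_circuit(*flags, total_gpus=4)
    print(flags, "->", gpc, describe_mode(gpc, 4))
```

Resolving everything to a single `gpus_per_circuit` value keeps the two flags from ever disagreeing: `-p` only matters when `-gpc` is absent.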
| 83 | + |
| 84 | +### Default MPI Behavior (CUDA-Q) |
| 85 | + |
| 86 | +When running under MPI without any flags, CUDA-Q defaults to distributed statevector mode (all GPUs contribute to one circuit). This is the "scale" mode, not the "speed" mode. Users who want speed must explicitly request `-p` or `-gpc 1`. |
| 87 | + |
| 88 | +This default exists because statevector distribution is the safer choice: it gives every circuit the combined memory of all GPUs, so it accommodates the widest circuits the hardware can simulate. Parallel circuit execution requires that each circuit fit on a single GPU, which may not be true for large simulations. |
| 89 | + |
| 90 | +## Programmatic Interface |
| 91 | + |
| 92 | +```python |
| 93 | +import execute as ex |
| 94 | + |
| 95 | +# Goal 1: Speed — parallel execution |
| 96 | +ex.parallel_execution = True |
| 97 | +job_id, result = ex.execute_circuits(circuits, num_shots=1000) |
| 98 | + |
| 99 | +# Goal 1: Speed — parallel groups (different shot counts per group) |
| 100 | +ex.parallel_execution = True |
| 101 | +job_id, group_results = ex.execute_circuit_groups( |
| 102 | +    circuit_groups, num_shots_list=[1000, 500, 200]) |
| 103 | + |
| 104 | +# Goal 2: Scale — distributed statevector (CUDA-Q only, via CLI or gpus_per_circuit) |
| 105 | +job_id, result = ex.execute_circuits( |
| 106 | +    circuits, num_shots=1000, gpus_per_circuit=4) |
| 107 | +``` |
| 108 | + |
| 109 | +## Group-Level Execution |
| 110 | + |
| 111 | +`execute_circuit_groups()` executes groups of circuits where each group can have a different shot count. This is essential for workflows like Hamiltonian observable estimation, where Pauli commuting groups are measured with shots weighted by coefficient magnitude. |
| 112 | + |
| 113 | +```python |
| 114 | +# 3 groups, different shot counts |
| 115 | +circuit_groups = [ |
| 116 | +    [circuit_a1, circuit_a2],              # high-weight terms |
| 117 | +    [circuit_b1],                          # medium-weight terms |
| 118 | +    [circuit_c1, circuit_c2, circuit_c3],  # low-weight terms |
| 119 | +] |
| 120 | +num_shots_list = [1000, 500, 200] |
| 121 | + |
| 122 | +job_id, group_results = ex.execute_circuit_groups( |
| 123 | +    circuit_groups, num_shots_list=num_shots_list) |
| 124 | +``` |
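
The per-group shot counts have to come from somewhere. One plausible scheme, shown here only as an assumption, is to split a total shot budget in proportion to each group's coefficient magnitude, using largest-remainder rounding so the counts sum exactly to the budget. The helper `allocate_shots` and the example weights are hypothetical; the weights were chosen so the result matches the shot counts used above.

```python
import math


def allocate_shots(weights: list[float], total_shots: int) -> list[int]:
    """Split total_shots across groups in proportion to |weight|."""
    total_weight = sum(abs(w) for w in weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    exact = [total_shots * abs(w) / total_weight for w in weights]
    shots = [math.floor(x) for x in exact]
    # Hand out the shots lost to rounding, largest fractional part first.
    leftover = total_shots - sum(shots)
    by_remainder = sorted(
        range(len(weights)), key=lambda i: exact[i] - shots[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shots[i] += 1
    return shots


# Hypothetical summed coefficient magnitudes for three commuting groups
num_shots_list = allocate_shots([10.0, 5.0, 2.0], total_shots=1700)
print(num_shots_list)  # [1000, 500, 200]
```

The resulting list can be passed directly as `num_shots_list`. Other weightings (for example, proportional to variance estimates) would fit the same interface.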
| 125 | + |
| 126 | +When `parallel_execution` is True: |
| 127 | +- **CUDA-Q**: Groups are distributed across GPUs. Each GPU processes its assigned groups sequentially, but multiple groups run simultaneously across GPUs. |
| 128 | +- **Qiskit**: Circuits from different groups (with the same shot count) are composed onto disjoint qubit regions for simultaneous execution. |
| 129 | + |
| 130 | +When `parallel_execution` is False (or parallelization is not available): |
| 131 | +- Groups execute sequentially, with each group's circuits passed to `execute_circuits()`. |
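
The two parallel strategies above can be illustrated with plain bookkeeping on group indices. Both helpers are hypothetical and make assumptions the notes do not confirm: round-robin is just one possible assignment policy for CUDA-Q, and bucketing by shot count reflects only the stated Qiskit constraint that co-scheduled circuits share a shot count.

```python
from collections import defaultdict


def assign_groups_round_robin(num_groups: int, num_gpus: int) -> dict[int, list[int]]:
    """CUDA-Q style: give each GPU a list of group indices to run in order."""
    assignment: dict[int, list[int]] = {gpu: [] for gpu in range(num_gpus)}
    for group_index in range(num_groups):
        assignment[group_index % num_gpus].append(group_index)
    return assignment


def bucket_groups_by_shots(num_shots_list: list[int]) -> dict[int, list[int]]:
    """Qiskit style: groups with equal shot counts may share one execution."""
    buckets: dict[int, list[int]] = defaultdict(list)
    for group_index, shots in enumerate(num_shots_list):
        buckets[shots].append(group_index)
    return dict(buckets)


print(assign_groups_round_robin(num_groups=3, num_gpus=2))
print(bucket_groups_by_shots([1000, 500, 1000]))
```

With three groups on two GPUs, one GPU ends up with two groups and runs them back to back, which matches the "sequential within a GPU, simultaneous across GPUs" description.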
| 132 | + |
| 133 | +## Qubit Width Considerations (Qiskit) |
| 134 | + |
| 135 | +For qubit-mapped parallel execution, the device must have enough qubits to hold multiple circuits simultaneously: |
| 136 | + |
| 137 | +- **Device qubits >= 2x max circuit width**: Can parallelize (map 2+ circuits) |
| 138 | +- **Device qubits >= 1x but < 2x max circuit width**: Cannot parallelize; sequential fallback with an informational message |
| 139 | +- **Device qubits < 1x max circuit width**: Error, because the circuits are too wide for the device |
| 140 | + |
| 141 | +Within a group, the widest circuit determines the qubit allocation for that group. Narrower circuits use a subset of the allocated region. |
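
The width rules above amount to a small decision function. The sketch below is an assumption-laden illustration: `plan_qubit_mapping` is a hypothetical name, and it counts qubits only, ignoring device connectivity and any spacing between regions, so the practical number of slots may be lower.

```python
def plan_qubit_mapping(device_qubits: int, circuit_widths: list[int]) -> tuple[str, int]:
    """Return (mode, slots) for mapping circuits onto one device.

    The widest circuit sets the region size; slots is how many such
    regions fit on the device by qubit count alone.
    """
    max_width = max(circuit_widths)
    if max_width > device_qubits:
        raise ValueError(
            f"circuit width {max_width} exceeds device size {device_qubits}"
        )
    slots = device_qubits // max_width
    mode = "parallel" if slots >= 2 else "sequential"
    return mode, slots


print(plan_qubit_mapping(156, [20, 12, 20]))  # ('parallel', 7)
print(plan_qubit_mapping(27, [20]))           # ('sequential', 1)
```

A caller would treat the `"sequential"` result as the informational fallback case and let the `ValueError` surface as the hard error.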
| 142 | + |
| 143 | +## Current Implementation Status |
| 144 | + |
| 145 | +| Feature | Qiskit | CUDA-Q | |
| 146 | +|---|---|---| |
| 147 | +| Circuit-level parallel (`-p`) | Stub (sequential fallback) | Working via MPI | |
| 148 | +| Group-level parallel | Stub (sequential fallback) | Stub (circuit-level works within groups) | |
| 149 | +| Distributed statevector (`-gpc N`) | N/A | Working (default MPI behavior) | |
| 150 | +| `parallel_execution` flag | Yes | Yes | |
| 151 | +| `execute_circuit_groups()` | Yes | Yes | |
| 152 | + |
| 153 | +The Qiskit parallel implementation (qubit mapping via `ParallelExperiment`) is in development. |