Extend cosim-gpu from single MI300X GPU co-simulation (QEMU + gem5) to multi-GPU hive simulation with xGMI interconnect modeling (Path A), and eventually to super-node scale-out via SST Merlin network simulation (Path B). The implementation follows a progressive architecture: single-process multi-GPU instances for initial phases (2–4 GPUs), transitioning to multi-process architecture for full 8-GPU hive scale. Only the vfio-user cosim backend is supported for multi-GPU; the legacy socket backend is not extended. Path A and Path B can proceed in parallel after the 2-GPU xGMI link model is validated.
- gem5 Process Model: Progressive — single gem5 process with multiple GPU instances for Milestone 1–2 (up to ~4 GPUs); multi-process (one gem5 container per GPU with IPC) introduced at Milestone 3 for 8-GPU scale.
- Dependency Relaxation: SST Merlin integration (Milestone 4) depends only on the 2-GPU xGMI baseline from Milestone 2, NOT on the 8-GPU hive (Milestone 3). Milestones 3 and 4 can proceed in parallel.
- Backend Scope: Multi-GPU is implemented exclusively on the vfio-user backend. The legacy socket backend is not extended for multi-GPU.
- Bandwidth Targets: 128 GB/s per link and ~310–330 GB/s aggregate are configurable model parameters calibrated to real hardware specs, NOT hard pass/fail acceptance criteria. Acceptance is based on correct data transfer and configurable timing, not measured throughput numbers.
- Hardware Characterization: Real hardware benchmarking (TransferBench, RCCL) is recommended for model calibration but not a gating prerequisite for any milestone.
- SST PoC: Milestone 4 includes a lightweight PoC sub-milestone before full integration.
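Because the bandwidth targets above are model parameters rather than pass/fail thresholds, it can help to see how such parameters might feed a timing estimate. The following is a minimal sketch, assuming a first-order "fixed latency plus serialization time" model; the plan does not prescribe this formula, and `LinkParams`, `transfer_time_ns`, and the 100 ns latency value are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkParams:
    """Hypothetical container for configurable xGMI link parameters."""
    bandwidth_bytes_per_s: float = 128e9  # 128 GB/s per link (plan default)
    latency_ns: float = 100.0             # assumed example value, not a measured figure


def transfer_time_ns(size_bytes: int, link: LinkParams) -> float:
    """First-order estimate: fixed link latency plus serialization time."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if link.bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    serialization_ns = size_bytes / link.bandwidth_bytes_per_s * 1e9
    return link.latency_ns + serialization_ns
```

Under these assumptions a 1 MiB transfer costs about 8192 ns of serialization plus the configured latency. Changing either parameter changes the estimate, which is the "configurable timing" property the acceptance criteria check.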
Following TDD philosophy, each criterion includes positive and negative tests for deterministic verification.
- AC-1: gem5 configuration supports instantiating N independent MI300X GPU timing models in a single process
  - Positive Tests (expected to PASS):
    - `mi300_cosim.py --num-gpus 2` creates 2 AMDGPUDevice instances, each with its own CU array, L1/L2 caches, and VRAM
    - `mi300_cosim.py --num-gpus 4` creates 4 independent GPU instances without resource conflicts
    - Each GPU instance has a unique `gpuId` and separate PM4/SDMA engine mappings
  - Negative Tests (expected to FAIL):
    - `mi300_cosim.py --num-gpus 0` is rejected with a validation error
    - Two GPU instances sharing the same VRAM shared memory region (must be independent `/dev/shm/mi300x-vram-{0..N}`)
    - BAR address ranges overlapping between GPU instances
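The negative tests for AC-1 (zero GPU count, shared VRAM regions, overlapping BARs) amount to a few configuration-time checks. A minimal sketch of such checks follows; the helper names `vram_shm_paths` and `find_overlap` are hypothetical and are not part of `mi300_cosim.py`.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # half-open [start, end) address range


def vram_shm_paths(num_gpus: int) -> List[str]:
    """Return one distinct VRAM shared-memory path per GPU instance."""
    if num_gpus < 1:
        raise ValueError(f"--num-gpus must be >= 1, got {num_gpus}")
    return [f"/dev/shm/mi300x-vram-{i}" for i in range(num_gpus)]


def find_overlap(ranges: List[Range]) -> Optional[Tuple[Range, Range]]:
    """Return the first pair of overlapping ranges, or None if all are disjoint."""
    ordered = sorted(ranges)
    # After sorting by start address, any overlap shows up between neighbours.
    for (s0, e0), (s1, e1) in zip(ordered, ordered[1:]):
        if s1 < e0:
            return (s0, e0), (s1, e1)
    return None
```

A configuration script could call `find_overlap` on the BAR ranges of all GPU instances and abort if it returns a pair.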
- AC-2: QEMU exposes N PCI GPU endpoints via vfio-user, each connected to a separate gem5 GPU instance
  - Positive Tests (expected to PASS):
    - QEMU launches with N `-device vfio-user-pci,socket=<path-N>` parameters, each connecting to the corresponding gem5 vfio-user server
    - Guest `lspci` shows N AMD GPU devices with correct vendor/device IDs (1002:74a0 or equivalent)
    - Each GPU's BAR layout (BAR0+1=VRAM, BAR2+3=Doorbell, BAR4=MSI-X, BAR5=MMIO) is independently mapped and non-overlapping
  - Negative Tests (expected to FAIL):
    - QEMU startup with a socket path that doesn't correspond to any gem5 GPU instance (connection refused)
    - Guest accessing BAR region of GPU 0 and seeing data from GPU 1 (isolation violation)
- AC-3: Per-GPU shared memory regions are independently allocated and accessible
  - Positive Tests (expected to PASS):
    - `/dev/shm/mi300x-vram-0`, `/dev/shm/mi300x-vram-1`, ... exist as separate files with correct sizes
    - Guest writes to GPU 0 VRAM do not appear in GPU 1 VRAM
    - Each GPU's VRAM is independently accessible from both QEMU and gem5
  - Negative Tests (expected to FAIL):
    - Writing to GPU 0 VRAM region corrupts GPU 1 VRAM content
- AC-4: Launch infrastructure supports multi-GPU configuration
  - Positive Tests (expected to PASS):
    - `cosim_launch.sh --num-gpus 2` starts gem5 with 2 GPU instances and QEMU with 2 vfio-user-pci devices
    - `cosim_launch.sh --num-gpus 1` (or default) behaves identically to current single-GPU setup (backward compatible)
  - Negative Tests (expected to FAIL):
    - `cosim_launch.sh --num-gpus 2` with insufficient shared memory allocation
- AC-5: Guest OS initializes all GPU instances via updated setup service
  - Positive Tests (expected to PASS):
    - `cosim-gpu-setup.service` iterates ROM DD + modprobe for each of N GPUs
    - `amdgpu` driver successfully probes all N GPU devices
    - Each GPU appears in `/sys/class/drm/` as a separate render node
  - Negative Tests (expected to FAIL):
    - ROM loaded to only one GPU while N>1 GPUs are present (remaining GPUs fail to initialize)
- AC-6: Each GPU instance can independently execute a compute kernel
  - Positive Tests (expected to PASS):
    - A simple HIP vector-add kernel runs on GPU 0 and produces correct results
    - The same kernel runs on GPU 1 and produces correct results independently
    - Two kernels running on different GPUs concurrently do not interfere
  - Negative Tests (expected to FAIL):
    - Kernel dispatched to GPU 0 executes on GPU 1's CU array
- AC-7: xGMI bridge component exists in gem5, attached to each GPU's L2 cache egress
  - Positive Tests (expected to PASS):
    - xGMI bridge SimObject is instantiated and connected to GPU's L2 cache controller
    - Bridge accepts memory requests targeting remote GPU address ranges
    - Bridge forwards packets with correct (src_gpu, dst_gpu, addr, size, payload) header
  - Negative Tests (expected to FAIL):
    - Memory request to local GPU VRAM being routed through xGMI bridge (local access should not traverse xGMI)
    - xGMI bridge forwarding packets with incorrect src_gpu/dst_gpu fields
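The (src_gpu, dst_gpu, addr, size, payload) header named in AC-7 can be illustrated with a small encode/decode pair. This is a sketch only: the field widths (16-bit GPU IDs, 64-bit address, 32-bit size), the little-endian layout, and the `XgmiPacket` class are assumptions made for illustration, since the plan leaves the packet format to the xGMI specification work in Milestone 2.

```python
import struct
from dataclasses import dataclass

# Assumed layout: src_gpu (u16), dst_gpu (u16), addr (u64), size (u32), little-endian.
_HEADER = struct.Struct("<HHQI")


@dataclass(frozen=True)
class XgmiPacket:
    src_gpu: int
    dst_gpu: int
    addr: int
    payload: bytes

    def encode(self) -> bytes:
        """Serialize the header followed by the payload bytes."""
        header = _HEADER.pack(self.src_gpu, self.dst_gpu, self.addr, len(self.payload))
        return header + self.payload

    @classmethod
    def decode(cls, raw: bytes) -> "XgmiPacket":
        """Parse a packet and reject truncated or inconsistent input."""
        if len(raw) < _HEADER.size:
            raise ValueError("truncated header")
        src_gpu, dst_gpu, addr, size = _HEADER.unpack_from(raw)
        payload = raw[_HEADER.size:]
        if len(payload) != size:
            raise ValueError(f"size field {size} does not match payload length {len(payload)}")
        return cls(src_gpu, dst_gpu, addr, payload)
```

A round trip through `encode` and `decode` gives a direct way to test that src_gpu and dst_gpu survive forwarding unchanged.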
- AC-8: xGMI link parameters are configurable at launch time
  - Positive Tests (expected to PASS):
    - `--xgmi-bandwidth 128GBps --xgmi-latency 100ns` configures the link model accordingly
    - `--xgmi-topology mesh` creates full-mesh connectivity between GPU instances
    - `--xgmi-topology ring` creates ring connectivity
    - Default parameters match MI300X hardware specs (128 GB/s, 16 lanes)
  - Negative Tests (expected to FAIL):
    - `--xgmi-bandwidth 0` is rejected (invalid parameter)
    - `--xgmi-topology star` is rejected (unsupported topology)
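To make the mesh and ring options in AC-8 concrete, the sketch below enumerates the bidirectional links each topology implies. The function name `build_links` is hypothetical. A full mesh of n GPUs has n(n-1)/2 links, which gives the 28 links cited for the 8-GPU hive.

```python
from itertools import combinations
from typing import List, Tuple


def build_links(num_gpus: int, topology: str) -> List[Tuple[int, int]]:
    """Return undirected (gpu_a, gpu_b) link pairs with gpu_a < gpu_b."""
    if num_gpus < 2:
        raise ValueError("an xGMI topology needs at least 2 GPUs")
    if topology == "mesh":
        # Every pair of GPUs is directly connected.
        return list(combinations(range(num_gpus), 2))
    if topology == "ring":
        if num_gpus == 2:
            return [(0, 1)]  # a 2-node ring degenerates to a single link
        return [tuple(sorted((i, (i + 1) % num_gpus))) for i in range(num_gpus)]
    raise ValueError(f"unsupported topology: {topology!r}")
```

Rejecting unknown names such as `star` with an explicit error matches the negative test above.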
- AC-9: GPU-to-GPU VRAM data transfer works correctly through xGMI
  - Positive Tests (expected to PASS):
    - GPU 0 writes a known pattern to GPU 1 VRAM via xGMI path; GPU 1 reads back the correct data
    - Bidirectional transfer: GPU 0→GPU 1 and GPU 1→GPU 0 simultaneously without corruption
  - Negative Tests (expected to FAIL):
    - Transfer to non-existent GPU ID (should return error/timeout, not silent corruption)
    - Data arriving at wrong address on destination GPU
- AC-10: Flow control prevents data loss under congestion
  - Positive Tests (expected to PASS):
    - Credit-based back-pressure stalls sender when receiver buffer is full
    - After back-pressure release, queued packets are delivered in order
  - Negative Tests (expected to FAIL):
    - Packets dropped silently when link is congested (must stall, not drop)
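The stall-not-drop behavior in AC-10 can be sketched with a credit counter: the sender spends one credit per packet and regains it when the receiver drains its buffer. The `CreditLink` class below is a hypothetical, untimed illustration of that idea and not the gem5 implementation.

```python
from collections import deque


class CreditLink:
    """Toy credit-based link: packets stall at the sender instead of being dropped."""

    def __init__(self, credits: int):
        if credits < 1:
            raise ValueError("credits must be >= 1")
        self._credits = credits      # free slots in the receiver buffer
        self._pending = deque()      # packets stalled at the sender, FIFO
        self._rx_buffer = deque()    # packets held by the receiver, FIFO

    def send(self, packet) -> bool:
        """Queue a packet; return True if it entered the link now, False if stalled."""
        stalled = bool(self._pending) or self._credits == 0
        self._pending.append(packet)
        self._pump()
        return not stalled

    def receive(self):
        """Consume one packet (or None) and return its credit to the sender."""
        if not self._rx_buffer:
            return None
        packet = self._rx_buffer.popleft()
        self._credits += 1
        self._pump()
        return packet

    def _pump(self) -> None:
        # Move stalled packets into the receiver buffer while credits remain.
        while self._pending and self._credits > 0:
            self._credits -= 1
            self._rx_buffer.append(self._pending.popleft())
```

With two credits, a third `send` reports a stall, and later `receive` calls still return every packet in the order it was sent.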
- AC-11: 8-GPU full-mesh xGMI topology operates correctly with multi-process architecture
  - Positive Tests (expected to PASS):
    - 8 gem5 processes (one per GPU) connect via IPC, forming 28 bidirectional xGMI links
    - Any GPU can access any other GPU's VRAM through xGMI
    - Global clock synchronization keeps all GPU timing models consistent
  - Negative Tests (expected to FAIL):
    - GPU-to-GPU transfer across 2+ hops in mesh (mesh is fully connected, all transfers are single-hop)
    - Timing inconsistency: GPU 0 observes event at simulated time T1, GPU 1 observes same event at different simulated time
- AC-12: SDMA engine supports xGMI copy path for GPU-to-GPU DMA
  - Positive Tests (expected to PASS):
    - SDMA copy command with remote GPU destination correctly transfers data via xGMI
    - SDMA completion interrupt fires after xGMI transfer finishes
  - Negative Tests (expected to FAIL):
    - SDMA xGMI copy targeting local VRAM uses xGMI path instead of local path
- AC-13: Multi-GPU workloads execute correctly across the hive
  - Positive Tests (expected to PASS):
    - RCCL allreduce across 8 GPUs produces mathematically correct results
    - RCCL allgather collects data from all 8 GPUs correctly
  - Negative Tests (expected to FAIL):
    - Collective operation completes with incorrect data on any participating GPU
- AC-14: Documentation covers xGMI model design
  - Positive Tests (expected to PASS):
    - `docs/en/xgmi-model.md` and `docs/zh/xgmi-model.md` exist with mutual cross-links
    - Documentation covers packet format, topology configuration, and calibration parameters
  - Negative Tests (expected to FAIL):
    - Documentation exists in only one language (must have both `en/` and `zh/` versions)
- AC-15: Lightweight PoC demonstrates gem5 GPU ↔ SST basic communication
  - Positive Tests (expected to PASS):
    - A single gem5 GPU instance sends a test message through an SST Merlin network endpoint and receives a response
    - SST simulation event loop and gem5 event scheduler co-execute without deadlock
  - Negative Tests (expected to FAIL):
    - gem5 and SST event loops diverge in simulated time by more than one synchronization quantum
- AC-16: SST wrapper component encapsulates gem5 GPU timing model as SST endpoint
  - Positive Tests (expected to PASS):
    - gem5 GPU model registers as an SST component with correct network interface
    - SST Merlin routes packets between two gem5 GPU endpoints
  - Negative Tests (expected to FAIL):
    - SST component initialization fails due to missing gem5 library dependencies
- AC-17: Three-layer synchronization protocol coordinates QEMU, gem5, and SST
  - Positive Tests (expected to PASS):
    - QEMU (real-time KVM) → gem5 (GPU timing) → SST (network timing) chain maintains causal ordering
    - Guest driver initiates GPU-to-GPU transfer; timing flows through all three layers correctly
  - Negative Tests (expected to FAIL):
    - SST network event processed before corresponding gem5 GPU event (causality violation)
- AC-18: 2-GPU SST Merlin path produces comparable results to Path A xGMI model
  - Positive Tests (expected to PASS):
    - Same GPU-to-GPU transfer test produces identical data results through SST Merlin vs. Path A xGMI
    - SST Merlin configured with equivalent parameters (128 GB/s, matching latency) shows similar timing behavior
  - Negative Tests (expected to FAIL):
    - Data corruption when routing through SST Merlin (functional correctness must match Path A)
- AC-19: Multi-node topology support via SST Merlin
  - Positive Tests (expected to PASS):
    - Fat-tree topology with 2 nodes (each containing 8 GPUs) routes inter-node traffic correctly
    - Dragonfly topology configuration accepted and functional
  - Negative Tests (expected to FAIL):
    - Inter-node packet routed within a single node (must traverse NIC model)
- AC-20: Hybrid intra-node xGMI + inter-node Ethernet topology operates correctly
  - Positive Tests (expected to PASS):
    - Intra-node GPU communication uses xGMI path (low latency)
    - Inter-node GPU communication uses NIC/Ethernet path (higher latency)
    - RCCL collective across nodes produces correct results
  - Negative Tests (expected to FAIL):
    - Inter-node traffic bypassing NIC model and using xGMI directly
Path boundaries define the acceptable range of implementation quality and choices.
At the upper bound (maximum acceptable scope), the implementation includes:
- Full 8-GPU hive with multi-process gem5 architecture, global time synchronization, and 28 bidirectional xGMI links with credit-based flow control
- Complete SST Merlin integration with three-layer co-simulation synchronization, supporting both fat-tree and dragonfly topologies
- Multi-node scale-out (up to 8 nodes) with hybrid xGMI + Ethernet interconnect and RCCL collective benchmarking
- Simplified Atlas switch model for cross-chassis xGMI extension
- Performance profiling and bottleneck analysis tooling
- Full bilingual documentation (en + zh)
- Integration with job scheduler / workload manager for multi-node runs
At the lower bound (minimum acceptable scope), the implementation includes:
- 2-GPU single-process co-simulation with independent VRAM, correct PCI enumeration, and per-GPU kernel execution (Milestone 1)
- Basic xGMI link model with configurable bandwidth/latency between 2 GPUs, functional data transfer verified (Milestone 2)
- One of: 8-GPU hive (Milestone 3) OR SST Merlin 2-GPU PoC (Milestone 4 PoC sub-milestone)
- Can use:
  - vfio-user protocol for QEMU ↔ gem5 communication (required)
  - POSIX shared memory for per-GPU VRAM regions
  - Unix domain sockets or shared memory ring buffers for xGMI transport (single-process)
  - IPC (Unix sockets, shared memory, or MPI) for multi-process GPU-to-GPU communication
  - gem5 Ruby cache protocol extensions for L2 egress bridge
  - gem5 SimpleNetwork or custom network model for xGMI links
  - SST Merlin for network topology simulation (Milestone 4+)
  - Docker containers for gem5 process isolation
- Cannot use:
  - Legacy cosim socket backend for multi-GPU (single-GPU legacy remains untouched)
  - gem5 GarnetNetwork for xGMI (designed for on-chip, not inter-chip)
  - Modifications to upstream QEMU source code for multi-device support (use QEMU CLI parameters only)
  - Hardcoded GPU count assumptions (must be parameterized via `--num-gpus`)
Note: This section is for reference and understanding only. These are conceptual suggestions, not prescriptive requirements.
Milestone 1 — Multi-GPU in single gem5 process:
```python
# Pseudocode for mi300_cosim.py extension
for gpu_id in range(args.num_gpus):
    gpu_device = AMDGPUDevice(gpu_id=gpu_id)
    shader = createGPU(system, args, gpu_id)  # CU/L1/L2 per GPU
    vram_shmem = f"/mi300x-vram-{gpu_id}"
    socket_path = f"{args.socket_base}-{gpu_id}.sock"
    cosim_bridge = MI300XVfioUser(
        gpu_device=gpu_device,
        socket_path=socket_path,
        shmem_path=vram_shmem,
        vram_size=args.dgpu_mem_size,
    )
    system.gpu_devices.append(gpu_device)
    system.cosim_bridges.append(cosim_bridge)
```
QEMU side: add N `-device vfio-user-pci,socket=/tmp/gem5-mi300x-{N}.sock` parameters in `cosim_launch.sh`.
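The per-GPU QEMU arguments follow mechanically from the GPU count. The sketch below generates them in Python for illustration only; the real logic would live in the shell launch script, and `vfio_user_device_args` is a hypothetical helper. The socket naming follows the `{socket_base}-{gpu_id}.sock` pattern used in the pseudocode above.

```python
from typing import List


def vfio_user_device_args(num_gpus: int, socket_base: str = "/tmp/gem5-mi300x") -> List[str]:
    """Build one '-device vfio-user-pci,socket=...' argument pair per GPU."""
    if num_gpus < 1:
        raise ValueError(f"num_gpus must be >= 1, got {num_gpus}")
    args: List[str] = []
    for gpu_id in range(num_gpus):
        # Each GPU connects to its own gem5 vfio-user server socket.
        args += ["-device", f"vfio-user-pci,socket={socket_base}-{gpu_id}.sock"]
    return args
```

Because the socket names are derived from a single parameter, the same helper covers the backward-compatible single-GPU launch and the multi-GPU launch.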
Milestone 2 — xGMI bridge at L2 egress:
The xGMI bridge intercepts memory requests at L2 cache egress. When the target address falls in a remote GPU's VRAM range, the request is forwarded through the xGMI link model instead of local memory.
```
L2 Cache → [address check] → local VRAM (if local)
                           → xGMI Bridge → Transport → Remote GPU L2 → Remote VRAM (if remote)
```
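The address check in the diagram can be sketched as a lookup over per-GPU VRAM windows. This assumes a flat global address map in which every GPU's VRAM occupies one contiguous window; the plan does not fix the addressing scheme, and `VramWindow` and `route` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VramWindow:
    gpu_id: int
    base: int
    size: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


def route(addr: int, local_gpu: int, windows: List[VramWindow]) -> str:
    """Return 'local' or 'xgmi:<dst_gpu>' for a request leaving the L2 cache."""
    for window in windows:
        if window.contains(addr):
            if window.gpu_id == local_gpu:
                return "local"  # local VRAM must not traverse the bridge
            return f"xgmi:{window.gpu_id}"
    # Unmapped targets are reported rather than silently forwarded.
    raise LookupError(f"address {addr:#x} is not in any GPU VRAM window")
```

Raising on unmapped addresses mirrors the requirement that a transfer to a non-existent GPU produces an error instead of silent corruption.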
Milestone 3 — Multi-process transition:
Each GPU runs in a separate gem5 Docker container. A synchronization daemon manages global virtual time across all gem5 processes. xGMI transport uses shared memory ring buffers or Unix sockets for inter-process packet delivery.
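One way a synchronization daemon could bound skew between processes is a conservative quantum scheme: no process may run further than one quantum ahead of the slowest one. The sketch below illustrates only that rule. The algorithm choice and the name `grant_advances` are assumptions, as the plan does not specify how global virtual time is managed.

```python
from typing import Dict


def grant_advances(
    local_ns: Dict[int, int],
    requested_ns: Dict[int, int],
    quantum_ns: int,
) -> Dict[int, int]:
    """Clamp each process's requested time to (slowest local time + quantum)."""
    if quantum_ns <= 0:
        raise ValueError("quantum_ns must be positive")
    if not local_ns:
        raise ValueError("at least one process is required")
    horizon = min(local_ns.values()) + quantum_ns
    granted: Dict[int, int] = {}
    for gpu_id, now in local_ns.items():
        target = min(requested_ns[gpu_id], horizon)
        granted[gpu_id] = max(now, target)  # simulated time never moves backward
    return granted
```

If all processes start at the same simulated time, repeated rounds keep the spread between the fastest and slowest process at or below one quantum.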
Milestone 4 — SST PoC approach:
Start with a minimal integration: wrap a single gem5 GPU as an SST SubComponent, connect to a trivial Merlin network (2 endpoints, single link). Validate that SST's event loop and gem5's event scheduler can co-execute without deadlock or time drift. Only after PoC success, proceed to full 2-GPU integration with three-layer synchronization.
- `gem5/configs/example/gpufs/mi300_cosim.py`: Current single-GPU cosim configuration, the primary file to extend for multi-GPU
- `gem5/src/dev/amdgpu/mi300x_vfio_user.hh/.cc`: vfio-user cosim bridge implementation, one instance needed per GPU
- `gem5/src/dev/amdgpu/amdgpu_device.hh`: AMDGPUDevice with gpuId, PM4/SDMA mappings; instantiate N times
- `gem5/src/dev/amdgpu/MI300XVfioUser.py`: SimObject definition for vfio-user bridge
- `scripts/cosim_launch.sh`: Launch orchestration; add `--num-gpus` parameter and multi-socket logic
- `scripts/cosim_guest_setup.sh`: Guest GPU init; iterate over N GPUs for ROM DD + modprobe
- `gem5/src/mem/ruby/`: Ruby cache protocol; L2 egress is the xGMI bridge attachment point
- MGPUSim: Multi-GPU simulator reference for pluggable interconnect design
- SST + Balar: GPU integration in SST framework
- gem5 + SST: gem5 as SST component, synchronization patterns
- Milestone 1: Multi-GPU Instance Infrastructure (single-process, no interconnect)
  - Section A: gem5 config extension - parameterize `mi300_cosim.py` for N GPU instances with separate VRAM/PM4/SDMA
  - Section B: QEMU + launch scripts - multi-socket vfio-user launch, per-GPU shared memory allocation
  - Section C: Guest initialization - update `cosim-gpu-setup.service` for N-GPU ROM DD + modprobe iteration
  - Section D: Validation - PCI enumeration, BAR isolation, independent kernel execution per GPU
- Milestone 2: xGMI Link Model (2-GPU hive, single-process)
  - Section A: xGMI specification - packet format, addressing scheme, topology configuration interface
  - Section B: gem5 bridge implementation - L2 egress bridge, xGMI transport (in-process), flow control
  - Section C: Launch integration - `--xgmi-topology` and `--xgmi-bandwidth` parameters
  - Section D: Validation - bidirectional VRAM transfer, data correctness, configurable timing verification
- Milestone 3: Full 8-GPU Hive (multi-process architecture, Path A complete)
  - Section A: Multi-process architecture - per-GPU gem5 container, IPC transport for xGMI, global time synchronization
  - Section B: SDMA xGMI copy path - GPU-to-GPU DMA through SDMA engine
  - Section C: Workload validation - RCCL collective communication (allreduce, allgather)
  - Section D: Optional extensions - simplified Atlas switch model, performance profiling tooling
  - Section E: Documentation - bilingual xGMI model design docs
- Milestone 4: SST Merlin Integration (Path B foundation)
  - Section A: PoC - gem5 single-GPU as SST SubComponent, minimal Merlin network, event loop co-execution validation
  - Section B: Full integration - SST wrapper for gem5 GPU, three-layer synchronization protocol (QEMU ↔ gem5 ↔ SST)
  - Section C: Validation - 2-GPU SST Merlin path, comparison with Milestone 2 xGMI baseline
  - Section D: Benchmarking - co-sim overhead analysis vs. direct transport
- Milestone 5: Super-Node Scale-Out (Path B complete)
  - Section A: Multi-node topology - fat-tree, dragonfly via SST Merlin
  - Section B: NIC model - Ultra Ethernet / RoCE for inter-node path
  - Section C: Hybrid topology - intra-node xGMI mesh + inter-node Ethernet
  - Section D: Validation - multi-node RCCL collectives, scalability testing (2/4/8-node)
```
Milestone 1 (Multi-GPU Instances)
    │
    v
Milestone 2 (xGMI 2-GPU Link Model)
    │
    ├──────────────────────────┐
    v                          v
Milestone 3 (8-GPU Hive)   Milestone 4 (SST Merlin)
    │                          │
    │   ┌──────────────────────┘
    v   v
Milestone 5 (Super-Node Scale-Out)
```
Key dependency changes from original draft:
- Milestone 4 depends on Milestone 2 only (relaxed from original Phase 3→4 dependency)
- Milestone 3 and Milestone 4 can proceed in parallel
- Milestone 5 depends on both Milestone 3 (8-GPU hive validated) and Milestone 4 (SST integration proven)
- Milestone 1 Section A (gem5 multi-instance) must complete before Section B (QEMU multi-socket)
- Milestone 2 Section A (xGMI spec) must complete before Section B (bridge implementation)
- Milestone 3 Section A (multi-process architecture) is the highest-risk item — it changes the fundamental execution model
- Milestone 4 Section A (PoC) is a gate: if PoC fails, the full SST integration approach needs re-evaluation
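The milestone dependencies listed above form a small directed acyclic graph, and the stages that can run in parallel fall out of a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the short labels `M1` to `M5` and the helper `parallel_stages` are illustrative.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each key depends on the milestones in its value set.
DEPENDS_ON: Dict[str, Set[str]] = {
    "M1": set(),
    "M2": {"M1"},
    "M3": {"M2"},
    "M4": {"M2"},
    "M5": {"M3", "M4"},
}


def parallel_stages(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into stages; nodes in one stage have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages
```

For the graph above, the third stage contains both `M3` and `M4`, which reflects the relaxed dependency that lets the 8-GPU hive and the SST Merlin work proceed in parallel.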
- Implementation code and comments must NOT contain plan-specific terminology such as "AC-", "Milestone", "Step", "Phase", or similar workflow markers
- These terms are for plan documentation only, not for the resulting codebase
- Use descriptive, domain-appropriate naming in code instead
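A rule like this is easy to check mechanically, for example in a pre-commit hook. The following is a minimal sketch of such a check. The regular expression is an assumption about what counts as a workflow marker: it only flags "Milestone", "Phase", or "Step" when followed by a number, to avoid false positives on ordinary phrases such as "time step".

```python
import re
from typing import List, Tuple

# Assumed pattern: "AC-<n>" or "Milestone/Phase/Step <n>", case-insensitive.
_PLAN_TERMS = re.compile(r"\bAC-\d+\b|\b(?:Milestone|Phase|Step)\s+\d+\b", re.IGNORECASE)


def find_plan_terms(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, matched_text) for each plan-specific marker found."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PLAN_TERMS.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

A stricter project could drop the number requirement at the cost of more false positives.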
- The transition from single-process to multi-process (Milestone 3) is the most significant architectural change. The single-process model from Milestones 1–2 should be designed with this transition in mind — e.g., xGMI transport should use an abstract interface that can switch between in-process function calls and IPC
- PCI topology: QEMU Q35 chipset has limited root port count. For 8 GPUs, PCIe switches or additional root ports may be needed. Investigate Q35 capacity early in Milestone 1
- amdgpu driver multi-GPU: the cosim environment uses `ip_block_mask=0x67`, which disables PSP and SMU. Verify that xGMI topology discovery works without these blocks, or implement a minimal xGMI discovery stub
- Guest RAM shared memory (`/dev/shm/cosim-guest-ram`) is shared across all GPUs; only VRAM is per-GPU
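The first note in this list suggests an abstract transport interface that can switch between in-process calls and IPC. A minimal sketch of that idea follows, with a queue-backed implementation and a socket-backed one behind the same interface. All class names are hypothetical, and the 4-byte length prefix used for framing is an assumption.

```python
import socket
import struct
from abc import ABC, abstractmethod
from collections import deque


class XgmiTransport(ABC):
    """Interface the bridge could depend on, independent of the delivery mechanism."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def recv(self) -> bytes: ...


class InProcessTransport(XgmiTransport):
    """Single-process variant: packets pass through an in-memory FIFO."""

    def __init__(self) -> None:
        self._queue = deque()

    def send(self, payload: bytes) -> None:
        self._queue.append(bytes(payload))

    def recv(self) -> bytes:
        if not self._queue:
            raise BlockingIOError("no packet pending")
        return self._queue.popleft()


class SocketTransport(XgmiTransport):
    """IPC variant: length-prefixed frames over a connected stream socket."""

    _LEN = struct.Struct("<I")

    def __init__(self, tx: socket.socket, rx: socket.socket) -> None:
        self._tx, self._rx = tx, rx

    def send(self, payload: bytes) -> None:
        self._tx.sendall(self._LEN.pack(len(payload)) + payload)

    def recv(self) -> bytes:
        (length,) = self._LEN.unpack(self._read_exact(self._LEN.size))
        return self._read_exact(length)

    def _read_exact(self, count: int) -> bytes:
        data = bytearray()
        while len(data) < count:
            chunk = self._rx.recv(count - len(data))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            data += chunk
        return bytes(data)
```

Code written against `XgmiTransport` in the single-process milestones would not need to change when the socket-backed variant is swapped in for the multi-process architecture.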