This guide reproduces the full-system GPU simulation of the AMD Instinct MI300X on the cosim branch from scratch,
up to the point where the square test passes.
| Requirement | Description |
|---|---|
| Host OS | Linux x86_64 with KVM support (verified on WSL2 6.6.x) |
| Docker | Daemon running, current user in docker group |
| KVM | /dev/kvm accessible (required for both disk image build and simulation) |
| QEMU | qemu-system-x86_64 installed (used by Packer to build disk images) |
| Disk space | At least 120 GB free (55G disk image + build intermediates) |
| Tools | git, unzip, guestfish (optional, for disk image verification) |
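
The requirements above can be checked up front with a short script. This is a minimal sketch, not part of the repository: the function name `check_prereqs` and its defaults are hypothetical, and it does not verify that the Docker daemon is running or that the current user is in the docker group.

```python
import os
import shutil

# Tool names taken from the requirements table; guestfish is optional.
REQUIRED_TOOLS = ["docker", "qemu-system-x86_64", "git", "unzip"]
OPTIONAL_TOOLS = ["guestfish"]


def check_prereqs(workdir=".", min_free_gb=120, kvm_path="/dev/kvm"):
    """Return a dict describing which host prerequisites look satisfied."""
    free_gb = shutil.disk_usage(workdir).free / 1e9
    return {
        # KVM is needed for both the disk image build and the simulation.
        "kvm_accessible": os.access(kvm_path, os.R_OK | os.W_OK),
        "missing_tools": [t for t in REQUIRED_TOOLS if shutil.which(t) is None],
        "missing_optional": [t for t in OPTIONAL_TOOLS if shutil.which(t) is None],
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for key, value in check_prereqs().items():
        print(f"{key}: {value}")
```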
| Image | Purpose |
|---|---|
| `ghcr.io/gem5/gpu-fs:latest` | Base image for gem5 runtime container (amd64) |
| `gem5-run:local` | Runtime image built from scripts/Dockerfile.run |
| `ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0` | gem5 compilation (arm64 only, see note below) |
Note: `ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0` is arm64 only. On amd64 hosts, use `ghcr.io/gem5/gpu-fs` as the build image or compile natively. You can override the default image by setting the `GEM5_BUILD_IMAGE` environment variable.
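
The image choice described in the note can be sketched as follows. How `run_mi300x_fs.sh` actually picks its default is not shown here, so the fallback order and the function name `pick_build_image` are assumptions; only the image names and the `GEM5_BUILD_IMAGE` variable come from this document.

```python
import os
import platform

GPU_FS_IMAGE = "ghcr.io/gem5/gpu-fs"
ALL_DEPS_IMAGE = "ghcr.io/gem5/ubuntu-24.04_all-dependencies:v24-0"  # arm64 only


def pick_build_image(env=None, machine=None):
    """Pick a build image: explicit override first, then by host architecture."""
    env = os.environ if env is None else env
    machine = platform.machine() if machine is None else machine

    override = env.get("GEM5_BUILD_IMAGE")
    if override:
        return override
    if machine in ("aarch64", "arm64"):
        return ALL_DEPS_IMAGE
    # Assumed fallback for amd64 hosts, per the note above.
    return GPU_FS_IMAGE
```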
/home/zevorn/cosim/
  gem5/                                  # gem5 source (cosim branch)
    build/VEGA_X86/gem5.opt              # gem5 binary
    configs/example/
      gem5_library/x86-mi300x-gpu.py     # stdlib config
      gpufs/mi300.py                     # legacy config
    scripts/
      run_mi300x_fs.sh                   # orchestration script
      Dockerfile.run                     # runtime Docker image
  gem5-resources/                        # disk images, kernels, GPU apps
    src/x86-ubuntu-gpu-ml/
      disk-image/x86-ubuntu-rocm70       # 55G raw disk image
      vmlinux-rocm70                     # extracted kernel
    src/gpu/square/                      # square test app
  docs/                                  # documentation
  qemu/                                  # QEMU source (cosim device)
    build/qemu-system-x86_64
cd /home/zevorn/cosim/gem5
./scripts/run_mi300x_fs.sh build-gem5

This command runs `scons build/VEGA_X86/gem5.opt` inside Docker.
Output: build/VEGA_X86/gem5.opt (approximately 1.1 GB).
Manual build without Docker:
scons build/VEGA_X86/gem5.opt -j$(nproc)

./scripts/run_mi300x_fs.sh build-qemu

Requires the QEMU source at `../qemu/`. The script configures QEMU with `--target-list=x86_64-softmmu` and builds it.
Output: ../qemu/build/qemu-system-x86_64.
./scripts/run_mi300x_fs.sh build-disk
# If gem5-resources does not exist, it is cloned automatically before the disk image build begins

Or clone manually:
cd /home/zevorn/cosim
git clone --depth 1 https://github.com/gem5/gem5-resources.git gem5-resources

The disk image build uses Packer + QEMU/KVM to install Ubuntu 24.04.2 + ROCm 7.0 + kernel 6.8.0-79-generic with all required DKMS modules.

./scripts/run_mi300x_fs.sh build-disk

Or build manually:

cd ../gem5-resources/src/x86-ubuntu-gpu-ml
# Download Packer and build
./build.sh -var "qemu_path=/usr/sbin/qemu-system-x86_64"

Important: The default `qemu_path` in `x86-ubuntu-gpu-ml.pkr.hcl` is `/usr/bin/qemu-system-x86_64`. Some distributions (e.g., Arch) install it at `/usr/sbin/qemu-system-x86_64`, which requires overriding with `-var`.
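
Rather than hard-coding the path, the override can be derived from wherever QEMU is found on `PATH`. The helper below is a hypothetical sketch (the name `packer_qemu_args` is not part of gem5-resources); it only builds the `-var` argument shown above and leaves the call to `build.sh` to the caller.

```python
import shutil

# Default qemu_path from x86-ubuntu-gpu-ml.pkr.hcl, as stated above.
PACKER_DEFAULT = "/usr/bin/qemu-system-x86_64"


def packer_qemu_args(which=shutil.which):
    """Return the extra build.sh arguments needed for this host, if any."""
    found = which("qemu-system-x86_64")
    if found is None:
        raise FileNotFoundError("qemu-system-x86_64 is not on PATH")
    if found == PACKER_DEFAULT:
        return []  # the Packer default already matches
    return ["-var", f"qemu_path={found}"]
```

One possible use is `subprocess.run(["./build.sh", *packer_qemu_args()], check=True)` from the `x86-ubuntu-gpu-ml` directory.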
- Boot Ubuntu 24.04.2 ISO via QEMU/KVM for unattended installation
- Run `scripts/rocm-install.sh`, which performs the following in order:
  - Compile and install the `m5` tool from gem5 source (`/sbin/m5`)
  - Install ROCm 7.0 from `repo.radeon.com/amdgpu/7.0/ubuntu`
  - Install `amdgpu-dkms` (compile DKMS kernel modules)
  - Install kernel `6.8.0-79-generic` and corresponding headers
  - Extract `vmlinux` kernel for gem5 use
  - Compile `gem5_wmi.ko` (ACPI patch module)
  - Install PyTorch (ROCm 6.0 support)
- Copy GPU BIOS ROM (`mi300.rom`), IP discovery files, and boot scripts into the image
- Download the extracted kernel from the VM as `vmlinux-rocm70`
| Artifact | Path | Size |
|---|---|---|
| Disk image | `disk-image/x86-ubuntu-rocm70` | ~55 GB |
| Kernel | `vmlinux-rocm70` | ~64 MB |
The build takes approximately 30-60 minutes, depending on network speed and host performance.
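
After the build finishes, it can help to confirm that both artifacts exist before moving on. The following is a small sketch under the assumption that it is pointed at the `x86-ubuntu-gpu-ml` directory; `check_artifacts` is a hypothetical helper, and it reports sizes rather than enforcing the approximate figures in the table.

```python
from pathlib import Path

# Relative artifact paths from the table above.
ARTIFACTS = ["disk-image/x86-ubuntu-rocm70", "vmlinux-rocm70"]


def check_artifacts(base_dir):
    """Map each artifact path to its size in bytes, or None if it is missing."""
    base = Path(base_dir)
    report = {}
    for rel in ARTIFACTS:
        path = base / rel
        report[rel] = path.stat().st_size if path.is_file() else None
    return report
```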
Use guestfish to inspect disk image contents without mounting:
LIBGUESTFS_BACKEND=direct guestfish --ro \
-a disk-image/x86-ubuntu-rocm70 -m /dev/sda1 <<'EOF'
echo "=== DKMS modules ==="
ls /lib/modules/6.8.0-79-generic/updates/dkms/
echo "=== ROCm version ==="
cat /opt/rocm/.info/version
echo "=== load_amdgpu.sh ==="
cat /home/gem5/load_amdgpu.sh
echo "=== m5 binary ==="
is-file /sbin/m5
echo "=== gem5_wmi module ==="
is-file /home/gem5/gem5_wmi.ko
EOF

Expected DKMS module list (all dependencies for the amdgpu driver):
amd-sched.ko.zst
amddrm_buddy.ko.zst
amddrm_exec.ko.zst # Critical module -- missing in older builds
amddrm_ttm_helper.ko.zst
amdgpu.ko.zst
amdkcl.ko.zst
amdttm.ko.zst
amdxcp.ko.zst
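
To compare the guestfish listing against the expected list mechanically, the `ls` output can be pasted into a small checker. This is an illustrative sketch; `missing_dkms_modules` is a hypothetical name, and the expected set is copied from the list above.

```python
EXPECTED_DKMS_MODULES = {
    "amd-sched.ko.zst",
    "amddrm_buddy.ko.zst",
    "amddrm_exec.ko.zst",  # the module reported missing in older builds
    "amddrm_ttm_helper.ko.zst",
    "amdgpu.ko.zst",
    "amdkcl.ko.zst",
    "amdttm.ko.zst",
    "amdxcp.ko.zst",
}


def missing_dkms_modules(listing):
    """Return the expected modules absent from a directory listing (one name per line)."""
    present = {line.strip() for line in listing.splitlines() if line.strip()}
    return sorted(EXPECTED_DKMS_MODULES - present)
```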
./scripts/run_mi300x_fs.sh build-app square

Compiles using Docker (`ghcr.io/gem5/gpu-fs`) or local hipcc.
Output: ../gem5-resources/src/gpu/square/bin.default/square.default.
The gem5 binary is linked against Ubuntu 24.04 libraries and requires a compatible runtime environment:
cd scripts
docker build -t gem5-run:local -f Dockerfile.run .

Run the simulation from the gem5 root directory:

./scripts/run_mi300x_fs.sh run \
    ../gem5-resources/src/gpu/square/bin.default/square.default

Important: The `--app` parameter must be specified. Without it, `readfile_contents` is an empty string `""`, which Python evaluates as falsy, so `KernelDiskWorkload._set_readfile_contents` is never called, and the amdgpu driver in the guest is never loaded.
./scripts/run_mi300x_fs.sh run-legacy \
    ../gem5-resources/src/gpu/square/bin.default/square.default

- KVM fast-boot phase (~2-5 minutes): gem5 uses KVM to fast-forward Linux boot. Guest kernel boots, systemd initializes, and auto-login as root occurs.
- readfile execution: The guest runs `/home/gem5/run_gem5_app.sh` via `.bashrc`, which calls `m5 readfile` to retrieve the host-injected script.
- Driver loading: The script writes the GPU BIOS ROM to `/dev/mem`, creates symlinks for IP discovery files, then runs `load_amdgpu.sh` to insmod all DKMS modules in dependency order.
- GPU application execution: The script decodes the base64-encoded GPU binary, runs it, then calls `m5 exit` to end the simulation.
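
The last step relies on the host embedding the GPU binary in the readfile script as base64 text. The sketch below illustrates only that embedding idea; it is not the script that ViperBoard generates, it omits the driver-loading steps, and both `make_readfile_script` and the guest path `/home/gem5/app` are hypothetical.

```python
import base64


def make_readfile_script(app_bytes, guest_path="/home/gem5/app"):
    """Build a guest shell script that recreates a binary from base64 and runs it."""
    # The base64 alphabet contains no single quotes, so single-quoting is safe.
    payload = base64.b64encode(app_bytes).decode("ascii")
    return "\n".join([
        "#!/bin/bash",
        f"echo '{payload}' | base64 -d > {guest_path}",
        f"chmod +x {guest_path}",
        guest_path,
        "/sbin/m5 exit",  # ends the simulation once the application returns
        "",
    ])
```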
Guest serial console output is written to m5out/board.pc.com_1.device:
tail -f m5out/board.pc.com_1.device

Expected output:

3+0 records in
3+0 records out
3072 bytes (3.1 kB, 3.0 KiB) copied, ...
info: running on device AMD Instinct MI300X
info: allocate host and device mem ( 7.63 MB)
info: launch 'vector_square' kernel
info: check result
PASSED!
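
For scripted runs, the same check can be done by scanning the serial log for the final `PASSED!` line. This is a minimal sketch; `square_passed` and `check_log` are hypothetical helpers, and the default path is the serial console file named above.

```python
from pathlib import Path


def square_passed(log_text):
    """Return True if the serial log contains a line that is exactly 'PASSED!'."""
    # strip() also removes the carriage returns that serial consoles often emit.
    return any(line.strip() == "PASSED!" for line in log_text.splitlines())


def check_log(path="m5out/board.pc.com_1.device"):
    """Read the serial console log and report whether the square test passed."""
    return square_passed(Path(path).read_text(errors="replace"))
```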
Root cause: The disk image is missing the amddrm_exec.ko.zst DKMS module. Without this module,
the amdgpu TTM memory manager fails to initialize, drm_dev_enter() finds the device in an
"unplugged" state, and returns -EACCES (-13). The subsequent cleanup path triggers a NULL pointer
dereference in ttm_resource_move_to_lru_tail.
Fix: Rebuild the disk image using the latest gem5-resources (origin/stable branch).
The updated rocm-install.sh installs kernel 6.8.0-79-generic, which fully matches
the ROCm 7.0 DKMS packages and includes all required modules.
Verification: Use guestfish to confirm that amddrm_exec.ko.zst exists in
/lib/modules/6.8.0-79-generic/updates/dkms/.
Harmless warning. The m5 tool first attempts the gem5_bridge device driver, and falls back to
address-mapped MMIO mode (available when running as root) on failure. The readfile mechanism
still works correctly.
A leftover disk-image/ directory from a previous build blocks Packer:
mv disk-image disk-image-old
# Then re-run the build

Network issues inside the QEMU VM can cause git clone to fail. The rocm-install.sh script has
built-in retry logic (3 attempts, 10-second intervals). If it still fails, check the host network
connectivity and DNS resolution.
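
The retry policy described above (3 attempts, 10 seconds apart) looks roughly like the following. This is a Python illustration of the pattern, not the shell code in `rocm-install.sh`; the function name and its injectable `run` and `sleep` parameters are assumptions made so the sketch is easy to test.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay_s=10, run=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0; return the attempt number that succeeded."""
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # wait before the next attempt, but not after the last
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")
```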
When running with x86-mi300x-gpu.py without the --app parameter, readfile_contents is
an empty string "". Python's truthiness check elif readfile_contents: evaluates to False,
so _set_readfile_contents is never called and the readfile is not written. The guest's
run_gem5_app.sh receives an empty file from m5 readfile and exits immediately.
Solution: Always specify the --app parameter when running GPU simulations.
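
The truthiness behaviour behind this failure can be reproduced in isolation. The function below is a simplified stand-in for the control flow, not the gem5 source: the `elif readfile_contents:` branch is taken from the description above, while the preceding `if readfile:` branch and the return strings are assumptions for illustration.

```python
def choose_readfile_action(readfile=None, readfile_contents=""):
    """Mimic a branch that only writes the readfile when given non-empty input."""
    if readfile:
        return "set_readfile"
    elif readfile_contents:
        # An empty string is falsy, so this branch is skipped without --app.
        return "set_readfile_contents"
    return "nothing written"


print(choose_readfile_action(readfile_contents=""))         # no --app given
print(choose_readfile_action(readfile_contents="#!/bin/bash"))
```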
DRAM device capacity (16384 Mbytes) does not match the address range assigned (8192 Mbytes)
This is a configuration warning from the gem5 memory system and does not affect simulation correctness.
| File | Purpose |
|---|---|
| `scripts/run_mi300x_fs.sh` | Main orchestration script |
| `scripts/Dockerfile.run` | Runtime Docker image definition |
| `configs/example/gem5_library/x86-mi300x-gpu.py` | stdlib simulation config |
| `configs/example/gpufs/mi300.py` | Legacy simulation config |
| `src/python/gem5/prebuilt/viper/board.py` | ViperBoard: readfile injection, driver loading |
| `src/python/gem5/components/devices/gpus/amdgpu.py` | MI300X device definition |
| `src/dev/amdgpu/amdgpu_device.cc` | GPU device model core (modified in cosim branch) |
| `../gem5-resources/src/x86-ubuntu-gpu-ml/scripts/rocm-install.sh` | Disk image configuration script |
| `../gem5-resources/src/x86-ubuntu-gpu-ml/files/load_amdgpu.sh` | Guest-side driver loading script |
| `../gem5-resources/src/x86-ubuntu-gpu-ml/x86-ubuntu-gpu-ml.pkr.hcl` | Packer configuration |
| Component | Version |
|---|---|
| Guest OS | Ubuntu 24.04.2 LTS |
| Guest kernel | 6.8.0-79-generic |
| ROCm | 7.0.0 |
| amdgpu DKMS | Matches ROCm 7.0 |
| gem5 build target | VEGA_X86 |
| GPU device | MI300X (DeviceID 0x74A1) |
| Coherence protocol | GPU_VIPER |