The AMD Container Toolkit is designed to integrate smoothly into Docker-based environments. However, issues can arise from system configuration, driver installation, or runtime settings. This guide provides step-by-step methods for identifying and resolving the most common ones.
If the AMD GPU driver is not detected, verify that the amdgpu module is loaded:
lsmod | grep amdgpu
If the module is not present, attempt to load it manually:
sudo modprobe amdgpu
If you encounter errors, check the kernel log for driver loading issues (use sudo dmesg if your system restricts access to the kernel log):
dmesg | grep amdgpu
The output shows any problems that occurred during driver initialization.
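The module check can also be scripted. The sketch below is a hypothetical helper; amdgpu_loaded is not a toolkit command. It reads /proc/modules, which is the file lsmod formats. The path is an optional argument only so the check can be tested against a sample file.

```shell
#!/bin/sh
# Hypothetical helper, not part of the AMD Container Toolkit.
# Succeeds (exit status 0) if the amdgpu module appears in /proc/modules,
# the file that lsmod formats. The optional argument overrides the path,
# which is only useful for testing against a sample file.
amdgpu_loaded() {
    modules_file=${1:-/proc/modules}
    # Each line of /proc/modules begins with the module name and a space.
    grep -q '^amdgpu ' "$modules_file"
}
```

For example, amdgpu_loaded || sudo modprobe amdgpu loads the module only when it is missing. If the module still does not appear afterwards, check the kernel log as described above.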
If GPU devices are not visible inside containers:
- Verify GPU accessibility using rocm-smi outside the container.
- Ensure the user belongs to the following groups:
  - render
  - video
Verify your group membership:
groups $USER
If you are not a member, add yourself to the necessary groups:
sudo usermod -a -G render,video $USER
Note: Log out and back in for the changes to take effect.
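The membership check can be automated so that usermod runs only when a group is actually missing. The function below is a hypothetical sketch, not a toolkit command. It takes a space-separated group list, such as the output of id -nG, and prints whichever of render and video are absent.

```shell
#!/bin/sh
# Hypothetical helper, not part of the AMD Container Toolkit.
# Prints the GPU-related groups (render, video) that are missing from a
# space-separated group list, such as the output of `id -nG "$USER"`.
# Prints an empty line when both groups are present.
missing_gpu_groups() {
    missing=""
    for g in render video; do
        case " $1 " in
            *" $g "*) ;;                              # already a member
            *) missing="${missing:+$missing }$g" ;;   # append with a space separator
        esac
    done
    printf '%s\n' "$missing"
}
```

For example:
missing=$(missing_gpu_groups "$(id -nG "$USER")")
[ -z "$missing" ] || sudo usermod -a -G "$(echo "$missing" | tr ' ' ',')" "$USER"
Note that id -nG "$USER" reads the group database, so it reports a new membership immediately. By contrast, id -nG with no argument reports the groups of the current session, which change only after you log back in.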
If Docker fails to restart after configuring the AMD runtime, inspect the Docker logs:
sudo journalctl -u docker
Look for errors related to:
- Container runtime conflicts
- GPU device issues
- Improper /etc/docker/daemon.json configuration
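The daemon.json and runtime checks in this section can be scripted. The sketch below is hypothetical: json_file_ok, has_amd_runtime, and runtime_registered are illustrative names, not toolkit commands. It makes two assumptions. First, python3 must be installed, because its json.tool module serves as a syntax checker. Second, docker info must print its default human-readable output, which lists the loaded runtimes on a Runtimes: line. The path it searches for is the one used in the AMD runtime configuration shown in this guide.

```shell
#!/bin/sh
# Hypothetical checks, not part of Docker or the AMD Container Toolkit.
# Assumes python3 is installed; json.tool is used only as a JSON syntax checker.

# Succeeds if the file parses as JSON. Docker will not start with a malformed daemon.json.
json_file_ok() {
    python3 -m json.tool "$1" >/dev/null
}

# Succeeds if the file mentions the AMD runtime binary path.
# This is a plain text search, not a structural check of the "runtimes" object.
has_amd_runtime() {
    grep -q '"/usr/bin/amd-container-runtime"' "$1"
}

# Reads `docker info` output on stdin and succeeds if the "Runtimes:" line
# lists the runtime named in $1 as a whole field (so "amd" does not match "amd-foo").
runtime_registered() {
    awk -v want="$1" '
        $1 == "Runtimes:" {
            for (i = 2; i <= NF; i++)
                if ($i == want) found = 1
        }
        END { exit !found }'
}
```

For example:
json_file_ok /etc/docker/daemon.json && has_amd_runtime /etc/docker/daemon.json || echo "daemon.json is malformed or missing the AMD runtime"
docker info 2>/dev/null | runtime_registered amd || echo "the running daemon has not loaded the amd runtime"
json_file_ok also works on the CDI specification at /etc/cdi/amd.json.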
Verify that the runtime path is correctly set for AMD:
cat /etc/docker/daemon.json
If Docker does not recognize the AMD runtime, check that this file defines the runtime as follows:
{
    "runtimes": {
        "amd": {
            "path": "/usr/bin/amd-container-runtime",
            "runtimeArgs": []
        }
    }
}
If the configuration is missing or incorrect, regenerate it and restart Docker:
sudo amd-ctk configure runtime
sudo systemctl restart docker
If Docker does not recognize the GPU when it is requested through CDI (Container Device Interface), regenerate the CDI specification:
sudo amd-ctk cdi generate --output=/etc/cdi/amd.json
Inspect the generated specification:
cat /etc/cdi/amd.json
If issues persist, restart Docker:
sudo systemctl restart docker
Docker Desktop on Linux is not supported for GPU workloads. You may see:
docker: Error response from daemon: error gathering device information while adding custom device "/dev/kfd": no such file or directory
Why: Docker Desktop on Linux runs the Docker daemon inside a VM (or similar isolated context). That VM does not have the host's /dev/kfd and /dev/dri devices mounted, so containers started by that daemon cannot access them.
Workaround: Install Docker via the docker.io package or Docker's official repository so the daemon runs on the host and can expose these devices to containers. Alternatively, quit Docker Desktop and use Docker installed on the host.
This applies to any run that relies on host GPU devices (e.g. docker run --device=/dev/kfd --device=/dev/dri ... or docker run --runtime=amd -e AMD_VISIBLE_DEVICES=...).
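To see which engine the docker CLI is talking to, run docker context show, which prints the active context. Docker Desktop for Linux uses its own context named desktop-linux, and docker context use default switches the CLI back to the host daemon. The helper below is a hypothetical pre-flight sketch, not a toolkit command. It combines that check with a test for the host device nodes. The device paths are optional arguments only so the function can be tested.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of Docker or the AMD Container Toolkit.
# $1: active Docker context name, e.g. "$(docker context show)".
# $2, $3: device nodes to look for (default: /dev/kfd and /dev/dri).
# Prints "ok" and succeeds, or prints the first problem found and fails.
gpu_preflight() {
    context=$1
    kfd=${2:-/dev/kfd}
    dri=${3:-/dev/dri}
    if [ "$context" = "desktop-linux" ]; then
        echo "docker CLI points at Docker Desktop; run: docker context use default"
        return 1
    fi
    for dev in "$kfd" "$dri"; do
        if [ ! -e "$dev" ]; then
            echo "missing on host: $dev"
            return 1
        fi
    done
    echo "ok"
}
```

For example, gpu_preflight "$(docker context show)" runs the check before a docker run --runtime=amd invocation. A missing /dev/kfd on the host points back to the driver checks at the start of this guide rather than to Docker.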
The AMD container runtime (amd-container-runtime) logs events and errors to the following location:
/var/log/amd-container-runtime.log
You can view the log in real time using:
sudo tail -f /var/log/amd-container-runtime.log
This log captures detailed interactions between Docker and the AMD container runtime, including:
- Runtime initialization
- GPU device injection and allocation
- OCI specification modifications
- Exclusive GPU enforcement errors
If a container fails to start with the AMD runtime, this log will contain the specific error (e.g. GPUs [0] are exclusive and already in use), even when Docker only shows a generic runtime failure message.
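When Docker reports only a generic runtime failure, pulling the most recent error lines out of the runtime log is often the quickest next step. The function below is a hypothetical sketch, not a toolkit command. Its match patterns (error, exclusive) are assumptions based on the example message above, so widen them if your log uses other wording. If the log is not readable by your user, run the check from a root shell.

```shell
#!/bin/sh
# Hypothetical helper, not part of the AMD Container Toolkit.
# Usage: runtime_log_errors [LOGFILE] [N]
# Prints the last N lines (default 20) of the runtime log that mention an
# error or an exclusive-GPU conflict. Matching is case-insensitive, and the
# patterns are assumptions; adjust them to the messages you see.
runtime_log_errors() {
    log=${1:-/var/log/amd-container-runtime.log}
    n=${2:-20}
    grep -iE 'error|exclusive' "$log" | tail -n "$n"
}
```

For example, running runtime_log_errors right after a failed docker run --runtime=amd shows whether the failure was an exclusive-GPU conflict such as GPUs [0] are exclusive and already in use.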
Note
The amd-ctk CLI tool prints errors directly to the terminal (not to a log file). For verbose debug output from amd-ctk, use the --debug (or -d) flag:
amd-ctk --debug gpu-tracker status
amd-ctk --debug cdi validate
This prints debug-level messages to stderr, which can help diagnose GPU enumeration, tracker state, or CDI specification issues.
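Because amd-ctk writes its debug messages to stderr, you can keep a copy for a bug report by redirecting stderr to a file, for example amd-ctk --debug cdi validate 2> amd-ctk-debug.log. The wrapper below is a hypothetical sketch, not a toolkit command. It saves stderr to a file, then replays it on the terminal and preserves the command's exit status.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of the AMD Container Toolkit.
# Usage: save_stderr LOGFILE COMMAND [ARGS...]
# Runs COMMAND with its stderr written to LOGFILE (stdout is left alone),
# then copies the saved text to the terminal's stderr and returns
# COMMAND's exit status.
save_stderr() {
    log=$1
    shift
    "$@" 2>"$log"
    cmd_status=$?
    cat "$log" >&2
    return "$cmd_status"
}
```

For example, save_stderr amd-ctk-debug.log amd-ctk --debug gpu-tracker status writes the debug output to amd-ctk-debug.log while still showing it. Because the messages are replayed only after the command finishes, this suits short diagnostic commands rather than long-running ones.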
List Available Devices:
amd-ctk cdi list
Check Runtime Configuration:
cat /etc/docker/daemon.json
Inspect Docker Logs:
sudo journalctl -u docker
If the above steps do not resolve your issue:
- Validate your amdgpu driver installation with:
  rocminfo
- Verify GPU accessibility with:
  rocm-smi
- Consult the official AMD Container Toolkit documentation or reach out to the support community for advanced troubleshooting.
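When you escalate an issue, it helps to attach the output of the checks from this guide. The collector below is a hypothetical sketch; run_into and collect_diagnostics are illustrative names, not toolkit commands. It saves each command's output to its own file. Commands that are not installed are recorded as skipped, and a failing command is noted with its exit status instead of stopping the run. Files your user cannot read are recorded with the error message, so you can rerun from a root shell if needed.

```shell
#!/bin/sh
# Hypothetical support-bundle collector, not part of the AMD Container Toolkit.

# Usage: run_into OUTFILE COMMAND [ARGS...]
# Runs COMMAND with stdout and stderr sent to OUTFILE. A missing command is
# recorded as skipped, and a non-zero exit status is appended to the file
# instead of stopping the collection.
run_into() {
    outfile=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$outfile" 2>&1 || echo "exit status: $?" >>"$outfile"
    else
        echo "skipped: $1 not found" >"$outfile"
    fi
}

# Usage: collect_diagnostics OUTPUT_DIR
# Saves the output of the checks from this guide, one file per command.
collect_diagnostics() {
    dir=$1
    mkdir -p "$dir" || return 1
    run_into "$dir/lsmod-amdgpu.txt" sh -c 'lsmod | grep amdgpu'
    run_into "$dir/groups.txt" id -nG
    run_into "$dir/rocminfo.txt" rocminfo
    run_into "$dir/rocm-smi.txt" rocm-smi
    run_into "$dir/daemon.json" cat /etc/docker/daemon.json
    run_into "$dir/cdi-list.txt" amd-ctk cdi list
    run_into "$dir/runtime-log-tail.txt" tail -n 200 /var/log/amd-container-runtime.log
}
```

For example, collect_diagnostics "amd-diag-$(date +%Y%m%d)" creates a dated directory that you can archive and attach to a support request. The Docker journal is not included because it usually requires sudo; add the output of sudo journalctl -u docker separately.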