Skip to content

Latest commit

 

History

History
215 lines (121 loc) · 5.56 KB

File metadata and controls

215 lines (121 loc) · 5.56 KB

Troubleshooting

The AMD Container Toolkit is designed to integrate smoothly into Docker-based environments. However, issues may arise due to system configurations, driver installations, or runtime settings. This guide aims to provide detailed, step-by-step troubleshooting methods to identify and resolve common issues effectively.

Common Issues:

1. Driver Not Loaded

If the AMD GPU driver is not detected, verify that the amdgpu module is loaded:

lsmod | grep amdgpu

If the module is not present, attempt to load it manually:

sudo modprobe amdgpu

If you encounter errors, check the kernel logs for driver loading issues:

dmesg | grep amdgpu

This will provide information about any problems during the driver initialization.

2. Permission Denied Errors

If GPU devices are not visible inside containers:

  • Verify GPU accessibility using rocm-smi outside the container.
  • Ensure the user belongs to the following groups:
    • render
    • video

Verify your group membership:

groups $USER

If you are not a member, add yourself to the necessary groups:

sudo usermod -a -G render,video $USER

Note: Log out and back in for the changes to take effect.

3. Docker Daemon Restart Failure

If Docker fails to restart after configuring the AMD runtime, inspect the Docker logs:

sudo journalctl -u docker

Look for errors related to:

  • Container runtime conflicts
  • GPU device issues
  • Improper /etc/docker/daemon.json configuration

Verify that the runtime path is correctly set for AMD:

cat /etc/docker/daemon.json

4. Runtime Configuration Issues

If Docker does not recognize the AMD runtime, validate the Docker configuration:

cat /etc/docker/daemon.json

Ensure the runtime is set correctly:

{
   "runtimes": {
       "amd": {
           "path": "/usr/bin/amd-container-runtime",
           "runtimeArgs": []
       }
   }
}

If the configuration is missing or incorrect, regenerate it and restart Docker:

sudo amd-ctk configure runtime
sudo systemctl restart docker

5. CDI Specification Not Applied

If Docker does not recognize the GPU under CDI specifications, regenerate the CDI configuration:

sudo amd-ctk cdi generate --output=/etc/cdi/amd.json

Check the integrity of the generated specification:

cat /etc/cdi/amd.json

If issues persist, restart Docker:

sudo systemctl restart docker

6. Docker Desktop and /dev/kfd Access

Docker Desktop on Linux is not supported for GPU workloads. You may see:

docker: Error response from daemon: error gathering device information while adding custom device "/dev/kfd": no such file or directory

Why: Docker Desktop on Linux runs the Docker daemon inside a VM (or similar isolated context). That VM does not have the host's /dev/kfd and /dev/dri devices mounted, so containers started by that daemon cannot access them.

Workaround: Install Docker via the docker.io package or Docker's official repository so the daemon runs on the host and can expose these devices to containers. Alternatively, quit Docker Desktop and use Docker installed on the host.

This applies to any run that relies on host GPU devices (e.g. docker run --device=/dev/kfd --device=/dev/dri ... or docker run --runtime=amd -e AMD_VISIBLE_DEVICES=...).

Log File Reference

The AMD container runtime (amd-container-runtime) logs events and errors to the following location:

/var/log/amd-container-runtime.log

You can view logs in real-time using:

sudo tail -f /var/log/amd-container-runtime.log

This log captures detailed interactions between Docker and the AMD container runtime, including:

  • Runtime initialization
  • GPU device injection and allocation
  • OCI specification modifications
  • Exclusive GPU enforcement errors

If a container fails to start with the AMD runtime, this log will contain the specific error (e.g. GPUs [0] are exclusive and already in use), even when Docker only shows a generic runtime failure message.

Note

The amd-ctk CLI tool prints errors directly to the terminal (not to a log file). For verbose debug output from amd-ctk, use the --debug (or -d) flag:

amd-ctk --debug gpu-tracker status
amd-ctk --debug cdi validate

This prints debug-level messages to stderr, which can help diagnose GPU enumeration, tracker state, or CDI specification issues.

Diagnostic Commands

  • List Available Devices:

    amd-ctk cdi list
  • Check Runtime Configuration:

    cat /etc/docker/daemon.json
  • Inspect Docker Logs:

    sudo journalctl -u docker

Next Steps

If the above steps do not resolve your issue:

  • Validate your amdgpu driver installation with:
rocminfo
  • Verify GPU accessibility with:
rocm-smi
  • Consult the official AMD Container Toolkit documentation or reach out to the support community for advanced troubleshooting.