@@ -72,19 +72,22 @@ multi-GPU machine, all rendering contends on a **single GPU** — even if the
 host has 8 GPUs. This inflates per-worker render time by ~3x (e.g. 17ms serial
 → 54ms with 8 workers sharing one GPU's EGL queue).
 
-**Root cause:** Inside Docker or SLURM containers, the NVIDIA container runtime
-only exposes the GPU(s) assigned to the job to EGL. `eglQueryDevicesEXT()`
-returns 1 device regardless of how many physical GPUs the host has.
-Setting `MUJOCO_EGL_DEVICE_ID` or `EGL_DEVICE_ID` to anything other than 0
-raises:
+**Common root causes:** Inside Docker or SLURM containers, the NVIDIA
+container runtime may expose only a subset of devices to EGL, or a minimal CUDA
+image may omit the NVIDIA graphics userspace libraries entirely. In those
+cases, `eglQueryDevicesEXT()` can return fewer devices than the node has, or
+EGL initialization can fail even though CUDA and `nvidia-smi` work. Setting
+`MUJOCO_EGL_DEVICE_ID` or `EGL_DEVICE_ID` to an unavailable EGL device raises:
 
 ```
 RuntimeError: MUJOCO_EGL_DEVICE_ID must be an integer between 0 and 0 (inclusive), got 1.
 ```
 
-Unsetting `CUDA_VISIBLE_DEVICES` in the worker does **not** help — the
-container isolation happens at the NVIDIA driver/runtime level, below the
-environment variable.
+Unsetting `CUDA_VISIBLE_DEVICES` in the worker does **not** help once the
+container runtime has hidden devices from the driver. Conversely,
+`NVIDIA_DRIVER_CAPABILITIES=compute,utility` by itself does not prove EGL is
+impossible: if matching NVIDIA EGL/GLVND userspace libraries are installed
+inside the container, EGL may still work.
 
 **Note on variable naming:** dm_control uses `MUJOCO_EGL_DEVICE_ID` internally
 (which maps to the same thing as MuJoCo's variable). Historically there was
@@ -98,17 +101,42 @@ for the unification discussion.
 
 **Workarounds:**
 
-1. **Configure container for full GPU access.** If you control the container
+1. **Verify the graphics userspace stack first.** Minimal CUDA containers often
+   omit the EGL/GLVND loader packages and NVIDIA graphics libraries. On
+   Debian/Ubuntu images, install the generic runtime packages:
+
+   ```bash
+   sudo apt-get update
+   sudo apt-get install -y libegl-dev libglvnd0 libglx0 libgles2
+   ```
+
+   Then verify the NVIDIA pieces are visible and match the host driver:
+
+   ```bash
+   nvidia-smi --query-gpu=driver_version,name --format=csv,noheader | head
+   ldconfig -p | grep -E 'libEGL_nvidia|libnvidia-eglcore|libGLX_nvidia'
+   ls /usr/share/glvnd/egl_vendor.d/10_nvidia.json
+   ```
+
+   If the NVIDIA libraries are missing, install a matching
+   `libnvidia-gl-<driver-version>` package or provide a matching userspace
+   bundle and point `LD_LIBRARY_PATH` / `ldconfig` at it. The GLVND vendor JSON
+   should point EGL at `libEGL_nvidia.so.0`.
+
+2. **Configure container for full GPU access.** If you control the container
    runtime, set `NVIDIA_VISIBLE_DEVICES=all` and
-   `NVIDIA_DRIVER_CAPABILITIES=all` so EGL can see all GPUs. Then assign
+   include `graphics` in `NVIDIA_DRIVER_CAPABILITIES` (or use `all`) so the
+   driver stack and all intended GPUs are mounted. Then assign
    `MUJOCO_EGL_DEVICE_ID=<worker_idx % num_gpus>` per worker process
    **before** dm_control is imported (the EGL display is created at import
-   time).
+   time). For LIBERO / robosuite environments, prefer passing
+   `render_gpu_device_id=<worker_idx % num_gpus>` to the environment
+   constructor.
 
-2. **Run outside containers.** On bare metal, `eglQueryDevicesEXT()` correctly
+3. **Run outside containers.** On bare metal, `eglQueryDevicesEXT()` correctly
    returns all GPUs (plus the X server display, if any).
 
-3. **Reduce rendering overhead.** If multi-GPU rendering is not possible:
+4. **Reduce rendering overhead.** If multi-GPU rendering is not possible:
    - Lower the rendering resolution (e.g. 64x64 instead of 84x84)
    - Render at a lower frequency than the simulation step (frame-skip)
    - Use state-only observations where possible — the IPC overhead is small
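The per-worker assignment described in the "Configure container for full GPU access" workaround could be sketched roughly as follows. This is a minimal illustration, not code from the patch: `assign_egl_device` and `worker_main` are hypothetical helper names, and `num_gpus` is assumed to be the number of EGL devices actually visible inside the container, which may be fewer than the host has.

```python
import os


def assign_egl_device(worker_idx: int, num_gpus: int) -> int:
    """Pick an EGL device round-robin and export it for this process.

    Hypothetical helper: num_gpus is assumed to be the count of EGL
    devices visible inside the container.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    device_id = worker_idx % num_gpus
    # Set this before dm_control is imported, because the EGL display
    # is created at import time.
    os.environ["MUJOCO_EGL_DEVICE_ID"] = str(device_id)
    return device_id


def worker_main(worker_idx: int, num_gpus: int) -> None:
    """Hypothetical worker entry point (requires dm_control to be installed)."""
    assign_egl_device(worker_idx, num_gpus)
    # Deliberately imported after the environment variable is set.
    from dm_control import suite  # noqa: F401
    # ... build the environment and run the rollout loop here ...
```

Keeping the `dm_control` import inside the worker function, rather than at module top level, is one way to make sure the variable is set first in each spawned process.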