Commit 92a3f7f

[Doc] Clarify headless MuJoCo EGL dependencies
ghstack-source-id: 6f802b7 Pull-Request: #3879
1 parent 775c527 commit 92a3f7f

2 files changed

Lines changed: 66 additions & 19 deletions

knowledge_base/DM_CONTROL_INSTALLATION.md

Lines changed: 41 additions & 13 deletions
@@ -72,19 +72,22 @@ multi-GPU machine, all rendering contends on a **single GPU** — even if the
 host has 8 GPUs. This inflates per-worker render time by ~3x (e.g. 17ms serial
 → 54ms with 8 workers sharing one GPU's EGL queue).
 
-**Root cause:** Inside Docker or SLURM containers, the NVIDIA container runtime
-only exposes the GPU(s) assigned to the job to EGL. `eglQueryDevicesEXT()`
-returns 1 device regardless of how many physical GPUs the host has.
-Setting `MUJOCO_EGL_DEVICE_ID` or `EGL_DEVICE_ID` to anything other than 0
-raises:
+**Common root causes:** Inside Docker or SLURM containers, the NVIDIA
+container runtime may expose only a subset of devices to EGL, or a minimal CUDA
+image may omit the NVIDIA graphics userspace libraries entirely. In those
+cases, `eglQueryDevicesEXT()` can return fewer devices than the node has, or
+EGL initialization can fail even though CUDA and `nvidia-smi` work. Setting
+`MUJOCO_EGL_DEVICE_ID` or `EGL_DEVICE_ID` to an unavailable EGL device raises:
 
 ```
 RuntimeError: MUJOCO_EGL_DEVICE_ID must be an integer between 0 and 0 (inclusive), got 1.
 ```
 
-Unsetting `CUDA_VISIBLE_DEVICES` in the worker does **not** help — the
-container isolation happens at the NVIDIA driver/runtime level, below the
-environment variable.
+Unsetting `CUDA_VISIBLE_DEVICES` in the worker does **not** help once the
+container runtime has hidden devices from the driver. Conversely,
+`NVIDIA_DRIVER_CAPABILITIES=compute,utility` by itself does not prove EGL is
+impossible: if matching NVIDIA EGL/GLVND userspace libraries are installed
+inside the container, EGL may still work.
 
 **Note on variable naming:** dm_control uses `MUJOCO_EGL_DEVICE_ID` internally
 (which maps to the same thing as MuJoCo's variable). Historically there was
@@ -98,17 +101,42 @@ for the unification discussion.
 
 **Workarounds:**
 
-1. **Configure container for full GPU access.** If you control the container
+1. **Verify the graphics userspace stack first.** Minimal CUDA containers often
+omit the EGL/GLVND loader packages and NVIDIA graphics libraries. On
+Debian/Ubuntu images, install the generic runtime packages:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y libegl-dev libglvnd0 libglx0 libgles2
+```
+
+Then verify the NVIDIA pieces are visible and match the host driver:
+
+```bash
+nvidia-smi --query-gpu=driver_version,name --format=csv,noheader | head
+ldconfig -p | grep -E 'libEGL_nvidia|libnvidia-eglcore|libGLX_nvidia'
+ls /usr/share/glvnd/egl_vendor.d/10_nvidia.json
+```
+
+If the NVIDIA libraries are missing, install a matching
+`libnvidia-gl-<driver-version>` package or provide a matching userspace
+bundle and point `LD_LIBRARY_PATH` / `ldconfig` at it. The GLVND vendor JSON
+should point EGL at `libEGL_nvidia.so.0`.
+
+2. **Configure container for full GPU access.** If you control the container
 runtime, set `NVIDIA_VISIBLE_DEVICES=all` and
-`NVIDIA_DRIVER_CAPABILITIES=all` so EGL can see all GPUs. Then assign
+include `graphics` in `NVIDIA_DRIVER_CAPABILITIES` (or use `all`) so the
+driver stack and all intended GPUs are mounted. Then assign
 `MUJOCO_EGL_DEVICE_ID=<worker_idx % num_gpus>` per worker process
 **before** dm_control is imported (the EGL display is created at import
-time).
+time). For LIBERO / robosuite environments, prefer passing
+`render_gpu_device_id=<worker_idx % num_gpus>` to the environment
+constructor.
 
-2. **Run outside containers.** On bare metal, `eglQueryDevicesEXT()` correctly
+3. **Run outside containers.** On bare metal, `eglQueryDevicesEXT()` correctly
 returns all GPUs (plus the X server display, if any).
 
-3. **Reduce rendering overhead.** If multi-GPU rendering is not possible:
+4. **Reduce rendering overhead.** If multi-GPU rendering is not possible:
 - Lower the rendering resolution (e.g. 64x64 instead of 84x84)
 - Render at a lower frequency than the simulation step (frame-skip)
 - Use state-only observations where possible — the IPC overhead is small
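
The per-worker device assignment described in the "Configure container for full GPU access" workaround could be sketched roughly as follows. This is a minimal illustration, not code from the commit: `egl_device_env` and `configure_worker` are hypothetical helper names, and the sketch assumes the container already exposes `num_gpus` EGL devices. The only facts taken from the docs are the `MUJOCO_GL` and `MUJOCO_EGL_DEVICE_ID` variables, the `worker_idx % num_gpus` rule, and the requirement to set them before import.

```python
import os


def egl_device_env(worker_idx: int, num_gpus: int) -> dict:
    """Build the rendering environment variables for one worker (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if worker_idx < 0:
        raise ValueError("worker_idx must be non-negative")
    # Round-robin assignment, as in the docs: worker_idx % num_gpus.
    device_id = worker_idx % num_gpus
    return {"MUJOCO_GL": "egl", "MUJOCO_EGL_DEVICE_ID": str(device_id)}


def configure_worker(worker_idx: int, num_gpus: int) -> None:
    """Apply the variables in the current process (hypothetical helper).

    Call this at the very start of the worker, before dm_control is imported,
    because the EGL display is created at import time.
    """
    os.environ.update(egl_device_env(worker_idx, num_gpus))
    # Only import the renderer after the environment is set, for example:
    # from dm_control import suite
```

With 8 visible devices, worker 9 would be mapped to device 1. For LIBERO / robosuite environments the docs instead suggest passing the same index as `render_gpu_device_id` to the environment constructor.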

knowledge_base/MUJOCO_INSTALLATION.md

Lines changed: 25 additions & 6 deletions
@@ -23,11 +23,25 @@ To do so, MuJoCo will use one of the following backends: glfw, osmesa or egl.
 Of these, glfw will not work in headless environments. On the other hand, osmesa
 will not run on GPU. Therefore, our advice is to use the egl backend.
 
-If you have a sudo access on your machine, you can install the following dependencies
-to enable fast rendering:
+If you have sudo access on your machine, install the generic OpenGL/EGL
+dependencies needed by MuJoCo:
 ```shell
-$ sudo apt-get install libglfw3 libglew2.0 libgl1-mesa-glx libosmesa6
+$ sudo apt-get install libglfw3 libglew2.0 libglx-mesa0 libosmesa6 \
+libegl-dev libglvnd0 libglx0 libgles2
 ```
+For NVIDIA EGL rendering, the NVIDIA graphics userspace libraries must also
+be visible and compatible with the host driver. In minimal CUDA containers,
+`nvidia-smi` can work even when the graphics stack is missing. Check for:
+
+```shell
+$ ldconfig -p | grep -E 'libEGL_nvidia|libnvidia-eglcore|libGLX_nvidia'
+$ ls /usr/share/glvnd/egl_vendor.d/10_nvidia.json
+```
+
+If these are absent, install a `libnvidia-gl-<driver-version>` package matching
+the host driver, or provide a matching NVIDIA userspace bundle and add its
+library directory to `LD_LIBRARY_PATH` / `ldconfig`. The GLVND vendor file
+should point EGL at `libEGL_nvidia.so.0`.
 If you don't, these libraries can be installed via conda but be aware of the fact
 that this is not the intended workflow and things may not work as expected:
 ```shell
@@ -39,7 +53,8 @@ $ conda install -c menpo glfw3
 ```
 
 In both cases, when running your code, you will want to tell mujoco which backend to use.
-This can be done by setting the appropriate environment variables.
+This can be done by setting the appropriate environment variables. The
+variables must be set before MuJoCo / dm_control / robosuite is imported.
 ```shell
 $ conda env config vars set MUJOCO_GL=egl PYOPENGL_PLATFORM=egl
 $ conda deactivate && conda activate mujoco_env
@@ -227,6 +242,10 @@ RuntimeError: Failed to initialize OpenGL
 _Solution_: Make sure you have installed mujoco and all its dependencies (see instructions above).
 Make sure you have set the `MUJOCO_GL=egl`.
 Make sure you have a GPU accessible on your machine.
+In containers, also verify the NVIDIA EGL/GLVND userspace libraries:
+`libEGL_nvidia`, `libnvidia-eglcore`, `libGLX_nvidia`, and
+`/usr/share/glvnd/egl_vendor.d/10_nvidia.json`. A CUDA-capable container
+without these files can still run `nvidia-smi` but fail headless EGL.
 
 12. `cannot find -lGL: No such file or directory`
 
@@ -236,9 +255,9 @@ RuntimeError: Failed to initialize OpenGL
 RuntimeError: Failed to initialize OpenGL
 ```
 
-_Solution_: Install libEGL:
+_Solution_: Install libEGL and the GLVND runtime:
 
-- Ubuntu: `sudo apt install libegl-dev libegl`
+- Ubuntu: `sudo apt install libegl-dev libegl1 libglvnd0 libglx0 libgles2`
 - CentOS: `sudo yum install mesa-libEGL mesa-libEGL-devel`
 - Conda: `conda install -c anaconda mesa-libegl-cos6-x86_64`
 
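
The container checks added in this commit (the `ldconfig -p` grep and the GLVND vendor file lookup) could also be scripted, for instance as a pre-flight step before launching workers. The sketch below is an illustration only: the function names are hypothetical, and it assumes the vendor file follows the usual GLVND ICD layout with an `ICD.library_path` key. The library names and the vendor file path are the ones listed in the docs above.

```python
import json
import shutil
import subprocess
from pathlib import Path

# Library name stems and vendor file path as listed in the docs.
REQUIRED_LIBS = ("libEGL_nvidia", "libnvidia-eglcore", "libGLX_nvidia")
VENDOR_JSON = Path("/usr/share/glvnd/egl_vendor.d/10_nvidia.json")


def missing_nvidia_libs(ldconfig_output: str) -> list:
    """Return the required library stems absent from `ldconfig -p` output."""
    return [lib for lib in REQUIRED_LIBS if lib not in ldconfig_output]


def vendor_library(vendor_json_text: str):
    """Return the library named by a GLVND vendor file, or None if absent.

    Assumes the layout {"ICD": {"library_path": "..."}}.
    """
    data = json.loads(vendor_json_text)
    return data.get("ICD", {}).get("library_path")


def check_container(vendor_json: Path = VENDOR_JSON) -> list:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    ldconfig = shutil.which("ldconfig")
    if ldconfig is None:
        problems.append("ldconfig not found on PATH")
    else:
        result = subprocess.run(
            [ldconfig, "-p"], capture_output=True, text=True, check=False
        )
        problems += [f"missing library: {lib}" for lib in missing_nvidia_libs(result.stdout)]
    if not vendor_json.is_file():
        problems.append(f"missing vendor file: {vendor_json}")
    elif vendor_library(vendor_json.read_text()) is None:
        problems.append(f"no ICD.library_path in {vendor_json}")
    return problems
```

A passing check only shows that the files are present. It does not confirm that they match the host driver version, which the docs call out as a separate requirement.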
