Commit 463f8f0

feat(launcher): expose GPUs to eval container via NVIDIA_VISIBLE_DEVICES
Benchmarks like compute-eval need to compile and execute CUDA code inside the eval container. Without GPU access, nvcc can't detect the target architecture and compiled binaries fail with cudaErrorInsufficientDriver.

Export NVIDIA_VISIBLE_DEVICES=all before the eval srun and pass it through to the container. This makes pyxis/enroot expose the parent job's GPUs to the eval container.

Validated with compute-eval on HSG: pass@1 went from 0% (no GPU) to 51.25% (with GPU access).

Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
1 parent 4d51bee commit 463f8f0

File tree

1 file changed: +8 −1 lines changed
  • packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py

Lines changed: 8 additions & 1 deletion
@@ -1014,11 +1014,18 @@ def _create_slurm_sbatch_script(
         aux_extra_env_names.extend(endpoint_vars)

         s += "# evaluation client\n"
+        s += "export NVIDIA_VISIBLE_DEVICES=all\n"
         s += "srun --mpi pmix --overlap "
         s += '--nodelist "${PRIMARY_NODE}" --nodes 1 --ntasks 1 '
         s += "--container-image {} ".format(eval_image)
         # Combine eval env vars with auxiliary endpoint env vars
-        all_eval_env_names = sorted(set(list(eval_env_vars.keys()) + aux_extra_env_names))
+        all_eval_env_names = sorted(
+            set(
+                list(eval_env_vars.keys())
+                + aux_extra_env_names
+                + ["NVIDIA_VISIBLE_DEVICES"]
+            )
+        )
         if all_eval_env_names:
             s += "--container-env {} ".format(",".join(all_eval_env_names))
         if not cfg.execution.get("mounts", {}).get("mount_home", True):
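The effect of the change can be sketched in isolation: the GPU variable is exported in the generated sbatch script, and its name is added to the deduplicated, sorted list passed to pyxis via --container-env so the container inherits it. The dict and list contents below are hypothetical placeholders, not the launcher's real configuration.

```python
# Hypothetical inputs standing in for the launcher's config-derived values.
eval_env_vars = {"HF_TOKEN": "xxx"}          # per-task eval env vars (placeholder)
aux_extra_env_names = ["API_ENDPOINT"]       # auxiliary endpoint var names (placeholder)

# Mirror of the patched logic: always include NVIDIA_VISIBLE_DEVICES,
# dedupe with set(), and sort for a stable --container-env list.
all_eval_env_names = sorted(
    set(
        list(eval_env_vars.keys())
        + aux_extra_env_names
        + ["NVIDIA_VISIBLE_DEVICES"]
    )
)

# Sketch of the relevant generated sbatch fragment.
script = "export NVIDIA_VISIBLE_DEVICES=all\n"
script += "srun --container-env {} ...".format(",".join(all_eval_env_names))
print(script)
```

Because the variable is both exported in the parent job's shell and named in --container-env, pyxis/enroot forwards it into the eval container, where the NVIDIA container runtime interprets `all` as "expose every GPU of the parent allocation".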
