Setting up environments for user applications on an HPC cluster is often tedious and diverts attention from the application itself. Containerization is a great way to simplify this process. HPC clusters often use the Slurm workload manager along with containerization tools such as Singularity/Apptainer, Rootless Docker (environment module), or Enroot+Pyxis for easier environment management.
Based on my experience working with Slurm and all these containerization options, I personally prefer Slurm with Enroot+Pyxis as it offers the simplest workflow for users familiar with Docker, while also ensuring minimal performance overhead.
The setup instructions are already documented in the official Pyxis repository. The Enroot documentation also contains a detailed usage guide for single-node tasks. However, there is no documentation for running multi-node tasks directly with Enroot without Pyxis. Using Enroot directly without Pyxis may be needed when you have direct (bare metal) access to multiple Ubuntu nodes and do not want to set up a scheduler or use a workload manager like Slurm. In such cases, Enroot alone can serve as a lightweight and effective containerization solution for HPC environments.
This (unofficial) document describes the minimal setup required for running multi-node tasks directly with Enroot, without Pyxis or Slurm. Please note that running multi-node tasks with Enroot alone is more of a hack than a fool-proof solution; the recommended method for multi-node tasks remains Enroot+Pyxis.
- GPU Hardware: Two nodes, each with eight H200 NVL GPUs
The commands below can be easily adapted to an arbitrary number of nodes (or other NVIDIA GPUs).
- Network Hardware: Each node is equipped with four ConnectX-7 NICs, providing eight InfiniBand connections per node through InfiniBand switches
Ethernet (RoCE) should also work in theory.
- OS: Ubuntu 22.04.5 LTS
Other Linux distributions should also work.
- Pre-installed Software: NVIDIA Driver, Docker, NVIDIA Container Toolkit, InfiniBand Driver (DOCA-Host or MLNX_OFED (legacy)), and optionally GDR Copy (Driver).
I'm not sure whether Docker and the NVIDIA Container Toolkit are strictly required, but we have them installed by default on all nodes.
- A running OpenSM, either on the InfiniBand switch or launched manually.
Usually it is already running on the InfiniBand switch; a quick way to verify this is shown right after this list.
- All nodes have IP addresses assigned within a private network
We assume a VPN connection is required to access the nodes.
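Before moving on, it can help to sanity-check the prerequisites above. This is only a sketch: it assumes ibstat (shipped with DOCA-Host/MLNX_OFED) is available on each node; a port state of Active indicates a subnet manager (e.g., OpenSM) is managing the fabric, while Initializing usually means none is running.
# Check NVIDIA driver
nvidia-smi
# Check InfiniBand ports; look for "State: Active" (subnet manager reachable)
ibstat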
Most multi-node clusters will have basic user accounts, NFS, and SSH configured. If not, you'll need to set these up first.
Create a user account (with the same username/UID/GID) with sudo privileges on all nodes, with the home directory set to /mnt/home/<username>. If the user already exists, skip this step.
You'll want to use tools like LDAP to manage the user account. Alternatively, you can manually create the user account on all nodes:
# Create user account
USERNAME=<username>
sudo groupadd -g 10001 ${USERNAME}  # create the group first so the GID referenced below exists
sudo useradd -m -d /mnt/home/${USERNAME} -s /bin/bash -u 10001 -g 10001 -G sudo ${USERNAME}
# Enable password-less sudo
echo '%sudo ALL=(ALL) NOPASSWD:ALL' | sudo tee -a /etc/sudoers
Set up an NFS server on the head node and mount the shared home directory on all other nodes. If NFS is already configured, ensure the necessary paths are exported and mounted correctly.
On head node:
# Install
sudo apt update
sudo apt install -y nfs-kernel-server
sudo systemctl start nfs-kernel-server.service
# Export
sudo mkdir -p /mnt/home/${USER}
sudo chown -R $(id -u):$(id -g) /mnt/home/${USER}
echo "/mnt/home *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
echo "/opt/enroot *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -aOn all other nodes:
# Install
sudo apt install -y nfs-common
# Mount
NFS_SERVER=<HEAD-NODE-IP>
sudo mkdir -p /mnt/home
sudo mkdir -p /opt/enroot
echo "$NFS_SERVER:/mnt/home /mnt/home nfs defaults 0 0" | sudo tee -a /etc/fstab
echo "$NFS_SERVER:/opt/enroot /opt/enroot nfs defaults 0 0" | sudo tee -a /etc/fstab
sudo mount -a
mount | grep nfs
Skip this step if password-less SSH is already configured.
On head node:
# Generate SSH key
ssh-keygen -t ed25519 # and press Enter multiple times to accept the default values
# Copy to shared home directory (will automatically work on all nodes due to shared home directory)
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
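It's worth confirming that password-less SSH now works from the head node to the other nodes. A minimal check (the <OTHER-NODE-IP> placeholder is only for illustration; substitute one of your node IPs):
# Should print the remote hostname without asking for a password
ssh <OTHER-NODE-IP> hostname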
Download Enroot to the shared directory. On head node, download the Enroot deb files:
cd /mnt/home/${USER}
mkdir -p enroot/deb && cd ~/enroot/deb
arch=$(dpkg --print-architecture)
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot_3.5.0-1_${arch}.deb
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot+caps_3.5.0-1_${arch}.deb # optional
On all nodes, install Enroot:
# Repeat once for each node's IP (including the head node)
IP=<IP>
ssh $IP 'cd ~/enroot/deb && sudo apt install -y ./*.deb'
You may run the ubuntu example or cuda example to test the installation. But a simple enroot version is usually sufficient.
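For example, a small loop over the node IPs (same placeholders as the hostfile later) can confirm the installation everywhere:
# Print the Enroot version on every node; adjust the IP list to your nodes
for IP in <IP1> <IP2>; do
    ssh $IP enroot version
done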
On all nodes, edit the Enroot config to use a shared container file system (the following assumes a Bash shell):
# Repeat once for each node's IP (including the head node)
IP=<IP>
# You may edit /etc/enroot/enroot.conf directly, but the following idempotent commands are recommended for consistency
# Set ENROOT_DATA_PATH to /opt/enroot/data
ssh $IP "sudo grep -q '^ENROOT_DATA_PATH[[:space:]]\+/opt/enroot/data\$' /etc/enroot/enroot.conf || sudo sed -i '/^#ENROOT_DATA_PATH[[:space:]]\+\\\${XDG_DATA_HOME}\/enroot\$/a ENROOT_DATA_PATH /opt/enroot/data' /etc/enroot/enroot.conf"
# Set ENROOT_MOUNT_HOME to yes
ssh $IP "sudo grep -q '^ENROOT_MOUNT_HOME[[:space:]]\+yes\$' /etc/enroot/enroot.conf || sudo sed -i '/^#ENROOT_MOUNT_HOME[[:space:]]\+no\$/a ENROOT_MOUNT_HOME yes' /etc/enroot/enroot.conf"On head node, create data/workspace directory and add Enroot hook for OpenMPI:
On head node, create the data/workspace directories and add an Enroot hook for OpenMPI:
# Create data directory
sudo mkdir -p /opt/enroot/data
sudo chmod 1777 /opt/enroot/data
# Create workspace directory
sudo mkdir -p /opt/enroot/workspace
sudo chmod 1777 /opt/enroot/workspace
mkdir /opt/enroot/workspace/${USER}
# Add Enroot hook for OpenMPI
sudo tee /etc/enroot/hooks.d/ompi.sh > /dev/null << 'EOF'
#!/bin/bash
echo "OMPI_MCA_orte_launch_agent=enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${ENROOT_ROOTFS##*/} orted" >> "${ENROOT_ENVIRON}"
EOF
sudo chmod +x /etc/enroot/hooks.d/ompi.sh
On head node, download the NGC HPC-Benchmarks container image to the shared directory:
cd /mnt/home/${USER}/enroot
mkdir -p sqsh && cd sqsh
enroot import docker://nvcr.io#nvidia/hpc-benchmarks:25.04
ls ./nvidia+hpc-benchmarks+25.04.sqsh
On head node, create a container with the current username as prefix (the created container will be visible on all nodes due to the ENROOT_DATA_PATH setting we configured earlier):
cd /mnt/home/${USER}/enroot/sqsh
enroot create --name ${USER}-hpc-benchmarks-25-04 nvidia+hpc-benchmarks+25.04.sqsh
ls /opt/enroot/data/${USER}-hpc-benchmarks-25-04
enroot list
# Single node MPI quick test
enroot start ${USER}-hpc-benchmarks-25-04 mpirun hostname
In the workspace directory created earlier, store a hostfile for multi-node tasks (assuming all nodes have 8 GPUs):
cd /opt/enroot/workspace/${USER}
IP_LIST=("<IP1>" "<IP2>")
for IP in "${IP_LIST[@]}"; do
echo "$IP slots=8" >> hosts.txt
done
cat hosts.txt
Run multi-node quick test (assuming 2 nodes with 8 GPUs each):
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04 mpirun -np 16 --hostfile /app/hosts.txt hostname
# should see 8 hostnames for each node
Note: The command prefix before the container name (i.e., enroot start --rw --mount /opt/enroot/workspace/${USER}:/app) must match exactly what is set in the /etc/enroot/hooks.d/ompi.sh hook. Do not modify this part of the command, or the multi-node launch will not work correctly. You can change everything after the container name (e.g., mpirun ...) though. In addition, it is highly recommended to use absolute paths in the command.
Prepare a suitable HPL.dat file for your machine.
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04
# in the container
cp hpl-linux-x86_64/sample-dat/HPL-H200-8GPUs.dat /app/
cp hpl-linux-x86_64/sample-dat/HPL-H200-16GPUs.dat /app/
# Ctrl+D to exit the container
Test single node HPL:
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04 mpirun -np 8 ./hpl.sh \
--dat /app/HPL-H200-8GPUs.dat
The result may not be optimal. You may tune the dat file, mpirun flags, and environment variables according to your machine for better HPL performance.
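As a rough illustration of such tuning (the value below is a placeholder, not a recommendation), extra environment variables can be passed to every rank through mca_base_env_list, the same mechanism used for HPL_USE_NVSHMEM in the multi-node run below; UCX_NET_DEVICES is a standard UCX variable that restricts which HCAs are used:
# Hypothetical tuning example: pin UCX to a specific InfiniBand device (check device names with ibstat)
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04 mpirun -np 8 \
    --mca mca_base_env_list "UCX_NET_DEVICES=mlx5_0:1" \
    ./hpl.sh \
    --dat /app/HPL-H200-8GPUs.dat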
Test multi-node HPL:
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04 mpirun -np 16 \
--hostfile /app/hosts.txt \
--mca mca_base_env_list "HPL_USE_NVSHMEM=0" \
./hpl.sh \
--dat /app/HPL-H200-16GPUs.dat
The result may not be optimal. You may tune the dat file, mpirun flags, and environment variables according to your machine for better HPL performance.
NVSHMEM is disabled here (HPL_USE_NVSHMEM=0) to make fewer assumptions about the network hardware. See: How to run HPL script over Ethernet.
If you have NVSHMEM correctly set up, you can enable it by removing HPL_USE_NVSHMEM=0 from the mpirun command:
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04 mpirun -np 16 \
--hostfile /app/hosts.txt \
./hpl.sh \
--dat /app/HPL-H200-16GPUs.dat
If you are using NVSHMEM on an InfiniBand network, you should have correctly set up GPUDirect RDMA (DMA-BUF, or nvidia-peermem (legacy) installed with the .run driver install, or nv_peer_memory (legacy) on GitHub) on all nodes. Otherwise the multi-node launch will fail. See more at: NVSHMEM requirements.
All done! Now you can use Enroot to run your multi-node tasks with ease!
Feel free to skip the following sections and start running your tasks!
Print sample Slurm scripts for HPL-MxP:
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04
# in the container
cat hpl-mxp-linux-x86_64/sample-slurm/hpl-mxp-enroot-1N.sub
cat hpl-mxp-linux-x86_64/sample-slurm/hpl-mxp-enroot-2N.sub
# Ctrl+D to exit the container
Test single node HPL-MxP:
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04 mpirun -np 8 \
--hostfile /app/hosts.txt \
./hpl-mxp.sh \
--n 380000 --nb 2048 --nprow 4 --npcol 2 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
Tuning would be required to achieve the best performance on your machine.
Test multi-node HPL-MxP:
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04 mpirun -np 16 \
--hostfile /app/hosts.txt \
./hpl-mxp.sh \
--n 530000 --nb 2048 --nprow 4 --npcol 4 --nporder row --gpu-affinity 0:1:2:3:4:5:6:7
Tuning would be required to achieve the best performance on your machine.
All software other than system-level drivers and kernel modules is included in the container.
NVIDIA HPC-Benchmarks 25.04 includes:
- Sample files such as HPL dat files.
- HPL, HPL-MxP, HPCG, STREAM
- NCCL, NVSHMEM, GDR Copy (Library)
- NVIDIA Optimized Frameworks 25.01
- including: CUDA, cuBLAS, cuDNN, cuTENSOR, DALI, NCCL, TensorRT, rdma-core, NVIDIA HPC-X (OpenMPI, UCX), Nsight Compute, Nsight Systems, and more (by searching 25.01).
So basically all software other than those listed in the Sample Environment is included in the container.
For a sanity check:
enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${USER}-hpc-benchmarks-25-04
# in the container
ucx_info -v
ompi_info | grep "MPI extensions"
# ...
# Ctrl+D to exit the container
You can see that both UCX and OpenMPI are built with CUDA support, even though you may not have installed UCX, OpenMPI, or even CUDA on the host OS.
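As an extra check (a sketch; the output format can vary between Open MPI versions), CUDA support can be queried explicitly via an MCA parameter inside the container:
# Inside the container: should report "true" for CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value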
To the best of my knowledge, this Enroot multi-node setup (or hack) was first introduced by @3XX0 in this issue.
Aside from the normal single-node Enroot setup, there are four major points in the multi-node setup:
- Setting ENROOT_DATA_PATH to a NFS shared directory in /etc/enroot/enroot.conf.
  This path is used to store the container file system (unpacked by enroot create). Setting it to a NFS shared directory ensures that the container file system is visible (via enroot list) on all nodes once it has been created. Without this option, users need to manually run enroot create on each node, which is tedious and error-prone. Executing enroot remove will delete the container file system from this path. (Reference)
- Setting ENROOT_MOUNT_HOME to yes in /etc/enroot/enroot.conf.
  Mounting the home directory allows the container to access the ~/.ssh folder. This is necessary for MPI (mpirun) to automatically use password-less SSH authentication to launch orted processes on all nodes. (Reference)
- Setting OMPI_MCA_orte_launch_agent to enroot start ... orted.
  Setting the OMPI_MCA_orte_launch_agent environment variable is a common trick to make mpirun launch the orted process within an (Enroot/Singularity) container. Basically, it tells mpirun to run enroot start ... orted instead of running orted directly.
- Adding an (executable) hook for OpenMPI in /etc/enroot/hooks.d/ompi.sh.
  This hook removes the need to manually set the OMPI_MCA_orte_launch_agent environment variable every time you run a task via enroot start. In our case, without this hook, every run would look like:
  enroot start -e OMPI_MCA_orte_launch_agent='enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${CONTAINER_NAME} orted' --rw --mount /opt/enroot/workspace/${USER}:/app ${CONTAINER_NAME} mpirun -np 16 --hostfile hosts.txt ...
  Adding the following pre-start hook script:
  #!/bin/bash
  echo "OMPI_MCA_orte_launch_agent=enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${ENROOT_ROOTFS##*/} orted" >> "${ENROOT_ENVIRON}"
  simplifies the command to:
  enroot start --rw --mount /opt/enroot/workspace/${USER}:/app ${CONTAINER_NAME} mpirun -np 16 --hostfile hosts.txt ...
  which makes life easier. (Reference)
Running mpirun ... enroot start ... (i.e., launching each rank in its own container from the host) may prevent intra-node optimizations, resulting in worse performance. In addition, using the OpenMPI inside the Enroot container makes life easier, as we don't even need to install OpenMPI on any node.
On the host, run the following:
- Check NVIDIA Driver
  nvidia-smi
- Check nv_peer_mem
  lsmod | grep nv_peer_mem
- Check nvidia_peermem
  lsmod | grep nvidia_peermem
- Check GDR Copy (Driver)
  lsmod | grep gdrdrv
- This approach is less robust compared to using Pyxis and Slurm.
- Requires specifying fixed Enroot flags in the pre-start hook.
This note has been made possible through the support of ElsaLab and NVIDIA AI Technology Center (NVAITC).
Thanks to ElsaLab HPC Study Group and especially Kuan-Hsun Tu for environment setup and discussion on multi-user support.
And of course, thanks to @3XX0 for sharing this workaround.