[Slurm] Mount storage at provision time so FUSE mounts survive proctrack/cgroup#9953
kevinmingtarja wants to merge 1 commit into
…roup

On Slurm clusters with ProctrackType=proctrack/cgroup, storage MOUNT (goofys/gcsfuse/rclone) was mounted from an ephemeral srun step and then silently died: the FUSE daemon was killed when that step exited, leaving a stale "Transport endpoint is not connected" mount.

Mount storage at provision time instead, from the persistent batch job, so the daemon lives for the cluster's lifetime:

- Add a Slurm template_override hook that builds the per-store mount commands and gathers cloud-credential files, threading them through the cluster config (no template swap; just extra vars).
- run_instances relays the credentials onto the cluster and runs the mounts on all nodes from a persistent srun step inside the batch job.
- Add ProvisionRuntimeMetadata.storage_mounts_synced so the backend skips the runtime storage-mount step when the provisioner already mounted.
- Factor the per-store mount-command generation into a shared helper so the provision-time and runtime paths stay in sync.
- Use mktemp for the mount scratch-script path so concurrent multi-node mounts on a shared filesystem don't race on the same file.

Non-container clusters only for now; container clusters fall back to the runtime path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0171VFjTPgydnyWw9NBj5Z6s
Code Review
This pull request moves storage mounting into the provisioning phase for Slurm so that FUSE daemons survive ephemeral step teardowns. It introduces a shared helper to resolve mount commands, relays cloud credentials to the shared cluster home, and uses mktemp to avoid script name collisions on shared filesystems. The review raises two points. First, the mount block leaves paths unquoted in generated shell commands, which can cause shell syntax errors, and its mount-status check can be delayed by NFS directory caching. Second, temporary mount scripts are leaked because they are not cleaned up on failure. The suggested shell quoting, NFS cache invalidation, and subshell EXIT trap would make the mounting process more robust.
mount_inner = (f'cd {sky_cluster_home_dir} && export HOME="$PWD"\n'
               f'set -e\n'
               f'{storage_mounts_setup}\n'
               f'touch {storage_mount_done_dir}/$SLURM_PROCID\n'
               f'exec sleep infinity')
label = '--label ' if num_nodes > 1 else ''
storage_mount_block = (
    f'echo "[storage] waiting for credential relay..."\n'
    f'while [ ! -f {creds_ready_signal} ]; do sleep 0.5; done\n'
    f'rm -rf {storage_mount_done_dir} && mkdir -p '
    f'{storage_mount_done_dir}\n'
    f'echo "[storage] mounting on {num_nodes} node(s)..."\n'
    f'srun --overlap {label}--unbuffered --nodes={num_nodes} '
    f'--ntasks-per-node=1 bash -c {shlex.quote(mount_inner)} &\n'
    f'STORAGE_MOUNT_PID=$!\n'
    f'while true; do\n'
    f' ready=$(ls -1 {storage_mount_done_dir} 2>/dev/null | wc -l)\n'
    f' if [ "$ready" -ge "{num_nodes}" ]; then break; fi\n'
    f' if ! kill -0 $STORAGE_MOUNT_PID 2>/dev/null; then\n'
    f' echo "[storage] mount step exited before mounting"\n'
    f' wait $STORAGE_MOUNT_PID; exit 1\n'
    f' fi\n'
    f' sleep 1\n'
    f'done\n'
    f'echo "[storage] mounted on all nodes"')
There are two issues in this block:
- Shell Quoting: Paths like sky_cluster_home_dir, storage_mount_done_dir, and creds_ready_signal are unquoted in several generated shell commands. If any of these paths contain spaces or special characters (e.g., if the user's home directory or custom workdir has spaces), the generated shell commands will fail with syntax or execution errors.
- NFS Directory Cache Invalidation: In Slurm clusters, the home directory is typically hosted on a shared filesystem like NFS. NFS clients cache directory attributes by default (for at least acdirmin, which defaults to 30 seconds). When compute nodes touch files in storage_mount_done_dir, the head node running the sbatch script might not see them immediately due to this caching, causing a significant delay or even a timeout. Forcing directory cache invalidation by running touch "{storage_mount_done_dir}" on the sbatch node before checking ls ensures the latest directory entries are fetched immediately.
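The quoting concern can also be addressed with shlex.quote rather than hand-written double quotes, since double quotes still let the shell interpret characters such as $ or " inside a path. The following is a minimal sketch of that alternative, not the PR's code: build_done_marker_cmd and build_creds_wait_cmd are hypothetical helper names, and their arguments stand in for storage_mount_done_dir and creds_ready_signal.

```python
import shlex


def build_done_marker_cmd(done_dir: str) -> str:
    """Build a shell command that creates <done_dir>/$SLURM_PROCID.

    Hypothetical helper; `done_dir` stands in for storage_mount_done_dir.
    """
    quoted_dir = shlex.quote(done_dir)
    # shlex.quote makes the directory a single literal shell word, so spaces,
    # quotes and `$` in it are not interpreted. "$SLURM_PROCID" is left
    # outside that quoting so the shell still expands it on each node.
    return f'mkdir -p {quoted_dir} && touch {quoted_dir}/"$SLURM_PROCID"'


def build_creds_wait_cmd(signal_path: str) -> str:
    """Build a shell loop that waits for the credential-relay signal file.

    Hypothetical helper; `signal_path` stands in for creds_ready_signal.
    """
    return f'while [ ! -f {shlex.quote(signal_path)} ]; do sleep 0.5; done'
```

Calling build_done_marker_cmd('/home/my user/done') yields a command in which the directory is one shell word even though it contains a space, while the per-node variable is still expanded at run time.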
mount_inner = (f'cd "{sky_cluster_home_dir}" && export HOME="$PWD"\n'
               f'set -e\n'
               f'{storage_mounts_setup}\n'
               f'touch "{storage_mount_done_dir}/$SLURM_PROCID"\n'
               f'exec sleep infinity')
label = '--label ' if num_nodes > 1 else ''
storage_mount_block = (
    f'echo "[storage] waiting for credential relay..."\n'
    f'while [ ! -f "{creds_ready_signal}" ]; do sleep 0.5; done\n'
    f'rm -rf "{storage_mount_done_dir}" && mkdir -p '
    f'"{storage_mount_done_dir}"\n'
    f'echo "[storage] mounting on {num_nodes} node(s)..."\n'
    f'srun --overlap {label}--unbuffered --nodes={num_nodes} '
    f'--ntasks-per-node=1 bash -c {shlex.quote(mount_inner)} &\n'
    f'STORAGE_MOUNT_PID=$!\n'
    f'while true; do\n'
    f' # Force NFS directory cache invalidation so we don\'t wait up to acdirmin (30s) to see files created by other nodes\n'
    f' touch "{storage_mount_done_dir}" 2>/dev/null\n'
    f' ready=$(ls -1 "{storage_mount_done_dir}" 2>/dev/null | wc -l)\n'
    f' if [ "$ready" -ge "{num_nodes}" ]; then break; fi\n'
    f' if ! kill -0 $STORAGE_MOUNT_PID 2>/dev/null; then\n'
    f' echo "[storage] mount step exited before mounting"\n'
    f' wait $STORAGE_MOUNT_PID; exit 1\n'
    f' fi\n'
    f' sleep 1\n'
    f'done\n'
    f'echo "[storage] mounted on all nodes"')

command = ('mount_script=$(mktemp ~/.sky/mount_XXXXXX.sh 2>/dev/null || '
           'mktemp -t sky_mount_XXXXXX.sh) && '
           f'echo {shlex.quote(script)} > "$mount_script" && '
           'chmod +x "$mount_script" && '
           'bash "$mount_script" && '
           'rm "$mount_script"')
If the mounting script fails or is interrupted, the temporary script file created by mktemp will not be deleted because the rm command is chained with && and will be skipped. This can lead to leftover temporary files in ~/.sky or /tmp.
Using a subshell with an EXIT trap guarantees that the temporary file is cleaned up under all exit conditions (success, failure, or interruption) while correctly preserving the exit status of the mounting script.
command = (
    '('
    'mount_script=$(mktemp ~/.sky/mount_XXXXXX.sh 2>/dev/null || mktemp -t sky_mount_XXXXXX.sh) && '
    'trap \'rm -f "$mount_script"\' EXIT && '
    f'echo {shlex.quote(script)} > "$mount_script" && '
    'chmod +x "$mount_script" && '
    'bash "$mount_script"'
    ')'
)
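To check the cleanup behaviour described in this comment, the subshell-plus-trap pattern can be exercised on its own. The sketch below is an illustration under assumptions, not the PR's code: build_mount_command and run_mount are hypothetical names, a plain mktemp replaces the ~/.sky template for portability, and the command is run with bash explicitly.

```python
import shlex
import subprocess


def build_mount_command(script: str) -> str:
    """Wrap `script` so its scratch file is removed however the subshell exits.

    Hypothetical helper mirroring the suggested pattern.
    """
    return (
        '('
        # Simplification: plain mktemp instead of the ~/.sky template.
        'mount_script=$(mktemp) && '
        # The EXIT trap fires when the subshell ends, on success or failure.
        'trap \'rm -f "$mount_script"\' EXIT && '
        f'echo {shlex.quote(script)} > "$mount_script" && '
        # Last command in the chain: its status becomes the subshell's status.
        'bash "$mount_script"'
        ')'
    )


def run_mount(script: str) -> int:
    """Run the wrapped script under bash and return its exit status."""
    return subprocess.run(['bash', '-c', build_mount_command(script)]).returncode
```

Because the trap is set inside the subshell, a failing script still has its scratch file removed, and the caller still sees the script's own exit status rather than that of rm.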
Summary
On Slurm clusters using ProctrackType=proctrack/cgroup, storage MOUNT (goofys/gcsfuse/rclone) was mounted from an ephemeral srun step and then silently died: cgroup tears down the step's process tree on exit, killing the FUSE daemon and leaving a stale "Transport endpoint is not connected" mount. (On proctrack/linuxproc it happened to survive.)

This mounts storage at provision time, from the persistent batch job, so the daemon lives for the cluster's lifetime:

- template_override hook builds the per-store mount commands and gathers the cloud-credential files, passing them through the cluster config as extra vars (no template swap).
- run_instances relays the credentials onto the cluster and runs the mounts on all nodes from a persistent srun step inside the batch job.
- ProvisionRuntimeMetadata.storage_mounts_synced lets the backend skip the runtime storage-mount step when the provisioner already mounted.
- mktemp is used for the mount scratch-script path so concurrent multi-node mounts on a shared filesystem don't race on the same file.

Scope: non-container clusters; container clusters fall back to the existing runtime path (tracked separately).
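The interaction between the provision-time mount and the runtime skip can be sketched as a small decision function. This is a hedged illustration only: RuntimeMetadataSketch is a hypothetical stand-in for ProvisionRuntimeMetadata that models just the storage_mounts_synced field, and provision and backend_should_mount are invented names, not functions from the PR.

```python
from dataclasses import dataclass


@dataclass
class RuntimeMetadataSketch:
    """Hypothetical stand-in for ProvisionRuntimeMetadata (one field only)."""
    # True when the provisioner already mounted storage at provision time.
    storage_mounts_synced: bool = False


def provision(has_storage_mounts: bool,
              is_container_cluster: bool) -> RuntimeMetadataSketch:
    """Mount at provision time only for non-container clusters with mounts."""
    mounted = has_storage_mounts and not is_container_cluster
    return RuntimeMetadataSketch(storage_mounts_synced=mounted)


def backend_should_mount(metadata: RuntimeMetadataSketch,
                         has_storage_mounts: bool) -> bool:
    """Run the runtime mount step only if the provisioner did not mount."""
    return has_storage_mounts and not metadata.storage_mounts_synced
```

Under these assumptions a container cluster leaves the flag unset, so the backend falls back to the runtime mount path, matching the scope stated above.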
Test plan
- template_override hook (mount/skip/container cases), plus the new metadata flag.
- proctrack/cgroup Slurm cluster: launched single-node and multi-node tasks with an S3 MOUNT, confirmed the run task reads bucket contents through the mount, the backend skips the runtime mount, and sky down tears the daemon down cleanly on every node.

🤖 Generated with Claude Code
https://claude.ai/code/session_0171VFjTPgydnyWw9NBj5Z6s