[Slurm] Mount storage at provision time so FUSE mounts survive proctrack/cgroup by kevinmingtarja · Pull Request #9953 · skypilot-org/skypilot

kevinmingtarja · 2026-06-25T20:49:10Z

Summary

On Slurm clusters using ProctrackType=proctrack/cgroup, storage MOUNT (goofys/gcsfuse/rclone) was mounted from an ephemeral srun step and then silently died — cgroup tears down the step's process tree on exit, killing the FUSE daemon and leaving a stale Transport endpoint is not connected mount. (On proctrack/linuxproc it happened to survive.)

This mounts storage at provision time, from the persistent batch job, so the daemon lives for the cluster's lifetime:

A Slurm template_override hook builds the per-store mount commands and gathers the cloud-credential files, passing them through the cluster config as extra vars (no template swap).
run_instances relays the credentials onto the cluster and runs the mounts on all nodes from a persistent srun step inside the batch job.
New ProvisionRuntimeMetadata.storage_mounts_synced lets the backend skip the runtime storage-mount step when the provisioner already mounted.
The per-store mount-command generation is factored into a shared helper so the provision-time and runtime paths stay in sync.
The mount scratch-script path now uses mktemp so concurrent multi-node mounts on a shared filesystem don't race on the same file.

Scope: non-container clusters; container clusters fall back to the existing runtime path (tracked separately).

Test plan

Unit tests for the mount-command helper and the template_override hook (mount/skip/container cases), plus the new metadata flag.
Manual end-to-end on a proctrack/cgroup Slurm cluster: launched single-node and multi-node tasks with an S3 MOUNT, confirmed the run task reads bucket contents through the mount, the backend skips the runtime mount, and sky down tears the daemon down cleanly on every node.

🤖 Generated with Claude Code

https://claude.ai/code/session_0171VFjTPgydnyWw9NBj5Z6s

…roup On Slurm clusters with ProctrackType=proctrack/cgroup, storage MOUNT (goofys/gcsfuse/rclone) was mounted from an ephemeral srun step and then silently died: the FUSE daemon was killed when that step exited, leaving a stale "Transport endpoint is not connected" mount. Mount storage at provision time instead, from the persistent batch job, so the daemon lives for the cluster's lifetime: - Add a Slurm template_override hook that builds the per-store mount commands and gathers cloud-credential files, threading them through the cluster config (no template swap; just extra vars). - run_instances relays the credentials onto the cluster and runs the mounts on all nodes from a persistent srun step inside the batch job. - Add ProvisionRuntimeMetadata.storage_mounts_synced so the backend skips the runtime storage-mount step when the provisioner already mounted. - Factor the per-store mount-command generation into a shared helper so the provision-time and runtime paths stay in sync. - Use mktemp for the mount scratch-script path so concurrent multi-node mounts on a shared filesystem don't race on the same file. Non-container clusters only for now; container clusters fall back to the runtime path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0171VFjTPgydnyWw9NBj5Z6s

gemini-code-assist

Code Review

This pull request bakes storage mounting into the provisioning phase for Slurm to ensure FUSE daemons survive ephemeral step teardowns. It introduces a shared helper to resolve mount commands, relays cloud credentials to the shared cluster home, and uses mktemp to avoid script name collisions on shared filesystems. The review feedback is highly valuable and identifies two key issues: potential shell syntax errors due to unquoted paths in generated commands alongside NFS directory caching delays when checking mount status, and a resource leak where temporary mount scripts are not cleaned up upon failure. Implementing the suggested shell quoting, NFS cache invalidation, and subshell EXIT traps will make the mounting process significantly more robust.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-25T20:52:01Z

+        mount_inner = (f'cd {sky_cluster_home_dir} && export HOME="$PWD"\n'
+                       f'set -e\n'
+                       f'{storage_mounts_setup}\n'
+                       f'touch {storage_mount_done_dir}/$SLURM_PROCID\n'
+                       f'exec sleep infinity')
+        label = '--label ' if num_nodes > 1 else ''
+        storage_mount_block = (
+            f'echo "[storage] waiting for credential relay..."\n'
+            f'while [ ! -f {creds_ready_signal} ]; do sleep 0.5; done\n'
+            f'rm -rf {storage_mount_done_dir} && mkdir -p '
+            f'{storage_mount_done_dir}\n'
+            f'echo "[storage] mounting on {num_nodes} node(s)..."\n'
+            f'srun --overlap {label}--unbuffered --nodes={num_nodes} '
+            f'--ntasks-per-node=1 bash -c {shlex.quote(mount_inner)} &\n'
+            f'STORAGE_MOUNT_PID=$!\n'
+            f'while true; do\n'
+            f'  ready=$(ls -1 {storage_mount_done_dir} 2>/dev/null | wc -l)\n'
+            f'  if [ "$ready" -ge "{num_nodes}" ]; then break; fi\n'
+            f'  if ! kill -0 $STORAGE_MOUNT_PID 2>/dev/null; then\n'
+            f'    echo "[storage] mount step exited before mounting"\n'
+            f'    wait $STORAGE_MOUNT_PID; exit 1\n'
+            f'  fi\n'
+            f'  sleep 1\n'
+            f'done\n'
+            f'echo "[storage] mounted on all nodes"')


There are two issues in this block:

Shell Quoting: Paths like sky_cluster_home_dir, storage_mount_done_dir, and creds_ready_signal are unquoted in several generated shell commands. If any of these paths contain spaces or special characters (e.g., if the user's home directory or custom workdir has spaces), the generated shell commands will fail with syntax or execution errors.

NFS Directory Cache Invalidation: In Slurm clusters, the home directory is typically hosted on a shared filesystem like NFS. NFS clients cache directory attributes by default (up to acdirmin, which is often 30 seconds). When compute nodes touch files in storage_mount_done_dir, the head node running the sbatch script might not see them immediately due to this caching, causing a significant delay or even a timeout. Forcing directory cache invalidation by running touch "{storage_mount_done_dir}" on the sbatch node before checking ls ensures the latest directory entries are fetched immediately.

mount_inner = (f'cd "{sky_cluster_home_dir}" && export HOME="$PWD"\n' f'set -e\n' f'{storage_mounts_setup}\n' f'touch "{storage_mount_done_dir}/$SLURM_PROCID"\n' f'exec sleep infinity') label = '--label ' if num_nodes > 1 else '' storage_mount_block = ( f'echo "[storage] waiting for credential relay..."\n' f'while [ ! -f "{creds_ready_signal}" ]; do sleep 0.5; done\n' f'rm -rf "{storage_mount_done_dir}" && mkdir -p ' f'"{storage_mount_done_dir}"\n' f'echo "[storage] mounting on {num_nodes} node(s)..."\n' f'srun --overlap {label}--unbuffered --nodes={num_nodes} ' f'--ntasks-per-node=1 bash -c {shlex.quote(mount_inner)} &\n' f'STORAGE_MOUNT_PID=$!\n' f'while true; do\n' f' # Force NFS directory cache invalidation so we don\'t wait up to acdirmin (30s) to see files created by other nodes\n' f' touch "{storage_mount_done_dir}" 2>/dev/null\n' f' ready=$(ls -1 "{storage_mount_done_dir}" 2>/dev/null | wc -l)\n' f' if [ "$ready" -ge "{num_nodes}" ]; then break; fi\n' f' if ! kill -0 $STORAGE_MOUNT_PID 2>/dev/null; then\n' f' echo "[storage] mount step exited before mounting"\n' f' wait $STORAGE_MOUNT_PID; exit 1\n' f' fi\n' f' sleep 1\n' f'done\n' f'echo "[storage] mounted on all nodes"')

gemini-code-assist · 2026-06-25T20:52:01Z

+    command = ('mount_script=$(mktemp ~/.sky/mount_XXXXXX.sh 2>/dev/null || '
+               'mktemp -t sky_mount_XXXXXX.sh) && '
+               f'echo {shlex.quote(script)} > "$mount_script" && '
+               'chmod +x "$mount_script" && '
+               'bash "$mount_script" && '
+               'rm "$mount_script"')


If the mounting script fails or is interrupted, the temporary script file created by mktemp will not be deleted because the rm command is chained with && and will be skipped. This can lead to leftover temporary files in ~/.sky or /tmp.

Using a subshell with an EXIT trap guarantees that the temporary file is cleaned up under all exit conditions (success, failure, or interruption) while correctly preserving the exit status of the mounting script.

Suggested change

command = ('mount_script=$(mktemp ~/.sky/mount_XXXXXX.sh 2>/dev/null || '

'mktemp -t sky_mount_XXXXXX.sh) && '

f'echo {shlex.quote(script)} > "$mount_script" && '

'chmod +x "$mount_script" && '

'bash "$mount_script" && '

'rm "$mount_script"')

command = (

'('

'mount_script=$(mktemp ~/.sky/mount_XXXXXX.sh 2>/dev/null || mktemp -t sky_mount_XXXXXX.sh) && '

'trap \'rm -f "$mount_script"\' EXIT && '

f'echo {shlex.quote(script)} > "$mount_script" && '

'chmod +x "$mount_script" && '

'bash "$mount_script"'

')'

)

gemini-code-assist Bot reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Slurm] Mount storage at provision time so FUSE mounts survive proctrack/cgroup#9953

[Slurm] Mount storage at provision time so FUSE mounts survive proctrack/cgroup#9953
kevinmingtarja wants to merge 1 commit into
masterfrom
slurm-fuse-cgroup-fix

kevinmingtarja commented Jun 25, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

kevinmingtarja commented Jun 25, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant