Skip to content

[Slurm] Mount storage at provision time so FUSE mounts survive proctrack/cgroup#9953

Draft
kevinmingtarja wants to merge 1 commit into
masterfrom
slurm-fuse-cgroup-fix
Draft

[Slurm] Mount storage at provision time so FUSE mounts survive proctrack/cgroup#9953
kevinmingtarja wants to merge 1 commit into
masterfrom
slurm-fuse-cgroup-fix

Conversation

@kevinmingtarja

Copy link
Copy Markdown
Collaborator

Summary

On Slurm clusters using ProctrackType=proctrack/cgroup, storage MOUNT (goofys/gcsfuse/rclone) was mounted from an ephemeral srun step and then silently died — cgroup tears down the step's process tree on exit, killing the FUSE daemon and leaving a stale Transport endpoint is not connected mount. (On proctrack/linuxproc it happened to survive.)

This mounts storage at provision time, from the persistent batch job, so the daemon lives for the cluster's lifetime:

  • A Slurm template_override hook builds the per-store mount commands and gathers the cloud-credential files, passing them through the cluster config as extra vars (no template swap).
  • run_instances relays the credentials onto the cluster and runs the mounts on all nodes from a persistent srun step inside the batch job.
  • New ProvisionRuntimeMetadata.storage_mounts_synced lets the backend skip the runtime storage-mount step when the provisioner already mounted.
  • The per-store mount-command generation is factored into a shared helper so the provision-time and runtime paths stay in sync.
  • The mount scratch-script path now uses mktemp so concurrent multi-node mounts on a shared filesystem don't race on the same file.

Scope: non-container clusters; container clusters fall back to the existing runtime path (tracked separately).

Test plan

  • Unit tests for the mount-command helper and the template_override hook (mount/skip/container cases), plus the new metadata flag.
  • Manual end-to-end on a proctrack/cgroup Slurm cluster: launched single-node and multi-node tasks with an S3 MOUNT, confirmed the run task reads bucket contents through the mount, the backend skips the runtime mount, and sky down tears the daemon down cleanly on every node.

🤖 Generated with Claude Code

https://claude.ai/code/session_0171VFjTPgydnyWw9NBj5Z6s

…roup

On Slurm clusters with ProctrackType=proctrack/cgroup, storage MOUNT
(goofys/gcsfuse/rclone) was mounted from an ephemeral srun step and then
silently died: the FUSE daemon was killed when that step exited, leaving a
stale "Transport endpoint is not connected" mount.

Mount storage at provision time instead, from the persistent batch job, so
the daemon lives for the cluster's lifetime:

- Add a Slurm template_override hook that builds the per-store mount commands
  and gathers cloud-credential files, threading them through the cluster
  config (no template swap; just extra vars).
- run_instances relays the credentials onto the cluster and runs the mounts on
  all nodes from a persistent srun step inside the batch job.
- Add ProvisionRuntimeMetadata.storage_mounts_synced so the backend skips the
  runtime storage-mount step when the provisioner already mounted.
- Factor the per-store mount-command generation into a shared helper so the
  provision-time and runtime paths stay in sync.
- Use mktemp for the mount scratch-script path so concurrent multi-node mounts
  on a shared filesystem don't race on the same file.

Non-container clusters only for now; container clusters fall back to the
runtime path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0171VFjTPgydnyWw9NBj5Z6s

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request bakes storage mounting into the provisioning phase for Slurm to ensure FUSE daemons survive ephemeral step teardowns. It introduces a shared helper to resolve mount commands, relays cloud credentials to the shared cluster home, and uses mktemp to avoid script name collisions on shared filesystems. The review feedback is highly valuable and identifies two key issues: potential shell syntax errors due to unquoted paths in generated commands alongside NFS directory caching delays when checking mount status, and a resource leak where temporary mount scripts are not cleaned up upon failure. Implementing the suggested shell quoting, NFS cache invalidation, and subshell EXIT traps will make the mounting process significantly more robust.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +642 to +666
mount_inner = (f'cd {sky_cluster_home_dir} && export HOME="$PWD"\n'
f'set -e\n'
f'{storage_mounts_setup}\n'
f'touch {storage_mount_done_dir}/$SLURM_PROCID\n'
f'exec sleep infinity')
label = '--label ' if num_nodes > 1 else ''
storage_mount_block = (
f'echo "[storage] waiting for credential relay..."\n'
f'while [ ! -f {creds_ready_signal} ]; do sleep 0.5; done\n'
f'rm -rf {storage_mount_done_dir} && mkdir -p '
f'{storage_mount_done_dir}\n'
f'echo "[storage] mounting on {num_nodes} node(s)..."\n'
f'srun --overlap {label}--unbuffered --nodes={num_nodes} '
f'--ntasks-per-node=1 bash -c {shlex.quote(mount_inner)} &\n'
f'STORAGE_MOUNT_PID=$!\n'
f'while true; do\n'
f' ready=$(ls -1 {storage_mount_done_dir} 2>/dev/null | wc -l)\n'
f' if [ "$ready" -ge "{num_nodes}" ]; then break; fi\n'
f' if ! kill -0 $STORAGE_MOUNT_PID 2>/dev/null; then\n'
f' echo "[storage] mount step exited before mounting"\n'
f' wait $STORAGE_MOUNT_PID; exit 1\n'
f' fi\n'
f' sleep 1\n'
f'done\n'
f'echo "[storage] mounted on all nodes"')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two issues in this block:

  1. Shell Quoting: Paths like sky_cluster_home_dir, storage_mount_done_dir, and creds_ready_signal are unquoted in several generated shell commands. If any of these paths contain spaces or special characters (e.g., if the user's home directory or custom workdir has spaces), the generated shell commands will fail with syntax or execution errors.
  2. NFS Directory Cache Invalidation: In Slurm clusters, the home directory is typically hosted on a shared filesystem like NFS. NFS clients cache directory attributes by default (up to acdirmin, which is often 30 seconds). When compute nodes touch files in storage_mount_done_dir, the head node running the sbatch script might not see them immediately due to this caching, causing a significant delay or even a timeout. Forcing directory cache invalidation by running touch "{storage_mount_done_dir}" on the sbatch node before checking ls ensures the latest directory entries are fetched immediately.
        mount_inner = (f'cd "{sky_cluster_home_dir}" && export HOME="$PWD"\n'
                       f'set -e\n'
                       f'{storage_mounts_setup}\n'
                       f'touch "{storage_mount_done_dir}/$SLURM_PROCID"\n'
                       f'exec sleep infinity')
        label = '--label ' if num_nodes > 1 else ''
        storage_mount_block = (
            f'echo "[storage] waiting for credential relay..."\n'
            f'while [ ! -f "{creds_ready_signal}" ]; do sleep 0.5; done\n'
            f'rm -rf "{storage_mount_done_dir}" && mkdir -p '
            f'"{storage_mount_done_dir}"\n'
            f'echo "[storage] mounting on {num_nodes} node(s)..."\n'
            f'srun --overlap {label}--unbuffered --nodes={num_nodes} '
            f'--ntasks-per-node=1 bash -c {shlex.quote(mount_inner)} &\n'
            f'STORAGE_MOUNT_PID=$!\n'
            f'while true; do\n'
            f'  # Force NFS directory cache invalidation so we don\'t wait up to acdirmin (30s) to see files created by other nodes\n'
            f'  touch "{storage_mount_done_dir}" 2>/dev/null\n'
            f'  ready=$(ls -1 "{storage_mount_done_dir}" 2>/dev/null | wc -l)\n'
            f'  if [ "$ready" -ge "{num_nodes}" ]; then break; fi\n'
            f'  if ! kill -0 $STORAGE_MOUNT_PID 2>/dev/null; then\n'
            f'    echo "[storage] mount step exited before mounting"\n'
            f'    wait $STORAGE_MOUNT_PID; exit 1\n'
            f'  fi\n'
            f'  sleep 1\n'
            f'done\n'
            f'echo "[storage] mounted on all nodes"')

Comment on lines +1053 to +1058
command = ('mount_script=$(mktemp ~/.sky/mount_XXXXXX.sh 2>/dev/null || '
'mktemp -t sky_mount_XXXXXX.sh) && '
f'echo {shlex.quote(script)} > "$mount_script" && '
'chmod +x "$mount_script" && '
'bash "$mount_script" && '
'rm "$mount_script"')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the mounting script fails or is interrupted, the temporary script file created by mktemp will not be deleted because the rm command is chained with && and will be skipped. This can lead to leftover temporary files in ~/.sky or /tmp.

Using a subshell with an EXIT trap guarantees that the temporary file is cleaned up under all exit conditions (success, failure, or interruption) while correctly preserving the exit status of the mounting script.

Suggested change
command = ('mount_script=$(mktemp ~/.sky/mount_XXXXXX.sh 2>/dev/null || '
'mktemp -t sky_mount_XXXXXX.sh) && '
f'echo {shlex.quote(script)} > "$mount_script" && '
'chmod +x "$mount_script" && '
'bash "$mount_script" && '
'rm "$mount_script"')
command = (
'('
'mount_script=$(mktemp ~/.sky/mount_XXXXXX.sh 2>/dev/null || mktemp -t sky_mount_XXXXXX.sh) && '
'trap \'rm -f "$mount_script"\' EXIT && '
f'echo {shlex.quote(script)} > "$mount_script" && '
'chmod +x "$mount_script" && '
'bash "$mount_script"'
')'
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant