
Slurm controller hangs when launching many jobs #4972

@LeoGrin

Description

Describe the bug

Hello! I recently moved from gcluster version v1.45.1 to v1.73.1-1. Since the move, the Slurm controller often hangs for quite a long time, making commands like squeue stall for close to a minute. I was able to trace the delay to log entries like these:

[2025-12-09T18:08:03.200] Node mlslurmpri-a2spotnodeset-184 now responding
[2025-12-09T18:08:06.818] error: _xgetaddrinfo: getaddrinfo(mlslurmpri-a2spotnodeset-111:6818) failed: Name or service not known
[2025-12-09T18:08:06.818] error: slurm_set_addr: Unable to resolve "mlslurmpri-a2spotnodeset-111"
[2025-12-09T18:08:06.818] error: _thread_per_group_rpc: can't find address for host mlslurmpri-a2spotnodeset-111, check slurm.conf

Each of these failed lookups takes around 4 seconds. My understanding is that this happens when Slurm wants to kill a node whose associated GCP VM has already been deleted.
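
As a rough sanity check, timing a lookup for one of the stale node names on the controller should show a delay of the same magnitude (the hostname below is just taken from the log above; any node whose VM has already been deleted should behave the same):

# Time a single failed DNS lookup for a node whose VM no longer exists.
# With the glibc resolver's default timeout/attempts, a failing lookup can
# block for several seconds, in the same ballpark as the ~4 s per node above.
time getent hosts mlslurmpri-a2spotnodeset-111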

For now I seem to have worked around the issue by manually adding this line to /etc/resolv.conf on the controller node: options edns0 trust-ad timeout:1 attempts:1.
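
For reference, the workaround boils down to something like the following (with the caveat that on many images /etc/resolv.conf is generated by the DHCP client or another network manager, so a manual edit may be overwritten later):

# Shorten the resolver's failure path to a 1 s timeout and a single attempt;
# edns0 and trust-ad are simply kept as in the line described above.
echo 'options edns0 trust-ad timeout:1 attempts:1' | sudo tee -a /etc/resolv.conf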

Thanks a lot in advance for any insights you might have!

Steps to reproduce

Expected behavior

Slurm commands like squeue -u $USER never take more than a few seconds.

Actual behavior

Slurm commands like squeue -u $USER take close to a minute.

Version (gcluster --version)

gcluster version - not built from official release
Built from 'main' branch.
Commit info: v1.73.1-1-g526b97efb

Blueprint

Only kept one partition for simplicity:

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This is a modified version of the ml-slurm-v6 blueprint found in the cluster-toolkit repo examples (https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/ml-slurm.yaml)
---

blueprint_name: ml-slurm-a100spot-v1

vars:
  project_id:  ## Set project id here
  deployment_name: ml-slurm-a100spot
  region: europe-west4
  zone: europe-west4-b
  zones:
  - europe-west4-a
  - europe-west4-b
  - europe-west4-c
  new_image:
    family: ml-slurm
    project: $(vars.project_id)
  disk_size_gb: 50
  metadata:
    VmDnsSetting: GlobalOnly

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: ## Set your GCS bucket for terraform state here

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/pre-existing-vpc
    settings:
      network_name: ml-slurm-prior-net
      subnetwork_name: ml-slurm-prior-primary-subnet

  - id: homefs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: 10.178.0.2
      remote_mount: nfsshare
      local_mount: /home
      fs_type: nfs

  - id: script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: install-ml-libraries.sh
        content: |
          #!/bin/bash
          # this script is designed to execute on Slurm images published by SchedMD that:
          # - are based on Debian distribution of Linux
          # - have NVIDIA drivers pre-installed

          set -e -o pipefail

          # Remove bullseye-backports repository as it no longer has a Release file
          rm -f /etc/apt/sources.list.d/*backports*.list
          sed -i '/bullseye-backports/d' /etc/apt/sources.list

          # Install zsh and related packages
          apt-get update --allow-releaseinfo-change
          apt-get install -y zsh curl git

          # Install Oh My Zsh for all users (only if not already installed)
          if [ ! -d "/root/.oh-my-zsh" ]; then
            sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended
          fi

          # Set zsh as default shell for all new users
          sed -i 's|SHELL=/bin/sh|SHELL=/bin/zsh|' /etc/default/useradd

          echo "deb https://packages.cloud.google.com/apt google-fast-socket main" > /etc/apt/sources.list.d/google-fast-socket.list
          apt-get update --allow-releaseinfo-change
          apt-get install --assume-yes google-fast-socket

          # Install Docker
          apt-get install -y docker.io
          systemctl enable docker
          systemctl start docker

          # Configure Docker to use systemd cgroup driver
          mkdir -p /etc/docker
          echo '{"exec-opts": ["native.cgroupdriver=systemd"]}' > /etc/docker/daemon.json
          systemctl restart docker

          # Install nvtop via snap (not available in base Debian 11 repos)
          apt-get install -y snapd
          systemctl enable --now snapd apparmor
          snap install nvtop
          ln -sf /snap/bin/nvtop /usr/local/bin/nvtop

          # change default umask and uv cache
          echo "umask 0002" >> /etc/bash.bashrc
          echo "export UV_CACHE_DIR=/home/cache/.uv_cache" >> /etc/bash.bashrc

      - type: shell
        destination: configure-login-share-group.sh
        content: |
          #!/bin/bash
          # Configure share group for login node with PAM hook
          
          # Create share group if it doesn't exist
          getent group share > /dev/null || groupadd share
          
          # Create PAM hook script to add users to share group on SSH login
          cat > /usr/local/bin/add-user-to-share-group.sh << 'PAMSCRIPT'
          #!/bin/bash
          [ -n "$PAM_USER" ] && /usr/sbin/usermod -aG share "$PAM_USER" 2>/dev/null
          exit 0
          PAMSCRIPT
          chmod +x /usr/local/bin/add-user-to-share-group.sh
          
          # Add PAM hook to sshd if not already present
          if ! grep -q "add-user-to-share-group" /etc/pam.d/sshd; then
            echo "session optional pam_exec.so quiet /usr/local/bin/add-user-to-share-group.sh" >> /etc/pam.d/sshd
          fi
          
          # Add bashrc refresh logic for interactive shells
          if ! grep -q "_GROUPS_REFRESHED" /etc/bash.bashrc 2>/dev/null; then
            cat >> /etc/bash.bashrc << 'BASHRC'
          
          # Refresh group membership if user was added to 'share' but shell doesn't have it
          if [ -n "$USER" ] && [ -z "$_GROUPS_REFRESHED" ]; then
            if getent group share 2>/dev/null | grep -qw "$USER"; then
              if ! id -nG 2>/dev/null | grep -qw share; then
                export _GROUPS_REFRESHED=1
                exec sg share newgrp share
              fi
            fi
          fi
          BASHRC
          fi
          
          echo "Login node share group configuration complete"

  - id: node_startup_script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: mount-local-ssd.sh
        content: |
          #!/bin/bash
          echo "[mount-local-ssd] Script started" >> /var/log/mount-local-ssd.log
          if [ -b /dev/disk/by-id/google-local-nvme-ssd-0 ]; then
            echo "[mount-local-ssd] Found local SSD device" >> /var/log/mount-local-ssd.log
            if ! blkid /dev/disk/by-id/google-local-nvme-ssd-0; then
              echo "[mount-local-ssd] Formatting local SSD" >> /var/log/mount-local-ssd.log
              mkfs.ext4 -F /dev/disk/by-id/google-local-nvme-ssd-0
            fi
            mkdir -p /mnt/local-ssd
            mount /dev/disk/by-id/google-local-nvme-ssd-0 /mnt/local-ssd
            chmod 777 /mnt/local-ssd
            grep -q google-local-nvme-ssd-0 /etc/fstab || echo '/dev/disk/by-id/google-local-nvme-ssd-0 /mnt/local-ssd ext4 defaults,nofail 0 2' >> /etc/fstab
            echo "[mount-local-ssd] Mount complete" >> /var/log/mount-local-ssd.log
          else
            echo "[mount-local-ssd] No local SSD device found" >> /var/log/mount-local-ssd.log
          fi
      - type: shell
        destination: configure-sudoers.sh
        content: |
          #!/bin/bash
          # Grant passwordless sudo access to cluster users
          # Add your users here:
          # echo "username ALL=(ALL:ALL) NOPASSWD: ALL" > /etc/sudoers.d/cluster-users
          [ -f /etc/sudoers.d/cluster-users ] && chmod 440 /etc/sudoers.d/cluster-users
      - type: shell
        destination: configure-docker-group.sh
        content: |
          #!/bin/bash
          # Fix docker socket permissions if it exists
          [ -S /var/run/docker.sock ] && chmod 666 /var/run/docker.sock

      - type: shell
        destination: configure-custom-groups.sh
        content: |
          #!/bin/bash
          # Configure custom Unix groups for OS Login users
          # This script sets up group membership for both SSH and Slurm job access
          
          # Create custom groups if they don't exist
          if ! getent group share > /dev/null; then
            groupadd share
          fi
          
          # Create Slurm prolog script to add users to groups before jobs run
          cat > /usr/local/bin/slurm-prolog-groups.sh << 'PROLOG'
          #!/bin/bash
          # Slurm Prolog: Add job user to custom groups
          # SLURM_JOB_USER is set by Slurm
          if [ -n "$SLURM_JOB_USER" ]; then
            /usr/sbin/usermod -aG share "$SLURM_JOB_USER" 2>/dev/null
          fi
          PROLOG
          chmod +x /usr/local/bin/slurm-prolog-groups.sh
          
          # Add group refresh to bash.bashrc for interactive shells
          # This ensures users see their group membership in srun --pty bash, etc.
          if ! grep -q "_GROUPS_REFRESHED" /etc/bash.bashrc 2>/dev/null; then
            cat >> /etc/bash.bashrc << 'BASHRC'
          
          # Refresh group membership if user was added to 'share' but shell doesn't have it
          if [ -n "$USER" ] && [ -z "$_GROUPS_REFRESHED" ]; then
            if getent group share 2>/dev/null | grep -qw "$USER"; then
              if ! id -nG 2>/dev/null | grep -qw share; then
                export _GROUPS_REFRESHED=1
                exec sg share newgrp share
              fi
            fi
          fi
          BASHRC
          fi
          
          echo "Custom group configuration complete"
        
  # Bucket to store training data
  - id: dataset_bucket
    source: community/modules/file-system/cloud-storage-bucket
    settings:
      name_prefix: training-data
      use_deployment_name_in_bucket_name: false
      random_suffix: true
      region: $(vars.region)

- group: packer
  modules:
  - id: custom-image
    source: modules/packer/custom-image
    kind: packer
    use:
    - network
    - script
    settings:
      # give VM a public IP to ensure startup script can reach public internet
      omit_external_ip: false
      source_image_project_id: [schedmd-slurm-public]
      # see latest in https://github.com/GoogleCloudPlatform/slurm-gcp/blob/master/docs/images.md#published-image-family
      source_image_family: slurm-gcp-6-11-debian-12
      disk_size: $(vars.disk_size_gb)
      disk_type: pd-ssd
      image_family: $(vars.new_image.family)
      # building this image does not require a GPU-enabled VM
      machine_type: c2-standard-4
      state_timeout: 15m

- group: cluster
  modules:
  - id: examples
    source: modules/scripts/startup-script

  - id: controller_startup_script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: configure-interactive-partitions.sh
        content: |
          #!/bin/bash
          # Create a systemd service to set PowerDownOnIdle=NO for interactive partitions
          # This runs after slurmctld is fully started
          
          cat > /etc/systemd/system/slurm-interactive-config.service << 'EOF'
          [Unit]
          Description=Configure interactive partitions PowerDownOnIdle
          After=slurmctld.service
          Requires=slurmctld.service
          
          [Service]
          Type=oneshot
          ExecStartPre=/bin/sleep 30
          ExecStart=/bin/bash -c 'sinfo -h -o "%%P" | grep interactive | while read -r partition; do scontrol update PartitionName=$partition PowerDownOnIdle=NO; done'
          RemainAfterExit=yes
          
          [Install]
          WantedBy=multi-user.target
          EOF
          
          systemctl daemon-reload
          systemctl enable slurm-interactive-config.service

  # A100 Spot VM nodeset
  - id: a2_spot_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 200
      bandwidth_tier: gvnic_enabled
      machine_type: a2-highgpu-1g
      instance_image: $(vars.new_image)
      instance_image_custom: true
      startup_script: $(node_startup_script.startup_script)
      enable_spot_vm: true
      additional_disks:
      - disk_name: null
        device_name: local-ssd
        boot: false
        auto_delete: true
        disk_type: local-ssd
        disk_size_gb: 375
        disk_labels: {}

  # A100 Spot partition
  - id: gpu_a100_spot_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [a2_spot_nodeset]
    settings:
      partition_name: gpua100spot
      is_default: true

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      machine_type: c2-standard-16
      enable_login_public_ips: true
      instance_image: $(vars.new_image)
      instance_image_custom: true

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network
    - gpu_a100_spot_partition
    - homefs
    - slurm_login
    settings:
      machine_type: c2-standard-8
      enable_controller_public_ips: true
      instance_image: $(vars.new_image)
      instance_image_custom: true
      login_startup_script: $(script.startup_script)
      controller_startup_script: $(controller_startup_script.startup_script)

Expanded Blueprint

expanded.yaml

Output and logs


Screenshots

Execution environment

  • OS: [macOS, ubuntu, ...]
  • Shell (To find this, run ps -p $$): [bash, zsh, ...]
  • go version:

Additional context

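If shortening the resolver timeout turns out to be an acceptable mitigation, it could presumably be made persistent with one more startup-script runner on the controller, in the same style as configure-interactive-partitions.sh in the blueprint above. A minimal, hypothetical sketch (assuming nothing else on the image rewrites /etc/resolv.conf):

#!/bin/bash
# Hypothetical runner content: apply the shorter resolver timeout once, idempotently.
if ! grep -q 'timeout:1 attempts:1' /etc/resolv.conf; then
  echo 'options edns0 trust-ad timeout:1 attempts:1' >> /etc/resolv.conf
fi

That said, the underlying question is why slurmctld ends up trying to resolve hostnames of nodes whose VMs are already gone, and whether it should block on those lookups at all.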

Metadata

Labels: bug (Something isn't working)