
Slurm controller hangs when launching many jobs #4972

@LeoGrin

Description

Describe the bug

Hello! I recently moved from gcluster version v1.45.1 to v1.73.1-1. Since the move, the Slurm controller often hangs for quite a long time, making commands like squeue stall for close to a minute. I was able to trace the delay to log entries like these:

[2025-12-09T18:08:03.200] Node mlslurmpri-a2spotnodeset-184 now responding
[2025-12-09T18:08:06.818] error: _xgetaddrinfo: getaddrinfo(mlslurmpri-a2spotnodeset-111:6818) failed: Name or service not known
[2025-12-09T18:08:06.818] error: slurm_set_addr: Unable to resolve "mlslurmpri-a2spotnodeset-111"
[2025-12-09T18:08:06.818] error: _thread_per_group_rpc: can't find address for host mlslurmpri-a2spotnodeset-111, check slurm.conf

Each of these failed lookups takes around 4 seconds. My understanding is that this happens when Slurm wants to kill a node whose associated GCP VM has already been deleted.
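
As a rough sanity check, timing a lookup for one of the stale node names on the controller should show a delay of the same magnitude (the hostname below is just taken from the log above; any node whose VM has already been deleted should behave the same):

# Time a single failed DNS lookup for a node whose VM no longer exists.
# With the glibc resolver's default timeout/attempts, a failing lookup can
# block for several seconds, in the same ballpark as the ~4 s per node above.
time getent hosts mlslurmpri-a2spotnodeset-111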

For now I seem to have worked around the issue by manually adding this line to /etc/resolv.conf on the controller node: options edns0 trust-ad timeout:1 attempts:1.
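
For reference, the workaround boils down to something like the following (with the caveat that on many images /etc/resolv.conf is generated by the DHCP client or another network manager, so a manual edit may be overwritten later):

# Shorten the resolver's failure path to a 1 s timeout and a single attempt;
# edns0 and trust-ad are simply kept as in the line described above.
echo 'options edns0 trust-ad timeout:1 attempts:1' | sudo tee -a /etc/resolv.conf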

Thanks a lot in advance for any insights you might have!

Steps to reproduce

Expected behavior

Slurm commands like squeue -u $USER never take more than a few seconds.

Actual behavior

Slurm commands like squeue -u $USER take close to a minute.

Version (gcluster --version)

gcluster version - not built from official release
Built from 'main' branch.
Commit info: v1.73.1-1-g526b97efb

Blueprint

Only kept one partition for simplicity:

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This is a modified version of the ml-slurm-v6 blueprint found in the cluster-toolkit repo examples (https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/ml-slurm.yaml)
---

blueprint_name: ml-slurm-a100spot-v1

vars:
  project_id:  ## Set project id here
  deployment_name: ml-slurm-a100spot
  region: europe-west4
  zone: europe-west4-b
  zones:
  - europe-west4-a
  - europe-west4-b
  - europe-west4-c
  new_image:
    family: ml-slurm
    project: $(vars.project_id)
  disk_size_gb: 50
  metadata:
    VmDnsSetting: GlobalOnly

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: ## Set your GCS bucket for terraform state here

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/pre-existing-vpc
    settings:
      network_name: ml-slurm-prior-net
      subnetwork_name: ml-slurm-prior-primary-subnet

  - id: homefs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: 10.178.0.2
      remote_mount: nfsshare
      local_mount: /home
      fs_type: nfs

  - id: script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: install-ml-libraries.sh
        content: |
          #!/bin/bash
          # this script is designed to execute on Slurm images published by SchedMD that:
          # - are based on Debian distribution of Linux
          # - have NVIDIA drivers pre-installed

          set -e -o pipefail

          # Remove bullseye-backports repository as it no longer has a Release file
          rm -f /etc/apt/sources.list.d/*backports*.list
          sed -i '/bullseye-backports/d' /etc/apt/sources.list

          # Install zsh and related packages
          apt-get update --allow-releaseinfo-change
          apt-get install -y zsh curl git

          # Install Oh My Zsh for all users (only if not already installed)
          if [ ! -d "/root/.oh-my-zsh" ]; then
            sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended
          fi

          # Set zsh as default shell for all new users
          sed -i 's|SHELL=/bin/sh|SHELL=/bin/zsh|' /etc/default/useradd

          echo "deb https://packages.cloud.google.com/apt google-fast-socket main" > /etc/apt/sources.list.d/google-fast-socket.list
          apt-get update --allow-releaseinfo-change
          apt-get install --assume-yes google-fast-socket

          # Install Docker
          apt-get install -y docker.io
          systemctl enable docker
          systemctl start docker

          # Configure Docker to use systemd cgroup driver
          mkdir -p /etc/docker
          echo '{"exec-opts": ["native.cgroupdriver=systemd"]}' > /etc/docker/daemon.json
          systemctl restart docker

          # Install nvtop via snap (not available in base Debian 11 repos)
          apt-get install -y snapd
          systemctl enable --now snapd apparmor
          snap install nvtop
          ln -sf /snap/bin/nvtop /usr/local/bin/nvtop

          # change default umask and uv cache
          echo "umask 0002" >> /etc/bash.bashrc
          echo "export UV_CACHE_DIR=/home/cache/.uv_cache" >> /etc/bash.bashrc

      - type: shell
        destination: configure-login-share-group.sh
        content: |
          #!/bin/bash
          # Configure share group for login node with PAM hook
          
          # Create share group if it doesn't exist
          getent group share > /dev/null || groupadd share
          
          # Create PAM hook script to add users to share group on SSH login
          cat > /usr/local/bin/add-user-to-share-group.sh << 'PAMSCRIPT'
          #!/bin/bash
          [ -n "$PAM_USER" ] && /usr/sbin/usermod -aG share "$PAM_USER" 2>/dev/null
          exit 0
          PAMSCRIPT
          chmod +x /usr/local/bin/add-user-to-share-group.sh
          
          # Add PAM hook to sshd if not already present
          if ! grep -q "add-user-to-share-group" /etc/pam.d/sshd; then
            echo "session optional pam_exec.so quiet /usr/local/bin/add-user-to-share-group.sh" >> /etc/pam.d/sshd
          fi
          
          # Add bashrc refresh logic for interactive shells
          if ! grep -q "_GROUPS_REFRESHED" /etc/bash.bashrc 2>/dev/null; then
            cat >> /etc/bash.bashrc << 'BASHRC'
          
          # Refresh group membership if user was added to 'share' but shell doesn't have it
          if [ -n "$USER" ] && [ -z "$_GROUPS_REFRESHED" ]; then
            if getent group share 2>/dev/null | grep -qw "$USER"; then
              if ! id -nG 2>/dev/null | grep -qw share; then
                export _GROUPS_REFRESHED=1
                exec sg share newgrp share
              fi
            fi
          fi
          BASHRC
          fi
          
          echo "Login node share group configuration complete"

  - id: node_startup_script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: mount-local-ssd.sh
        content: |
          #!/bin/bash
          echo "[mount-local-ssd] Script started" >> /var/log/mount-local-ssd.log
          if [ -b /dev/disk/by-id/google-local-nvme-ssd-0 ]; then
            echo "[mount-local-ssd] Found local SSD device" >> /var/log/mount-local-ssd.log
            if ! blkid /dev/disk/by-id/google-local-nvme-ssd-0; then
              echo "[mount-local-ssd] Formatting local SSD" >> /var/log/mount-local-ssd.log
              mkfs.ext4 -F /dev/disk/by-id/google-local-nvme-ssd-0
            fi
            mkdir -p /mnt/local-ssd
            mount /dev/disk/by-id/google-local-nvme-ssd-0 /mnt/local-ssd
            chmod 777 /mnt/local-ssd
            grep -q google-local-nvme-ssd-0 /etc/fstab || echo '/dev/disk/by-id/google-local-nvme-ssd-0 /mnt/local-ssd ext4 defaults,nofail 0 2' >> /etc/fstab
            echo "[mount-local-ssd] Mount complete" >> /var/log/mount-local-ssd.log
          else
            echo "[mount-local-ssd] No local SSD device found" >> /var/log/mount-local-ssd.log
          fi
      - type: shell
        destination: configure-sudoers.sh
        content: |
          #!/bin/bash
          # Grant passwordless sudo access to cluster users
          # Add your users here:
          # echo "username ALL=(ALL:ALL) NOPASSWD: ALL" > /etc/sudoers.d/cluster-users
          [ -f /etc/sudoers.d/cluster-users ] && chmod 440 /etc/sudoers.d/cluster-users
      - type: shell
        destination: configure-docker-group.sh
        content: |
          #!/bin/bash
          # Fix docker socket permissions if it exists
          [ -S /var/run/docker.sock ] && chmod 666 /var/run/docker.sock

      - type: shell
        destination: configure-custom-groups.sh
        content: |
          #!/bin/bash
          # Configure custom Unix groups for OS Login users
          # This script sets up group membership for both SSH and Slurm job access
          
          # Create custom groups if they don't exist
          if ! getent group share > /dev/null; then
            groupadd share
          fi
          
          # Create Slurm prolog script to add users to groups before jobs run
          cat > /usr/local/bin/slurm-prolog-groups.sh << 'PROLOG'
          #!/bin/bash
          # Slurm Prolog: Add job user to custom groups
          # SLURM_JOB_USER is set by Slurm
          if [ -n "$SLURM_JOB_USER" ]; then
            /usr/sbin/usermod -aG share "$SLURM_JOB_USER" 2>/dev/null
          fi
          PROLOG
          chmod +x /usr/local/bin/slurm-prolog-groups.sh
          
          # Add group refresh to bash.bashrc for interactive shells
          # This ensures users see their group membership in srun --pty bash, etc.
          if ! grep -q "_GROUPS_REFRESHED" /etc/bash.bashrc 2>/dev/null; then
            cat >> /etc/bash.bashrc << 'BASHRC'
          
          # Refresh group membership if user was added to 'share' but shell doesn't have it
          if [ -n "$USER" ] && [ -z "$_GROUPS_REFRESHED" ]; then
            if getent group share 2>/dev/null | grep -qw "$USER"; then
              if ! id -nG 2>/dev/null | grep -qw share; then
                export _GROUPS_REFRESHED=1
                exec sg share newgrp share
              fi
            fi
          fi
          BASHRC
          fi
          
          echo "Custom group configuration complete"
        
  # Bucket to store training data
  - id: dataset_bucket
    source: community/modules/file-system/cloud-storage-bucket
    settings:
      name_prefix: training-data
      use_deployment_name_in_bucket_name: false
      random_suffix: true
      region: $(vars.region)

- group: packer
  modules:
  - id: custom-image
    source: modules/packer/custom-image
    kind: packer
    use:
    - network
    - script
    settings:
      # give VM a public IP to ensure startup script can reach public internet
      omit_external_ip: false
      source_image_project_id: [schedmd-slurm-public]
      # see latest in https://github.com/GoogleCloudPlatform/slurm-gcp/blob/master/docs/images.md#published-image-family
      source_image_family: slurm-gcp-6-11-debian-12
      disk_size: $(vars.disk_size_gb)
      disk_type: pd-ssd
      image_family: $(vars.new_image.family)
      # building this image does not require a GPU-enabled VM
      machine_type: c2-standard-4
      state_timeout: 15m

- group: cluster
  modules:
  - id: examples
    source: modules/scripts/startup-script

  - id: controller_startup_script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: configure-interactive-partitions.sh
        content: |
          #!/bin/bash
          # Create a systemd service to set PowerDownOnIdle=NO for interactive partitions
          # This runs after slurmctld is fully started
          
          cat > /etc/systemd/system/slurm-interactive-config.service << 'EOF'
          [Unit]
          Description=Configure interactive partitions PowerDownOnIdle
          After=slurmctld.service
          Requires=slurmctld.service
          
          [Service]
          Type=oneshot
          ExecStartPre=/bin/sleep 30
          ExecStart=/bin/bash -c 'sinfo -h -o "%%P" | grep interactive | while read -r partition; do scontrol update PartitionName=$partition PowerDownOnIdle=NO; done'
          RemainAfterExit=yes
          
          [Install]
          WantedBy=multi-user.target
          EOF
          
          systemctl daemon-reload
          systemctl enable slurm-interactive-config.service

  # A100 Spot VM nodeset
  - id: a2_spot_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 200
      bandwidth_tier: gvnic_enabled
      machine_type: a2-highgpu-1g
      instance_image: $(vars.new_image)
      instance_image_custom: true
      startup_script: $(node_startup_script.startup_script)
      enable_spot_vm: true
      additional_disks:
      - disk_name: null
        device_name: local-ssd
        boot: false
        auto_delete: true
        disk_type: local-ssd
        disk_size_gb: 375
        disk_labels: {}

  # A100 Spot partition
  - id: gpu_a100_spot_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [a2_spot_nodeset]
    settings:
      partition_name: gpua100spot
      is_default: true

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      machine_type: c2-standard-16
      enable_login_public_ips: true
      instance_image: $(vars.new_image)
      instance_image_custom: true

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network
    - gpu_a100_spot_partition
    - homefs
    - slurm_login
    settings:
      machine_type: c2-standard-8
      enable_controller_public_ips: true
      instance_image: $(vars.new_image)
      instance_image_custom: true
      login_startup_script: $(script.startup_script)
      controller_startup_script: $(controller_startup_script.startup_script)

Expanded Blueprint

expanded.yaml

Output and logs


Screenshots

Execution environment

  • OS: [macOS, ubuntu, ...]
  • Shell (To find this, run ps -p $$): [bash, zsh, ...]
  • go version:

Additional context

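If shortening the resolver timeout turns out to be an acceptable mitigation, it could presumably be made persistent with one more startup-script runner on the controller, in the same style as configure-interactive-partitions.sh in the blueprint above. A minimal, hypothetical sketch (assuming nothing else on the image rewrites /etc/resolv.conf):

#!/bin/bash
# Hypothetical runner content: apply the shorter resolver timeout once, idempotently.
if ! grep -q 'timeout:1 attempts:1' /etc/resolv.conf; then
  echo 'options edns0 trust-ad timeout:1 attempts:1' >> /etc/resolv.conf
fi

That said, the underlying question is why slurmctld ends up trying to resolve hostnames of nodes whose VMs are already gone, and whether it should block on those lookups at all.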

Metadata

Labels: bug (Something isn't working)