If you find a similar existing issue, please comment on that issue instead of creating a new one.
If you are submitting a feature request, please start a discussion instead of creating an issue.
Describe the bug
Hello! I recently moved from gcluster version v1.45.1 to v1.73.1-1. Since the move, the Slurm controller often hangs for quite a long time, making commands like squeue hang for close to a minute. I was able to trace the wait to log entries like these:
[2025-12-09T18:08:03.200] Node mlslurmpri-a2spotnodeset-184 now responding
[2025-12-09T18:08:06.818] error: _xgetaddrinfo: getaddrinfo(mlslurmpri-a2spotnodeset-111:6818) failed: Name or service not known
[2025-12-09T18:08:06.818] error: slurm_set_addr: Unable to resolve "mlslurmpri-a2spotnodeset-111"
[2025-12-09T18:08:06.818] error: _thread_per_group_rpc: can't find address for host mlslurmpri-a2spotnodeset-111, check slurm.conf
Each of these failed lookups takes around 4 seconds. My understanding is that this happens when Slurm wants to kill a node whose associated GCP VM has already been deleted, so its hostname no longer resolves.
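As a rough way to see the delay outside of slurmctld (a sketch using standard tools, nothing from the toolkit itself; the node name is taken from the log excerpt above):

# On the controller, time a lookup for a node name whose backing VM has
# already been deleted; with the default resolver settings this takes
# roughly the same ~4 seconds observed in the slurmctld log before failing.
time getent hosts mlslurmpri-a2spotnodeset-111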
For now I seem to have fixed the issue by manually adding this line to /etc/resolv.conf on the controller node: options edns0 trust-ad timeout:1 attempts:1.
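For reference, the controller's resolver configuration ends up looking roughly like this (an illustrative sketch: the search domains are placeholders that vary per project and zone, and only the options line was changed):

# /etc/resolv.conf on the controller (illustrative values)
# 169.254.169.254 is the GCE metadata server, the default resolver on GCP VMs
search c.my-project.internal google.internal
nameserver 169.254.169.254
# timeout:1 attempts:1 make lookups for already-deleted node names fail fast
options edns0 trust-ad timeout:1 attempts:1

Note that /etc/resolv.conf can be rewritten by the DHCP client on lease renewal, so the change may need to be reapplied or made persistent (for example via a startup-script runner).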
Thanks a lot in advance for any insights you might have!
Steps to reproduce
Expected behavior
Slurm commands like squeue -u $USER never take more than a few seconds.
Actual behavior
Slurm commands like squeue -u $USER take close to a minute.
Version (gcluster --version)
gcluster version - not built from official release
Built from 'main' branch.
Commit info: v1.73.1-1-g526b97efb
Blueprint
If applicable, attach or paste the blueprint YAML used to produce the bug.
I have kept only one partition for simplicity:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This is a modified version of the ml-slurm-v6 blueprint found in the cluster-toolkit repo examples (https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/main/examples/ml-slurm.yaml)
---

blueprint_name: ml-slurm-a100spot-v1

vars:
  project_id: ## Set project id here
  deployment_name: ml-slurm-a100spot
  region: europe-west4
  zone: europe-west4-b
  zones:
  - europe-west4-a
  - europe-west4-b
  - europe-west4-c
  new_image:
    family: ml-slurm
    project: $(vars.project_id)
  disk_size_gb: 50
  metadata:
    VmDnsSetting: GlobalOnly

terraform_backend_defaults:
  type: gcs
  configuration:
    bucket: ## Set your GCS bucket for terraform state here

deployment_groups:
- group: primary
  modules:
  - id: network
    source: modules/network/pre-existing-vpc
    settings:
      network_name: ml-slurm-prior-net
      subnetwork_name: ml-slurm-prior-primary-subnet

  - id: homefs
    source: modules/file-system/pre-existing-network-storage
    settings:
      server_ip: 10.178.0.2
      remote_mount: nfsshare
      local_mount: /home
      fs_type: nfs

  - id: script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: install-ml-libraries.sh
        content: |
          #!/bin/bash
          # this script is designed to execute on Slurm images published by SchedMD that:
          # - are based on Debian distribution of Linux
          # - have NVIDIA drivers pre-installed
          set -e -o pipefail
          # Remove bullseye-backports repository as it no longer has a Release file
          rm -f /etc/apt/sources.list.d/*backports*.list
          sed -i '/bullseye-backports/d' /etc/apt/sources.list
          # Install zsh and related packages
          apt-get update --allow-releaseinfo-change
          apt-get install -y zsh curl git
          # Install Oh My Zsh for all users (only if not already installed)
          if [ ! -d "/root/.oh-my-zsh" ]; then
          sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)" "" --unattended
          fi
          # Set zsh as default shell for all new users
          sed -i 's|SHELL=/bin/sh|SHELL=/bin/zsh|' /etc/default/useradd
          echo "deb https://packages.cloud.google.com/apt google-fast-socket main" > /etc/apt/sources.list.d/google-fast-socket.list
          apt-get update --allow-releaseinfo-change
          apt-get install --assume-yes google-fast-socket
          # Install Docker
          apt-get install -y docker.io
          systemctl enable docker
          systemctl start docker
          # Configure Docker to use systemd cgroup driver
          mkdir -p /etc/docker
          echo '{"exec-opts": ["native.cgroupdriver=systemd"]}' > /etc/docker/daemon.json
          systemctl restart docker
          # Install nvtop via snap (not available in base Debian 11 repos)
          apt-get install -y snapd
          systemctl enable --now snapd apparmor
          snap install nvtop
          ln -sf /snap/bin/nvtop /usr/local/bin/nvtop
          # change default umask and uv cache
          echo "umask 0002" >> /etc/bash.bashrc
          echo "export UV_CACHE_DIR=/home/cache/.uv_cache" >> /etc/bash.bashrc
      - type: shell
        destination: configure-login-share-group.sh
        content: |
          #!/bin/bash
          # Configure share group for login node with PAM hook
          # Create share group if it doesn't exist
          getent group share > /dev/null || groupadd share
          # Create PAM hook script to add users to share group on SSH login
          cat > /usr/local/bin/add-user-to-share-group.sh << 'PAMSCRIPT'
          #!/bin/bash
          [ -n "$PAM_USER" ] && /usr/sbin/usermod -aG share "$PAM_USER" 2>/dev/null
          exit 0
          PAMSCRIPT
          chmod +x /usr/local/bin/add-user-to-share-group.sh
          # Add PAM hook to sshd if not already present
          if ! grep -q "add-user-to-share-group" /etc/pam.d/sshd; then
          echo "session optional pam_exec.so quiet /usr/local/bin/add-user-to-share-group.sh" >> /etc/pam.d/sshd
          fi
          # Add bashrc refresh logic for interactive shells
          if ! grep -q "_GROUPS_REFRESHED" /etc/bash.bashrc 2>/dev/null; then
          cat >> /etc/bash.bashrc << 'BASHRC'
          # Refresh group membership if user was added to 'share' but shell doesn't have it
          if [ -n "$USER" ] && [ -z "$_GROUPS_REFRESHED" ]; then
          if getent group share 2>/dev/null | grep -qw "$USER"; then
          if ! id -nG 2>/dev/null | grep -qw share; then
          export _GROUPS_REFRESHED=1
          exec sg share newgrp share
          fi
          fi
          fi
          BASHRC
          fi
          echo "Login node share group configuration complete"

  - id: node_startup_script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: mount-local-ssd.sh
        content: |
          #!/bin/bash
          echo "[mount-local-ssd] Script started" >> /var/log/mount-local-ssd.log
          if [ -b /dev/disk/by-id/google-local-nvme-ssd-0 ]; then
          echo "[mount-local-ssd] Found local SSD device" >> /var/log/mount-local-ssd.log
          if ! blkid /dev/disk/by-id/google-local-nvme-ssd-0; then
          echo "[mount-local-ssd] Formatting local SSD" >> /var/log/mount-local-ssd.log
          mkfs.ext4 -F /dev/disk/by-id/google-local-nvme-ssd-0
          fi
          mkdir -p /mnt/local-ssd
          mount /dev/disk/by-id/google-local-nvme-ssd-0 /mnt/local-ssd
          chmod 777 /mnt/local-ssd
          grep -q google-local-nvme-ssd-0 /etc/fstab || echo '/dev/disk/by-id/google-local-nvme-ssd-0 /mnt/local-ssd ext4 defaults,nofail 0 2' >> /etc/fstab
          echo "[mount-local-ssd] Mount complete" >> /var/log/mount-local-ssd.log
          else
          echo "[mount-local-ssd] No local SSD device found" >> /var/log/mount-local-ssd.log
          fi
      - type: shell
        destination: configure-sudoers.sh
        content: |
          #!/bin/bash
          # Grant passwordless sudo access to cluster users
          # Add your users here:
          # echo "username ALL=(ALL:ALL) NOPASSWD: ALL" > /etc/sudoers.d/cluster-users
          chmod 440 /etc/sudoers.d/cluster-users
      - type: shell
        destination: configure-docker-group.sh
        content: |
          #!/bin/bash
          # Fix docker socket permissions if it exists
          [ -S /var/run/docker.sock ] && chmod 666 /var/run/docker.sock
      - type: shell
        destination: configure-custom-groups.sh
        content: |
          #!/bin/bash
          # Configure custom Unix groups for OS Login users
          # This script sets up group membership for both SSH and Slurm job access
          # Create custom groups if they don't exist
          if ! getent group share > /dev/null; then
          groupadd share
          fi
          # Create Slurm prolog script to add users to groups before jobs run
          cat > /usr/local/bin/slurm-prolog-groups.sh << 'PROLOG'
          #!/bin/bash
          # Slurm Prolog: Add job user to custom groups
          # SLURM_JOB_USER is set by Slurm
          if [ -n "$SLURM_JOB_USER" ]; then
          /usr/sbin/usermod -aG share "$SLURM_JOB_USER" 2>/dev/null
          fi
          PROLOG
          chmod +x /usr/local/bin/slurm-prolog-groups.sh
          # Add group refresh to bash.bashrc for interactive shells
          # This ensures users see their group membership in srun --pty bash, etc.
          if ! grep -q "_GROUPS_REFRESHED" /etc/bash.bashrc 2>/dev/null; then
          cat >> /etc/bash.bashrc << 'BASHRC'
          # Refresh group membership if user was added to 'share' but shell doesn't have it
          if [ -n "$USER" ] && [ -z "$_GROUPS_REFRESHED" ]; then
          if getent group share 2>/dev/null | grep -qw "$USER"; then
          if ! id -nG 2>/dev/null | grep -qw share; then
          export _GROUPS_REFRESHED=1
          exec sg share newgrp share
          fi
          fi
          fi
          BASHRC
          fi
          echo "Custom group configuration complete"

  # Bucket to store training data
  - id: dataset_bucket
    source: community/modules/file-system/cloud-storage-bucket
    settings:
      name_prefix: training-data
      use_deployment_name_in_bucket_name: false
      random_suffix: true
      region: $(vars.region)

- group: packer
  modules:
  - id: custom-image
    source: modules/packer/custom-image
    kind: packer
    use:
    - network
    - script
    settings:
      # give VM a public IP to ensure startup script can reach public internet
      omit_external_ip: false
      source_image_project_id: [schedmd-slurm-public]
      # see latest in https://github.com/GoogleCloudPlatform/slurm-gcp/blob/master/docs/images.md#published-image-family
      source_image_family: slurm-gcp-6-11-debian-12
      disk_size: $(vars.disk_size_gb)
      disk_type: pd-ssd
      image_family: $(vars.new_image.family)
      # building this image does not require a GPU-enabled VM
      machine_type: c2-standard-4
      state_timeout: 15m

- group: cluster
  modules:
  - id: examples
    source: modules/scripts/startup-script

  - id: controller_startup_script
    source: modules/scripts/startup-script
    settings:
      runners:
      - type: shell
        destination: configure-interactive-partitions.sh
        content: |
          #!/bin/bash
          # Create a systemd service to set PowerDownOnIdle=NO for interactive partitions
          # This runs after slurmctld is fully started
          cat > /etc/systemd/system/slurm-interactive-config.service << 'EOF'
          [Unit]
          Description=Configure interactive partitions PowerDownOnIdle
          After=slurmctld.service
          Requires=slurmctld.service
          [Service]
          Type=oneshot
          ExecStartPre=/bin/sleep 30
          ExecStart=/bin/bash -c 'sinfo -h -o "%%P" | grep interactive | while read -r partition; do scontrol update PartitionName=$partition PowerDownOnIdle=NO; done'
          RemainAfterExit=yes
          [Install]
          WantedBy=multi-user.target
          EOF
          systemctl daemon-reload
          systemctl enable slurm-interactive-config.service

  # A100 Spot VM nodeset
  - id: a2_spot_nodeset
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    use: [network]
    settings:
      node_count_dynamic_max: 200
      bandwidth_tier: gvnic_enabled
      machine_type: a2-highgpu-1g
      instance_image: $(vars.new_image)
      instance_image_custom: true
      startup_script: $(node_startup_script.startup_script)
      enable_spot_vm: true
      additional_disks:
      - disk_name: null
        device_name: local-ssd
        boot: false
        auto_delete: true
        disk_type: local-ssd
        disk_size_gb: 375
        disk_labels: {}

  # A100 Spot partition
  - id: gpu_a100_spot_partition
    source: community/modules/compute/schedmd-slurm-gcp-v6-partition
    use: [a2_spot_nodeset]
    settings:
      partition_name: gpua100spot
      is_default: true

  - id: slurm_login
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-login
    use: [network]
    settings:
      machine_type: c2-standard-16
      enable_login_public_ips: true
      instance_image: $(vars.new_image)
      instance_image_custom: true

  - id: slurm_controller
    source: community/modules/scheduler/schedmd-slurm-gcp-v6-controller
    use:
    - network
    - gpu_a100_spot_partition
    - homefs
    - slurm_login
    settings:
      machine_type: c2-standard-8
      enable_controller_public_ips: true
      instance_image: $(vars.new_image)
      instance_image_custom: true
      login_startup_script: $(script.startup_script)
      controller_startup_script: $(controller_startup_script.startup_script)

Expanded Blueprint
If applicable, please attach or paste the expanded blueprint. The expanded blueprint can be obtained by running gcluster expand your-blueprint.yaml.
Disregard if the bug occurs when running gcluster expand ... as well.
Output and logs
Screenshots
If applicable, add screenshots to help explain your problem.
Execution environment
- OS: [macOS, ubuntu, ...]
- Shell (To find this, run ps -p $$): [bash, zsh, ...]
- go version:
Additional context
Add any other context about the problem here.