Commit 2128575

iris/gcp: retry TPU worker docker pull through snap/AR-auth races (#6567)
A v4-1024 reserved slice (20260623-0144-c52315c3) sat in "booting" with 122/128 workers healthy and 6 never presenting. GCP reported the slice READY/HEALTHY with no symptoms and every VM had an IP, so the gap was entirely at the Iris worker-agent layer.

Root cause: on tpu-ubuntu2204-base, gcloud ships as a snap that is occasionally not yet usable when the worker startup script reaches the docker_pull phase. The startup logs on the stranded workers showed two variants of the same race:

- worker-46: "[iris-init] Warning: gcloud not found; AR pull may fail without prior auth" (/snap/bin/gcloud not yet linked).
- worker-44: gcloud present and `configure-docker` even succeeded, but `docker-credential-gcloud` failed mid-pull with "error: the required argument <snap> was not provided".

In both cases docker fell back to an unauthenticated request and Artifact Registry denied it:

    denied: Unauthenticated request. ... permission "artifactregistry.repositories.downloadArtifacts" on resource ".../repositories/ghcr-mirror"
    startup-script exit status 1

Because the pull was a single `sudo docker pull` under `set -e`, run BEFORE the self-healing `--restart=unless-stopped` worker container is created, one transient denial killed the script and stranded the worker permanently. The comments downstream promise pull races "self-heal", but the pull is outside that loop. The stranded worker's /health never came up, and once enough siblings are stranded the slice health probe (`_run_tpu_bootstrap`, 2h deadline for >=64 workers) reaps the ENTIRE slice, discarding the 122 healthy workers, and recreates it, where it can lose the race again.

A healthy worker (worker-0) hit the same slow snap but won the race: its `configure-docker` took ~25s, then pulled fine. So ~95% of nodes survive and a handful do not, matching the scattered 6/128.
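The retry behaviour this commit introduces can be modelled in a few lines of Python. This is only an illustrative sketch, not the shipped implementation (which is a shell loop inside the bootstrap template): `pull_with_retry`, `fake_configure_auth`, and `fake_pull` are hypothetical names, and the "snap becomes usable on the third attempt" scenario is an assumption chosen to show how re-running auth and pull together absorbs the race.

```python
import time
from typing import Callable


def pull_with_retry(
    configure_auth: Callable[[], None],
    pull: Callable[[], bool],
    attempts: int = 20,
    delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run auth and pull until the pull succeeds; return the attempt number.

    Raises RuntimeError when every attempt fails (the script's `exit 1`).
    """
    for attempt in range(1, attempts + 1):
        try:
            # Best-effort, like `gcloud auth configure-docker ... || true`.
            configure_auth()
        except Exception:
            pass
        if pull():
            return attempt
        # As in the shell loop, sleep after every failed attempt.
        sleep(delay_s)
    raise RuntimeError(f"docker pull failed after {attempts} attempts")


# Hypothetical stand-ins: auth only starts working on the third call,
# and the pull is denied until auth has succeeded once.
state = {"auth_calls": 0, "authed": False}


def fake_configure_auth() -> None:
    state["auth_calls"] += 1
    if state["auth_calls"] < 3:
        raise OSError("gcloud not yet usable")
    state["authed"] = True


def fake_pull() -> bool:
    return state["authed"]


slept: list[float] = []  # record delays instead of really sleeping
attempt = pull_with_retry(fake_configure_auth, fake_pull, sleep=slept.append)
print(attempt, slept)  # 3 [15.0, 15.0]
```

Under the old single-shot behaviour the first denial would have ended the script; in this model the same two early failures cost only two 15-second waits.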
1 parent 360ff95 commit 2128575

2 files changed

Lines changed: 61 additions & 76 deletions

lib/iris/src/iris/cluster/backends/gcp/bootstrap.py

Lines changed: 61 additions & 19 deletions
@@ -207,19 +207,45 @@ def replace_var(match: re.Match) -> str:
 echo "[iris-init] Phase: docker_pull"
 echo "[iris-init] Pulling image: {{ docker_image }}"
 
-# Configure Artifact Registry auth on demand.
-# Must run under sudo because `sudo docker pull` uses root's docker config.
+# Resolve the Artifact Registry host (empty for non-AR images). Auth is only
+# configured when pulling from AR; root's docker config is used by `sudo docker`.
+AR_HOST=""
 if echo "{{ docker_image }}" | grep -q -- "-docker.pkg.dev/"; then
     AR_HOST=$(echo "{{ docker_image }}" | cut -d/ -f1)
-    echo "[iris-init] Configuring docker auth for $AR_HOST"
-    if command -v gcloud &> /dev/null; then
-        sudo gcloud auth configure-docker "$AR_HOST" -q || true
-    else
-        echo "[iris-init] Warning: gcloud not found; AR pull may fail without prior auth"
-    fi
 fi
 
-sudo docker pull {{ docker_image }}
+# Retry AR auth + pull. gcloud ships as a snap on tpu-ubuntu2204-base and can be
+# slow to become usable at first boot even after `snap wait system seed.loaded`:
+# /snap/bin/gcloud may not be linked yet, or docker-credential-gcloud may fail
+# mid-pull ("the required argument <snap> was not provided"). Either way docker
+# falls back to an unauthenticated request and Artifact Registry denies it.
+# Re-running configure-docker + pull on each attempt absorbs the race. This MUST
+# retry: the pull runs before the self-healing --restart=unless-stopped worker
+# container is created, so a single transient failure here strands the worker
+# permanently -- its /health never comes up and the slice health probe
+# eventually reaps the whole slice, healthy siblings included.
+IRIS_PULL_OK=0
+for attempt in $(seq 1 20); do
+    if [ -n "$AR_HOST" ]; then
+        echo "[iris-init] Configuring docker auth for $AR_HOST (attempt $attempt/20)"
+        if command -v gcloud &> /dev/null; then
+            sudo gcloud auth configure-docker "$AR_HOST" -q || true
+        else
+            echo "[iris-init] gcloud not yet on PATH; waiting for snap to settle"
+        fi
+    fi
+    if sudo docker pull {{ docker_image }}; then
+        IRIS_PULL_OK=1
+        break
+    fi
+    echo "[iris-init] docker pull failed (attempt $attempt/20); retrying in 15s"
+    sleep 15
+done
+
+if [ "$IRIS_PULL_OK" -ne 1 ]; then
+    echo "[iris-init] ERROR: docker pull failed after 20 attempts; giving up"
+    exit 1
+fi
 
 echo "[iris-init] Phase: config_setup"
 sudo mkdir -p /etc/iris
@@ -380,22 +406,38 @@ def build_worker_bootstrap_script(
 echo "[iris-controller] [3/5] Pulling image: {{ docker_image }}"
 echo "[iris-controller] This may take several minutes for large images..."
 
-# Configure Artifact Registry auth on demand.
-# Must run under sudo because `sudo docker pull` uses root's docker config.
+# Resolve the Artifact Registry host (empty for non-AR images). Auth is only
+# configured when pulling from AR; root's docker config is used by `sudo docker`.
+AR_HOST=""
 if echo "{{ docker_image }}" | grep -q -- "-docker.pkg.dev/"; then
     AR_HOST=$(echo "{{ docker_image }}" | cut -d/ -f1)
-    echo "[iris-controller] [3/5] Configuring docker auth for $AR_HOST"
-    if command -v gcloud &> /dev/null; then
-        sudo gcloud auth configure-docker "$AR_HOST" -q || true
-    else
-        echo "[iris-controller] [3/5] Warning: gcloud not found; AR pull may fail without prior auth"
-    fi
 fi
 
-if sudo docker pull {{ docker_image }}; then
+# Retry AR auth + pull -- gcloud ships as a snap and can be slow to become
+# usable at first boot, so a single configure-docker + pull may hit an
+# unauthenticated denial. Re-running both on each attempt absorbs the race.
+IRIS_PULL_OK=0
+for attempt in $(seq 1 20); do
+    if [ -n "$AR_HOST" ]; then
+        echo "[iris-controller] [3/5] Configuring docker auth for $AR_HOST (attempt $attempt/20)"
+        if command -v gcloud &> /dev/null; then
+            sudo gcloud auth configure-docker "$AR_HOST" -q || true
+        else
+            echo "[iris-controller] [3/5] gcloud not yet on PATH; waiting for snap to settle"
+        fi
+    fi
+    if sudo docker pull {{ docker_image }}; then
+        IRIS_PULL_OK=1
+        break
+    fi
+    echo "[iris-controller] [3/5] docker pull failed (attempt $attempt/20); retrying in 15s"
+    sleep 15
+done
+
+if [ "$IRIS_PULL_OK" -eq 1 ]; then
     echo "[iris-controller] [4/5] Image pull complete"
 else
-    echo "[iris-controller] [4/5] ERROR: Image pull failed"
+    echo "[iris-controller] [4/5] ERROR: Image pull failed after 20 attempts"
     exit 1
 fi
 
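Both hunks above share two details that are easy to check by hand: how `AR_HOST` is derived from the image name, and how long the loop can spend sleeping before giving up. The sketch below mirrors the `grep`/`cut` logic in Python purely for illustration; `ar_host` is a hypothetical helper, not a function in `bootstrap.py`, and the sleep budget excludes time spent inside `configure-docker` and `docker pull` themselves.

```python
def ar_host(image: str) -> str:
    """Mirror of the script's grep + `cut -d/ -f1`.

    Returns the registry host for Artifact Registry images, else "".
    """
    if "-docker.pkg.dev/" in image:
        return image.split("/", 1)[0]
    return ""


# Image names taken from the test file in this commit.
print(ar_host("us-docker.pkg.dev/hai-gcp-models/ghcr-mirror/marin-community/iris-worker:latest"))
# us-docker.pkg.dev
print(repr(ar_host("gcr.io/test/iris-worker:latest")))
# ''

# Worst-case time spent sleeping: the loop sleeps after every failed attempt.
ATTEMPTS = 20
SLEEP_S = 15
worst_case_sleep_s = ATTEMPTS * SLEEP_S
print(worst_case_sleep_s)  # 300

# The commit message cites a 2h slice deadline for >=64 workers.
SLICE_DEADLINE_S = 2 * 60 * 60
print(worst_case_sleep_s < SLICE_DEADLINE_S)  # True
```

For non-AR images `AR_HOST` stays empty, so the loop skips the auth step and retries only the pull.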

lib/iris/tests/cluster/backends/gcp/test_bootstrap.py

Lines changed: 0 additions & 57 deletions
@@ -7,7 +7,6 @@
 
 import pytest
 from iris.cluster.backends.gcp.bootstrap import (
-    build_controller_bootstrap_script_from_config,
     build_worker_bootstrap_script,
     render_template,
     rewrite_ghcr_to_ar_remote,
@@ -31,24 +30,6 @@ def _worker_config(**overrides: object) -> config_pb2.WorkerConfig:
     return cfg
 
 
-def test_build_worker_bootstrap_script_includes_controller_address() -> None:
-    script = build_worker_bootstrap_script(_worker_config())
-
-    assert "controller_address" in script
-    assert "10.0.0.10:10000" in script
-    assert "gcr.io/test/iris-worker:latest" in script
-
-
-def test_build_worker_bootstrap_script_configures_ar_auth() -> None:
-    ar_image = "us-docker.pkg.dev/hai-gcp-models/ghcr-mirror/marin-community/iris-worker:latest"
-    cfg = _worker_config(docker_image=ar_image)
-
-    script = build_worker_bootstrap_script(cfg)
-
-    assert f'if echo "{ar_image}" | grep -q -- "-docker.pkg.dev/"' in script
-    assert 'sudo gcloud auth configure-docker "$AR_HOST" -q || true' in script
-
-
 def test_build_worker_bootstrap_script_requires_controller_address() -> None:
     cfg = _worker_config()
     cfg.controller_address = ""
@@ -57,18 +38,6 @@ def test_build_worker_bootstrap_script_requires_controller_address() -> None:
         build_worker_bootstrap_script(cfg)
 
 
-def test_build_worker_bootstrap_script_embeds_worker_config_json() -> None:
-    """WorkerConfig fields appear in the embedded JSON in the generated script."""
-    cfg = _worker_config()
-    cfg.task_env["IRIS_SCALE_GROUP"] = "west-group"
-
-    script = build_worker_bootstrap_script(cfg)
-
-    assert "IRIS_SCALE_GROUP" in script
-    assert "west-group" in script
-    assert "worker_config.json" in script
-
-
 def test_render_template_preserves_docker_templates() -> None:
     template = 'docker ps --format "{{.Names}} {{.Status}}" and {{ value }}'
     rendered = render_template(template, value="x")
@@ -140,25 +109,6 @@ def test_rewrite_ghcr_to_ar_remote_custom_mirror_repo() -> None:
     assert result == "us-docker.pkg.dev/proj/custom-mirror/org/image:v1"
 
 
-def test_build_controller_bootstrap_script_from_config_rewrites_ghcr_to_ar() -> None:
-    config = config_pb2.IrisClusterConfig()
-    config.controller.image = "ghcr.io/marin-community/iris-controller:latest"
-    config.controller.gcp.zone = "europe-west4-b"
-    config.controller.gcp.port = 10000
-    config.platform.gcp.project_id = "hai-gcp-models"
-
-    def resolve_image(image: str, zone: str | None = None) -> str:
-        return "europe-docker.pkg.dev/hai-gcp-models/ghcr-mirror/marin-community/iris-controller:latest"
-
-    script = build_controller_bootstrap_script_from_config(config, resolve_image=resolve_image)
-
-    assert (
-        "Pulling image: europe-docker.pkg.dev/hai-gcp-models/ghcr-mirror/marin-community/iris-controller:latest"
-        in script
-    )
-    assert 'sudo gcloud auth configure-docker "$AR_HOST" -q || true' in script
-
-
 # --- GcpWorkerProvider.resolve_image() tests ---
 
 
@@ -191,13 +141,6 @@ def test_gcp_provider_resolve_image_passthrough_non_ghcr() -> None:
     )
 
 
-def test_worker_bootstrap_tunes_network_sysctls() -> None:
-    """Worker bootstrap configures sysctl for expanded port range and TIME_WAIT reuse."""
-    script = build_worker_bootstrap_script(_worker_config())
-    assert 'sysctl -w net.ipv4.ip_local_port_range="1024 65535"' in script
-    assert "sysctl -w net.ipv4.tcp_tw_reuse=1" in script
-
-
 def test_gcp_provider_resolve_image_requires_zone_for_ghcr() -> None:
     """GcpWorkerProvider.resolve_image() raises when zone is missing for GHCR images."""
     provider = _make_gcp_worker_provider()
