[iris] Fix GCP bootstrap gcloud race by dlwh · Pull Request #4813 · marin-community/marin

dlwh · 2026-04-16T01:44:28Z

🤖 Summary:

wait for host gcloud to become available before configuring Artifact Registry auth during bootstrap
fall back to /snap/bin/gcloud and retry both auth and docker pull so transient snap/package churn does not strand a worker or controller before containers start
add regression coverage for the generated bootstrap scripts

Tests:

uv run --package marin-iris --group dev python -m pytest lib/iris/tests/cluster/providers/gcp/test_bootstrap.py -x --timeout=120
./infra/pre-commit.py --all-files --fix

GCP host bootstrap assumed gcloud was immediately available during Artifact Registry auth, which let a snap refresh abort worker startup before iris-worker ever launched. Add retry coverage in test_bootstrap.py and wait/retry both docker auth and image pulls so transient host package churn does not strand a worker.

claude · 2026-04-16T01:44:43Z

Claude finished @dlwh's task in 2m 27s —— View job

Review of PR #4813: [iris] Fix GCP bootstrap gcloud race

Read changed files and understand the diff
Review bootstrap.py changes
Review test_bootstrap.py changes
Post review feedback

Verdict: Looks good — this is a solid fix for a real production race condition.

What this PR does

Adds retry logic to both the worker and controller bootstrap scripts so that:

gcloud discovery polls up to 60 times (2s apart, ~120s max) for the gcloud binary to appear, with a /snap/bin/gcloud fallback path
AR auth retries gcloud auth configure-docker within the same loop
Docker pull retries up to 5 times (5s apart) for transient pull failures

What I checked

Area	Status
Retry logic correctness	The `find_gcloud
Behavior change from old code	Old code used `\|\| true` on auth (silent failure). New code fails hard after exhausting retries — this is an improvement. Auth failures now surface clearly instead of deferring to a cryptic docker pull error.
`set -e` interaction	Worker: `pull_docker_image_with_retries` called bare → `return 1` exits the script (same as old `docker pull` behavior). Controller: wrapped in `if/else` for graceful error handling (unchanged semantics).
Shell function placement	Functions defined after some setup commands but before any call site — correct for bash.
Test coverage	Tests verify retry machinery presence via log message assertions. Adequate for confirming the template includes the new functions.

Suggestions (non-blocking)

Shell function duplication (inline comment): The three shell functions are copy-pasted between worker and controller templates, differing only in log prefix. Consider extracting a shared _DOCKER_HELPERS_TEMPLATE parameterized by {{ log_prefix }} to prevent drift. Fine as-is for a targeted fix; worth a follow-up.
Test brittleness (inline comment): Tests assert on exact log message strings. Asserting on function names ("find_gcloud()" in script) would be more resilient to log wording changes, though the current approach does provide stronger verification.

No correctness issues found. Ship it.

claude · 2026-04-16T01:46:43Z

+find_gcloud() {
+    if command -v gcloud > /dev/null 2>&1; then
+        command -v gcloud
+        return 0
+    fi
+    if [ -x /snap/bin/gcloud ]; then
+        echo /snap/bin/gcloud
+        return 0
+    fi
+    return 1
+}
+
+configure_docker_auth_with_retries() {
+    local AR_HOST="$1"
+    local gcloud_bin=""
+    for attempt in $(seq 1 60); do
+        gcloud_bin=$(find_gcloud || true)
+        if [ -z "$gcloud_bin" ]; then
+            echo "[iris-controller] [3/5] Waiting for gcloud before configuring docker auth (attempt $attempt/60)"
+            sleep 2
+            continue
+        fi
+        echo "[iris-controller] [3/5] Configuring docker auth for $AR_HOST (attempt $attempt/60)"
+        if sudo "$gcloud_bin" auth configure-docker "$AR_HOST" -q; then
+            return 0
+        fi
+        echo "[iris-controller] [3/5] gcloud auth configure-docker failed for $AR_HOST (attempt $attempt/60)"
+        sleep 2
+    done
+    echo "[iris-controller] [3/5] ERROR: Failed to configure docker auth for $AR_HOST after 60 attempts"
+    return 1
+}
+
+pull_docker_image_with_retries() {
+    local image="$1"
+    for attempt in $(seq 1 5); do
+        if sudo docker pull "$image"; then
+            return 0
+        fi
+        echo "[iris-controller] [4/5] Docker pull failed (attempt $attempt/5)"
+        if [ "$attempt" -lt 5 ]; then
+            sleep 5
+        fi
+    done
+    echo "[iris-controller] [4/5] ERROR: Failed to pull image after 5 attempts: $image"
+    return 1
+}


Nit: The three shell functions (find_gcloud, configure_docker_auth_with_retries, pull_docker_image_with_retries) are duplicated nearly verbatim between the worker and controller templates (~48 lines each), differing only in the log prefix ([iris-init] vs [iris-controller] [3/5]).

One option to reduce drift: extract a shared _DOCKER_HELPERS_TEMPLATE that takes {{ log_prefix }} as a parameter, then render it into both scripts. That way a future fix (e.g., changing retry counts or adding a new fallback path) only needs to happen once.

Not blocking — the duplication is tolerable for a targeted production fix, but worth cleaning up as a follow-up.

claude · 2026-04-16T01:46:50Z

-    else
-        echo "[iris-init] Warning: gcloud not found; AR pull may fail without prior auth"
-    fi
+    configure_docker_auth_with_retries "$AR_HOST"


Worth noting: this is a deliberate behavior change. The old code used || true on gcloud auth configure-docker, silently continuing if auth failed. The new code retries up to 60 times and then fails hard (return 1 under set -e kills the script).

This is the right call — a silent auth failure just delays the crash to docker pull, where the error message is less obvious. Failing explicitly after retries with a clear error message is better for debugging.

claude · 2026-04-16T01:46:57Z

+    assert "if [ -x /snap/bin/gcloud ]; then" in script
+    assert 'echo "[iris-init] Waiting for gcloud before configuring docker auth (attempt $attempt/60)"' in script
+    assert 'echo "[iris-init] Configuring docker auth for $AR_HOST (attempt $attempt/60)"' in script
+    assert 'echo "[iris-init] Docker pull failed (attempt $attempt/5)"' in script
+    assert 'echo "[iris-init] ERROR: Failed to configure docker auth for $AR_HOST after 60 attempts"' in script
+    assert "sleep 2" in script
+    assert "sleep 5" in script


Minor: these tests assert on exact log message strings embedded in the generated script, which means any log wording change requires updating both the template and the test. This is a common trade-off for template-based tests and is fine here — just flagging that these assertions will be brittle if the log format evolves.

An alternative would be to assert on structural markers (e.g., "find_gcloud()" in script, "configure_docker_auth_with_retries" in script) rather than full log lines, but the current approach gives more confidence that the full retry machinery is present.

rjpower · 2026-04-16T02:07:38Z

Probably worth getting rid of the duplication & the test slop

dlwh added the agent-generated Created by automation/agent label Apr 16, 2026 — with ChatGPT Codex Connector

claude bot reviewed Apr 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[iris] Fix GCP bootstrap gcloud race#4813

[iris] Fix GCP bootstrap gcloud race#4813
dlwh wants to merge 1 commit intomainfrom
codex/iris-gcloud-bootstrap-retry

dlwh commented Apr 16, 2026

Uh oh!

claude bot commented Apr 16, 2026 •

edited

Loading

Uh oh!

claude bot Apr 16, 2026

Uh oh!

claude bot Apr 16, 2026

Uh oh!

claude bot Apr 16, 2026

Uh oh!

rjpower commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dlwh commented Apr 16, 2026

Uh oh!

claude bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review of PR #4813: [iris] Fix GCP bootstrap gcloud race

What this PR does

What I checked

Suggestions (non-blocking)

Uh oh!

claude bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

rjpower commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude bot commented Apr 16, 2026 •

edited

Loading