Skip to content

[iris] Fix GCP bootstrap gcloud race#4813

Open
dlwh wants to merge 1 commit intomainfrom
codex/iris-gcloud-bootstrap-retry
Open

[iris] Fix GCP bootstrap gcloud race#4813
dlwh wants to merge 1 commit intomainfrom
codex/iris-gcloud-bootstrap-retry

Conversation

@dlwh
Copy link
Copy Markdown
Member

@dlwh dlwh commented Apr 16, 2026

🤖 Summary:

  • wait for host gcloud to become available before configuring Artifact Registry auth during bootstrap
  • fall back to /snap/bin/gcloud and retry both auth and docker pull so transient snap/package churn does not strand a worker or controller before containers start
  • add regression coverage for the generated bootstrap scripts

Tests:

  • uv run --package marin-iris --group dev python -m pytest lib/iris/tests/cluster/providers/gcp/test_bootstrap.py -x --timeout=120
  • ./infra/pre-commit.py --all-files --fix

GCP host bootstrap assumed gcloud was immediately available during Artifact Registry auth, which let a snap refresh abort worker startup before iris-worker ever launched. Add retry coverage in test_bootstrap.py and wait/retry both docker auth and image pulls so transient host package churn does not strand a worker.
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 16, 2026 — with ChatGPT Codex Connector
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 16, 2026

Claude finished @dlwh's task in 2m 27s —— View job


Review of PR #4813: [iris] Fix GCP bootstrap gcloud race

  • Read changed files and understand the diff
  • Review bootstrap.py changes
  • Review test_bootstrap.py changes
  • Post review feedback

Verdict: Looks good — this is a solid fix for a real production race condition.

What this PR does

Adds retry logic to both the worker and controller bootstrap scripts so that:

  1. gcloud discovery polls up to 60 times (2s apart, ~120s max) for the gcloud binary to appear, with a /snap/bin/gcloud fallback path
  2. AR auth retries gcloud auth configure-docker within the same loop
  3. Docker pull retries up to 5 times (5s apart) for transient pull failures

What I checked

Area Status
Retry logic correctness The `find_gcloud
Behavior change from old code Old code used || true on auth (silent failure). New code fails hard after exhausting retries — this is an improvement. Auth failures now surface clearly instead of deferring to a cryptic docker pull error.
set -e interaction Worker: pull_docker_image_with_retries called bare → return 1 exits the script (same as old docker pull behavior). Controller: wrapped in if/else for graceful error handling (unchanged semantics).
Shell function placement Functions defined after some setup commands but before any call site — correct for bash.
Test coverage Tests verify retry machinery presence via log message assertions. Adequate for confirming the template includes the new functions.

Suggestions (non-blocking)

  1. Shell function duplication (inline comment): The three shell functions are copy-pasted between worker and controller templates, differing only in log prefix. Consider extracting a shared _DOCKER_HELPERS_TEMPLATE parameterized by {{ log_prefix }} to prevent drift. Fine as-is for a targeted fix; worth a follow-up.

  2. Test brittleness (inline comment): Tests assert on exact log message strings. Asserting on function names ("find_gcloud()" in script) would be more resilient to log wording changes, though the current approach does provide stronger verification.

No correctness issues found. Ship it.

Comment on lines +361 to +407
find_gcloud() {
if command -v gcloud > /dev/null 2>&1; then
command -v gcloud
return 0
fi
if [ -x /snap/bin/gcloud ]; then
echo /snap/bin/gcloud
return 0
fi
return 1
}

configure_docker_auth_with_retries() {
local AR_HOST="$1"
local gcloud_bin=""
for attempt in $(seq 1 60); do
gcloud_bin=$(find_gcloud || true)
if [ -z "$gcloud_bin" ]; then
echo "[iris-controller] [3/5] Waiting for gcloud before configuring docker auth (attempt $attempt/60)"
sleep 2
continue
fi
echo "[iris-controller] [3/5] Configuring docker auth for $AR_HOST (attempt $attempt/60)"
if sudo "$gcloud_bin" auth configure-docker "$AR_HOST" -q; then
return 0
fi
echo "[iris-controller] [3/5] gcloud auth configure-docker failed for $AR_HOST (attempt $attempt/60)"
sleep 2
done
echo "[iris-controller] [3/5] ERROR: Failed to configure docker auth for $AR_HOST after 60 attempts"
return 1
}

pull_docker_image_with_retries() {
local image="$1"
for attempt in $(seq 1 5); do
if sudo docker pull "$image"; then
return 0
fi
echo "[iris-controller] [4/5] Docker pull failed (attempt $attempt/5)"
if [ "$attempt" -lt 5 ]; then
sleep 5
fi
done
echo "[iris-controller] [4/5] ERROR: Failed to pull image after 5 attempts: $image"
return 1
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The three shell functions (find_gcloud, configure_docker_auth_with_retries, pull_docker_image_with_retries) are duplicated nearly verbatim between the worker and controller templates (~48 lines each), differing only in the log prefix ([iris-init] vs [iris-controller] [3/5]).

One option to reduce drift: extract a shared _DOCKER_HELPERS_TEMPLATE that takes {{ log_prefix }} as a parameter, then render it into both scripts. That way a future fix (e.g., changing retry counts or adding a new fallback path) only needs to happen once.

Not blocking — the duplication is tolerable for a targeted production fix, but worth cleaning up as a follow-up.

else
echo "[iris-init] Warning: gcloud not found; AR pull may fail without prior auth"
fi
configure_docker_auth_with_retries "$AR_HOST"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting: this is a deliberate behavior change. The old code used || true on gcloud auth configure-docker, silently continuing if auth failed. The new code retries up to 60 times and then fails hard (return 1 under set -e kills the script).

This is the right call — a silent auth failure just delays the crash to docker pull, where the error message is less obvious. Failing explicitly after retries with a clear error message is better for debugging.

Comment on lines +47 to +53
assert "if [ -x /snap/bin/gcloud ]; then" in script
assert 'echo "[iris-init] Waiting for gcloud before configuring docker auth (attempt $attempt/60)"' in script
assert 'echo "[iris-init] Configuring docker auth for $AR_HOST (attempt $attempt/60)"' in script
assert 'echo "[iris-init] Docker pull failed (attempt $attempt/5)"' in script
assert 'echo "[iris-init] ERROR: Failed to configure docker auth for $AR_HOST after 60 attempts"' in script
assert "sleep 2" in script
assert "sleep 5" in script
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: these tests assert on exact log message strings embedded in the generated script, which means any log wording change requires updating both the template and the test. This is a common trade-off for template-based tests and is fine here — just flagging that these assertions will be brittle if the log format evolves.

An alternative would be to assert on structural markers (e.g., "find_gcloud()" in script, "configure_docker_auth_with_retries" in script) rather than full log lines, but the current approach gives more confidence that the full retry machinery is present.

@rjpower
Copy link
Copy Markdown
Collaborator

rjpower commented Apr 16, 2026

Probably worth getting rid of the duplication & the test slop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants