Conversation
GCP host bootstrap assumed gcloud was immediately available during Artifact Registry auth, which let a snap refresh abort worker startup before iris-worker ever launched. Add retry coverage in test_bootstrap.py and wait/retry both docker auth and image pulls so transient host package churn does not strand a worker.
|
Claude finished @dlwh's task in 2m 27s —— View job Review of PR #4813: [iris] Fix GCP bootstrap gcloud race
Verdict: Looks good — this is a solid fix for a real production race condition. What this PR doesAdds retry logic to both the worker and controller bootstrap scripts so that:
What I checked
Suggestions (non-blocking)
No correctness issues found. Ship it. |
| find_gcloud() { | ||
| if command -v gcloud > /dev/null 2>&1; then | ||
| command -v gcloud | ||
| return 0 | ||
| fi | ||
| if [ -x /snap/bin/gcloud ]; then | ||
| echo /snap/bin/gcloud | ||
| return 0 | ||
| fi | ||
| return 1 | ||
| } | ||
|
|
||
| configure_docker_auth_with_retries() { | ||
| local AR_HOST="$1" | ||
| local gcloud_bin="" | ||
| for attempt in $(seq 1 60); do | ||
| gcloud_bin=$(find_gcloud || true) | ||
| if [ -z "$gcloud_bin" ]; then | ||
| echo "[iris-controller] [3/5] Waiting for gcloud before configuring docker auth (attempt $attempt/60)" | ||
| sleep 2 | ||
| continue | ||
| fi | ||
| echo "[iris-controller] [3/5] Configuring docker auth for $AR_HOST (attempt $attempt/60)" | ||
| if sudo "$gcloud_bin" auth configure-docker "$AR_HOST" -q; then | ||
| return 0 | ||
| fi | ||
| echo "[iris-controller] [3/5] gcloud auth configure-docker failed for $AR_HOST (attempt $attempt/60)" | ||
| sleep 2 | ||
| done | ||
| echo "[iris-controller] [3/5] ERROR: Failed to configure docker auth for $AR_HOST after 60 attempts" | ||
| return 1 | ||
| } | ||
|
|
||
| pull_docker_image_with_retries() { | ||
| local image="$1" | ||
| for attempt in $(seq 1 5); do | ||
| if sudo docker pull "$image"; then | ||
| return 0 | ||
| fi | ||
| echo "[iris-controller] [4/5] Docker pull failed (attempt $attempt/5)" | ||
| if [ "$attempt" -lt 5 ]; then | ||
| sleep 5 | ||
| fi | ||
| done | ||
| echo "[iris-controller] [4/5] ERROR: Failed to pull image after 5 attempts: $image" | ||
| return 1 | ||
| } |
There was a problem hiding this comment.
Nit: The three shell functions (find_gcloud, configure_docker_auth_with_retries, pull_docker_image_with_retries) are duplicated nearly verbatim between the worker and controller templates (~48 lines each), differing only in the log prefix ([iris-init] vs [iris-controller] [3/5]).
One option to reduce drift: extract a shared _DOCKER_HELPERS_TEMPLATE that takes {{ log_prefix }} as a parameter, then render it into both scripts. That way a future fix (e.g., changing retry counts or adding a new fallback path) only needs to happen once.
Not blocking — the duplication is tolerable for a targeted production fix, but worth cleaning up as a follow-up.
| else | ||
| echo "[iris-init] Warning: gcloud not found; AR pull may fail without prior auth" | ||
| fi | ||
| configure_docker_auth_with_retries "$AR_HOST" |
There was a problem hiding this comment.
Worth noting: this is a deliberate behavior change. The old code used || true on gcloud auth configure-docker, silently continuing if auth failed. The new code retries up to 60 times and then fails hard (return 1 under set -e kills the script).
This is the right call — a silent auth failure just delays the crash to docker pull, where the error message is less obvious. Failing explicitly after retries with a clear error message is better for debugging.
| assert "if [ -x /snap/bin/gcloud ]; then" in script | ||
| assert 'echo "[iris-init] Waiting for gcloud before configuring docker auth (attempt $attempt/60)"' in script | ||
| assert 'echo "[iris-init] Configuring docker auth for $AR_HOST (attempt $attempt/60)"' in script | ||
| assert 'echo "[iris-init] Docker pull failed (attempt $attempt/5)"' in script | ||
| assert 'echo "[iris-init] ERROR: Failed to configure docker auth for $AR_HOST after 60 attempts"' in script | ||
| assert "sleep 2" in script | ||
| assert "sleep 5" in script |
There was a problem hiding this comment.
Minor: these tests assert on exact log message strings embedded in the generated script, which means any log wording change requires updating both the template and the test. This is a common trade-off for template-based tests and is fine here — just flagging that these assertions will be brittle if the log format evolves.
An alternative would be to assert on structural markers (e.g., "find_gcloud()" in script, "configure_docker_auth_with_retries" in script) rather than full log lines, but the current approach gives more confidence that the full retry machinery is present.
|
Probably worth getting rid of the duplication & the test slop |
🤖 Summary:
gcloudto become available before configuring Artifact Registry auth during bootstrap/snap/bin/gcloudand retry both auth anddocker pullso transient snap/package churn does not strand a worker or controller before containers startTests:
uv run --package marin-iris --group dev python -m pytest lib/iris/tests/cluster/providers/gcp/test_bootstrap.py -x --timeout=120./infra/pre-commit.py --all-files --fix