Commit 2128575
iris/gcp: retry TPU worker docker pull through snap/AR-auth races (#6567)
A v4-1024 reserved slice (20260623-0144-c52315c3) sat in "booting" with
122/128 workers healthy and 6 never presenting. GCP reported the slice
READY/HEALTHY with no symptoms and every VM had an IP — the gap was
entirely at the Iris worker-agent layer.
Root cause: on tpu-ubuntu2204-base, gcloud ships as a snap that is
occasionally not yet usable when the worker startup script reaches the
docker_pull phase. The startup logs on the stranded workers showed two
variants of the same race:
- worker-46: "[iris-init] Warning: gcloud not found; AR pull may fail
  without prior auth": /snap/bin/gcloud not yet linked.
- worker-44: gcloud present and `configure-docker` even succeeded, but
  `docker-credential-gcloud` failed mid-pull with "error: the required
  argument <snap> was not provided".
In both cases docker fell back to an unauthenticated request and
Artifact Registry denied it:
    denied: Unauthenticated request. ... permission
    "artifactregistry.repositories.downloadArtifacts" on resource
    ".../repositories/ghcr-mirror"
    startup-script exit status 1
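The first variant above (gcloud not yet linked) can be illustrated with a small readiness poll. This is a minimal sketch, not the code from this commit: `wait_for_executable`, its timeout, and its polling interval are hypothetical names and values chosen for illustration. A presence check alone would not cover the worker-44 variant, where gcloud was on the PATH but the credential helper still failed mid-pull.

```python
import shutil
import time
from typing import Callable, Optional


def wait_for_executable(
    name: str,
    timeout_s: float = 120.0,   # hypothetical budget, not taken from the commit
    interval_s: float = 2.0,    # hypothetical polling interval
    which: Callable[[str], Optional[str]] = shutil.which,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll PATH until `name` resolves; return its path, or None on timeout."""
    deadline = clock() + timeout_s
    while True:
        path = which(name)
        if path is not None:
            return path
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

The `which`, `sleep`, and `clock` parameters exist only so the loop can be exercised without real waiting; a caller would normally pass just the executable name.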
Because the pull was a single `sudo docker pull` under `set -e`, run
BEFORE the self-healing `--restart=unless-stopped` worker container is
created, one transient denial killed the script and stranded the worker
permanently. Downstream comments promise that pull races "self-heal", but
the pull sits outside that restart loop. The stranded worker's /health
never comes up, and once enough siblings are stranded the slice health
probe (`_run_tpu_bootstrap`, 2h deadline for >=64 workers) reaps the
ENTIRE slice, discarding the 122 healthy workers, and recreates it, where
it can lose the race again.
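The failure mode described above suggests wrapping the pull in a bounded retry that re-runs registry auth before each attempt. The following is a rough sketch of that idea, not the change made in this commit (the real startup script is not shown here). `pull_with_retry`, `_run`, the attempt count, and the backoff values are all hypothetical, and running `gcloud auth configure-docker` under `sudo` is an assumption.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code; 127 if the binary is missing."""
    try:
        return subprocess.run(list(cmd), check=False).returncode
    except FileNotFoundError:
        return 127


def pull_with_retry(
    image: str,
    attempts: int = 5,            # hypothetical retry budget
    base_delay_s: float = 5.0,    # hypothetical initial backoff
    run: Callable[[Sequence[str]], int] = _run,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry auth + pull with exponential backoff; True once a pull succeeds."""
    registry_host = image.split("/", 1)[0]
    for attempt in range(1, attempts + 1):
        # Re-run auth on every attempt: gcloud may have been unusable earlier.
        # Its exit code is ignored on purpose; the pull result decides.
        run(["sudo", "gcloud", "auth", "configure-docker", registry_host, "--quiet"])
        if run(["sudo", "docker", "pull", image]) == 0:
            return True
        if attempt < attempts:
            sleep(base_delay_s * 2 ** (attempt - 1))
    return False
```

With this shape, a transient denial costs one backoff interval instead of ending the startup script, and a `False` return still lets the caller fail loudly once the budget is exhausted.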
A healthy worker (worker-0) hit the same slow snap but won the race: its
`configure-docker` took ~25s, after which the pull succeeded. So ~95% of
nodes survive and a handful do not, matching the scattered 6/128.
1 parent 360ff95 commit 2128575
2 files changed
Lines changed: 61 additions & 76 deletions
File tree
- lib/iris
- src/iris/cluster/backends/gcp
- tests/cluster/backends/gcp