[iris] Fix VM bootstrap monitor killing healthy workers#4180
Conversation
Split the shared 600s deadline into independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete, 300s) timeouts so a slow VM startup cannot starve the bootstrap check. Add HTTP health probing as the primary bootstrap completion signal instead of relying solely on GCE serial port text matching, which is fragile due to buffering and byte/char offset drift. Serial port polling is retained for diagnostics and error detection. Fixes a bug where a worker registered and accepted tasks successfully, but the bootstrap monitor timed out 8 minutes later because "Bootstrap complete" never appeared in serial port output, killing the worker and its running task.
|
Claude finished @rjpower's task in 1m 18s —— View job Review: PR #4180 — Fix VM bootstrap monitor killing healthy workersSplits the shared 600s deadline into independent phase 1 (cloud READY) and phase 2 (bootstrap complete) timeouts, and adds HTTP Specification check: PR description serves as spec. All five described test scenarios exist and test meaningful behavior (mocked health probe, serial port injection via the fake, independent timeout verification). Tests exercise both success paths and both failure modes. Findings:
No bugs, race conditions, or guideline violations found in the new code. The change is clean and well-scoped. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eea25a4e81
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
yonromai
left a comment
There was a problem hiding this comment.
Approved.
The split timeout logic and the added VM bootstrap coverage look right, and the targeted GCP provider tests passed locally. I did not find any net-new issues beyond the existing inline thread on the first-success /health exit path, so I am approving without duplicating that feedback.
Generated with Codex
- Split the shared 600s deadline in `_run_vm_slice_bootstrap` into
independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete,
300s) timeouts
- Add HTTP `/health` probing as the primary bootstrap completion signal
instead of relying solely on GCE serial port text matching ("Bootstrap
complete")
- Serial port polling retained for diagnostics and error detection
## Root cause
A worker VM registered at 17:39:12 and was serving tasks, but the
bootstrap monitor's shared 600s deadline expired at 17:47:33 because the
"Bootstrap complete" line never appeared in GCE serial port output
(likely a flush/buffering issue). This killed a healthy worker and its
running task.
The serial port approach is inherently fragile: the GCE serial port API
uses byte offsets but the poller tracked Python character offsets
(`len(output)`), and the output depends on the metadata_script_runner
flushing before `exit 0`.
- Split the shared 600s deadline in `_run_vm_slice_bootstrap` into
independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete,
300s) timeouts
- Add HTTP `/health` probing as the primary bootstrap completion signal
instead of relying solely on GCE serial port text matching ("Bootstrap
complete")
- Serial port polling retained for diagnostics and error detection
## Root cause
A worker VM registered at 17:39:12 and was serving tasks, but the
bootstrap monitor's shared 600s deadline expired at 17:47:33 because the
"Bootstrap complete" line never appeared in GCE serial port output
(likely a flush/buffering issue). This killed a healthy worker and its
running task.
The serial port approach is inherently fragile: the GCE serial port API
uses byte offsets but the poller tracked Python character offsets
(`len(output)`), and the output depends on the metadata_script_runner
flushing before `exit 0`.
_run_vm_slice_bootstrapinto independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete, 300s) timeouts/healthprobing as the primary bootstrap completion signal instead of relying solely on GCE serial port text matching ("Bootstrap complete")Root cause
A worker VM registered at 17:39:12 and was serving tasks, but the bootstrap monitor's shared 600s deadline expired at 17:47:33 because the "Bootstrap complete" line never appeared in GCE serial port output (likely a flush/buffering issue). This killed a healthy worker and its running task.
The serial port approach is inherently fragile: the GCE serial port API uses byte offsets but the poller tracked Python character offsets (
len(output)), and the output depends on the metadata_script_runner flushing beforeexit 0.