Skip to content

[iris] Fix VM bootstrap monitor killing healthy workers#4180

Merged
rjpower merged 1 commit intomainfrom
fix/vm-bootstrap-split-timeout-health-probe
Mar 26, 2026
Merged

[iris] Fix VM bootstrap monitor killing healthy workers#4180
rjpower merged 1 commit intomainfrom
fix/vm-bootstrap-split-timeout-health-probe

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Mar 26, 2026

  • Split the shared 600s deadline in _run_vm_slice_bootstrap into independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete, 300s) timeouts
  • Add HTTP /health probing as the primary bootstrap completion signal instead of relying solely on GCE serial port text matching ("Bootstrap complete")
  • Serial port polling retained for diagnostics and error detection

Root cause

A worker VM registered at 17:39:12 and was serving tasks, but the bootstrap monitor's shared 600s deadline expired at 17:47:33 because the "Bootstrap complete" line never appeared in GCE serial port output (likely a flush/buffering issue). This killed a healthy worker and its running task.

The serial port approach is inherently fragile: the GCE serial port API uses byte offsets but the poller tracked Python character offsets (len(output)), and the output depends on the metadata_script_runner flushing before exit 0.

Split the shared 600s deadline into independent phase 1 (cloud READY,
600s) and phase 2 (bootstrap complete, 300s) timeouts so a slow VM
startup cannot starve the bootstrap check.

Add HTTP health probing as the primary bootstrap completion signal
instead of relying solely on GCE serial port text matching, which is
fragile due to buffering and byte/char offset drift. Serial port
polling is retained for diagnostics and error detection.

Fixes a bug where a worker registered and accepted tasks successfully,
but the bootstrap monitor timed out 8 minutes later because
"Bootstrap complete" never appeared in serial port output, killing
the worker and its running task.
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Mar 26, 2026

Claude finished @rjpower's task in 1m 18s —— View job


Review: PR #4180 — Fix VM bootstrap monitor killing healthy workers

Splits the shared 600s deadline into independent phase 1 (cloud READY) and phase 2 (bootstrap complete) timeouts, and adds HTTP /health probing as the primary bootstrap completion signal. The implementation matches the stated objectives and root cause analysis.

Specification check: PR description serves as spec. All five described test scenarios exist and test meaningful behavior (mocked health probe, serial port injection via the fake, independent timeout verification). Tests exercise both success paths and both failure modes.

Findings:

  • Bug: serial port offset tracking uses Python len() on characters, not GCE byte offsets. The PR description itself identifies this as a known fragility of the old code ("the GCE serial port API uses byte offsets but the poller tracked Python character offsets"). The new code at workers.py:700 still does serial_offset += len(output), which counts Python characters. The fake's vm_get_serial_port_output uses full_output[start:] (character slicing), so tests pass, but against real GCE the offset mismatch persists. This is a pre-existing bug carried forward — not a regression introduced by this PR — but worth noting since the PR description calls it out as a known issue. Fix this →

  • Observation: bootstrap_complete / bootstrap_errored are re-initialized inside the if output: block on every poll iteration (workers.py:701-702). This is correct — each poll processes only the new chunk of serial output since last serial_offset. The re-initialization scope is intentional and correct.

No bugs, race conditions, or guideline violations found in the new code. The change is clean and well-scoped.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eea25a4e81

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/iris/src/iris/cluster/providers/gcp/workers.py
@rjpower rjpower requested a review from yonromai March 26, 2026 18:48
Copy link
Copy Markdown
Contributor

@yonromai yonromai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

The split timeout logic and the added VM bootstrap coverage look right, and the targeted GCP provider tests passed locally. I did not find any net-new issues beyond the existing inline thread on the first-success /health exit path, so I am approving without duplicating that feedback.

Generated with Codex

@rjpower rjpower merged commit 079e8cd into main Mar 26, 2026
51 checks passed
@rjpower rjpower deleted the fix/vm-bootstrap-split-timeout-health-probe branch March 26, 2026 19:02
ravwojdyla pushed a commit that referenced this pull request Mar 26, 2026
- Split the shared 600s deadline in `_run_vm_slice_bootstrap` into
independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete,
300s) timeouts
- Add HTTP `/health` probing as the primary bootstrap completion signal
instead of relying solely on GCE serial port text matching ("Bootstrap
complete")
- Serial port polling retained for diagnostics and error detection

## Root cause
A worker VM registered at 17:39:12 and was serving tasks, but the
bootstrap monitor's shared 600s deadline expired at 17:47:33 because the
"Bootstrap complete" line never appeared in GCE serial port output
(likely a flush/buffering issue). This killed a healthy worker and its
running task.

The serial port approach is inherently fragile: the GCE serial port API
uses byte offsets but the poller tracked Python character offsets
(`len(output)`), and the output depends on the metadata_script_runner
flushing before `exit 0`.
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
- Split the shared 600s deadline in `_run_vm_slice_bootstrap` into
independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete,
300s) timeouts
- Add HTTP `/health` probing as the primary bootstrap completion signal
instead of relying solely on GCE serial port text matching ("Bootstrap
complete")
- Serial port polling retained for diagnostics and error detection

## Root cause
A worker VM registered at 17:39:12 and was serving tasks, but the
bootstrap monitor's shared 600s deadline expired at 17:47:33 because the
"Bootstrap complete" line never appeared in GCE serial port output
(likely a flush/buffering issue). This killed a healthy worker and its
running task.

The serial port approach is inherently fragile: the GCE serial port API
uses byte offsets but the poller tracked Python character offsets
(`len(output)`), and the output depends on the metadata_script_runner
flushing before `exit 0`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants