[iris] Fix VM bootstrap monitor killing healthy workers by rjpower · Pull Request #4180 · marin-community/marin

rjpower · 2026-03-26T18:28:30Z

Split the shared 600s deadline in _run_vm_slice_bootstrap into independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete, 300s) timeouts
Add HTTP /health probing as the primary bootstrap completion signal instead of relying solely on GCE serial port text matching ("Bootstrap complete")
Serial port polling retained for diagnostics and error detection

Root cause

A worker VM registered at 17:39:12 and was serving tasks, but the bootstrap monitor's shared 600s deadline expired at 17:47:33 because the "Bootstrap complete" line never appeared in GCE serial port output (likely a flush/buffering issue). This killed a healthy worker and its running task.

The serial port approach is inherently fragile: the GCE serial port API uses byte offsets but the poller tracked Python character offsets (len(output)), and the output depends on the metadata_script_runner flushing before exit 0.

Split the shared 600s deadline into independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete, 300s) timeouts so a slow VM startup cannot starve the bootstrap check. Add HTTP health probing as the primary bootstrap completion signal instead of relying solely on GCE serial port text matching, which is fragile due to buffering and byte/char offset drift. Serial port polling is retained for diagnostics and error detection. Fixes a bug where a worker registered and accepted tasks successfully, but the bootstrap monitor timed out 8 minutes later because "Bootstrap complete" never appeared in serial port output, killing the worker and its running task.

claude · 2026-03-26T18:28:45Z

Claude finished @rjpower's task in 1m 18s —— View job

Review: PR #4180 — Fix VM bootstrap monitor killing healthy workers

Splits the shared 600s deadline into independent phase 1 (cloud READY) and phase 2 (bootstrap complete) timeouts, and adds HTTP /health probing as the primary bootstrap completion signal. The implementation matches the stated objectives and root cause analysis.

Specification check: PR description serves as spec. All five described test scenarios exist and test meaningful behavior (mocked health probe, serial port injection via the fake, independent timeout verification). Tests exercise both success paths and both failure modes.

Findings:

Bug: serial port offset tracking uses Python len() on characters, not GCE byte offsets. The PR description itself identifies this as a known fragility of the old code ("the GCE serial port API uses byte offsets but the poller tracked Python character offsets"). The new code at workers.py:700 still does serial_offset += len(output), which counts Python characters. The fake's vm_get_serial_port_output uses full_output[start:] (character slicing), so tests pass, but against real GCE the offset mismatch persists. This is a pre-existing bug carried forward — not a regression introduced by this PR — but worth noting since the PR description calls it out as a known issue. Fix this →
Observation: bootstrap_complete / bootstrap_errored are re-initialized inside the if output: block on every poll iteration (workers.py:701-702). This is correct — each poll processes only the new chunk of serial output since last serial_offset. The re-initialization scope is intentional and correct.

No bugs, race conditions, or guideline violations found in the new code. The change is clean and well-scoped.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eea25a4e81

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

yonromai

Approved.

The split timeout logic and the added VM bootstrap coverage look right, and the targeted GCP provider tests passed locally. I did not find any net-new issues beyond the existing inline thread on the first-success /health exit path, so I am approving without duplicating that feedback.

_{Generated with Codex}

- Split the shared 600s deadline in `_run_vm_slice_bootstrap` into independent phase 1 (cloud READY, 600s) and phase 2 (bootstrap complete, 300s) timeouts - Add HTTP `/health` probing as the primary bootstrap completion signal instead of relying solely on GCE serial port text matching ("Bootstrap complete") - Serial port polling retained for diagnostics and error detection ## Root cause A worker VM registered at 17:39:12 and was serving tasks, but the bootstrap monitor's shared 600s deadline expired at 17:47:33 because the "Bootstrap complete" line never appeared in GCE serial port output (likely a flush/buffering issue). This killed a healthy worker and its running task. The serial port approach is inherently fragile: the GCE serial port API uses byte offsets but the poller tracked Python character offsets (`len(output)`), and the output depends on the metadata_script_runner flushing before `exit 0`.

chatgpt-codex-connector Bot reviewed Mar 26, 2026

View reviewed changes

Comment thread lib/iris/src/iris/cluster/providers/gcp/workers.py

rjpower requested a review from yonromai March 26, 2026 18:48

yonromai approved these changes Mar 26, 2026

View reviewed changes

rjpower merged commit 079e8cd into main Mar 26, 2026
51 checks passed

rjpower deleted the fix/vm-bootstrap-split-timeout-health-probe branch March 26, 2026 19:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[iris] Fix VM bootstrap monitor killing healthy workers#4180

[iris] Fix VM bootstrap monitor killing healthy workers#4180
rjpower merged 1 commit intomainfrom
fix/vm-bootstrap-split-timeout-health-probe

rjpower commented Mar 26, 2026 •

edited

Loading

Uh oh!

claude Bot commented Mar 26, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

yonromai left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rjpower commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Root cause

Uh oh!

claude Bot commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review: PR #4180 — Fix VM bootstrap monitor killing healthy workers

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

yonromai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rjpower commented Mar 26, 2026 •

edited

Loading

claude Bot commented Mar 26, 2026 •

edited

Loading