Skip to content

[iris] Persist controller bootstrap script to GCE startup-script metadata#5771

Merged
rjpower merged 1 commit into
mainfrom
iris-controller-bootstrap-persist-metadata
May 15, 2026
Merged

[iris] Persist controller bootstrap script to GCE startup-script metadata#5771
rjpower merged 1 commit into
mainfrom
iris-controller-bootstrap-persist-metadata

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented May 15, 2026

GcpStandaloneWorkerHandle.bootstrap now writes the rendered script to the instance's startup-script metadata before running it over SSH. Without this, a future GCE-initiated reboot (host maintenance, manual reset) re-runs the original script baked in at VM creation, pinning the controller to the day-one image. Mirrors finelog's gcp_restart behavior.

…data

GcpStandaloneWorkerHandle.bootstrap now writes the rendered script to the
instance's startup-script metadata before running it over SSH. Without this,
a future GCE-initiated reboot (host maintenance, manual reset) re-runs the
original script baked in at VM creation, pinning the controller to the
day-one image. Mirrors finelog's gcp_restart behavior.
@rjpower rjpower added the agent-generated Created by automation/agent label May 15, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 15, 2026

Claude finished @rjpower's task in 2m 22s —— View job


PR Review: Persist controller bootstrap script to GCE startup-script metadata

  • Read the changed file lib/iris/src/iris/cluster/providers/gcp/handles.py
  • Inspect surrounding context (bootstrap flow, metadata helpers, callers)
  • Verify behavior matches PR description (metadata set before SSH run)
  • Check for issues: error handling, idempotency, size limits, escaping, ordering
  • Post review

Verdict: LGTM. Small, well‑motivated change. The override is the right place, the call order is correct, and it leans on already‑tested helpers.

What works well

  • Order is correct (handles.py:224-225): set_metadata runs before super().bootstrap(script), so even if the SSH run later fails, the persisted metadata is consistent with the intended script. And if metadata persistence fails, the SSH bootstrap still proceeds.
  • Metadata write is non‑destructive: GcpService.vm_set_metadata (service.py:888-906) does a get → merge → setMetadata cycle, so existing items (e.g. controller-address set later in vm_lifecycle.py:400, or enable-oslogin) are preserved and the fingerprint is honored.
  • Error handling matches existing pattern: set_metadata (handles.py:211-217) already swallows InfraError to a warning. So a transient metadata‑API failure won't block the SSH bootstrap — consistent with set_labels behavior.
  • Idempotent script: The controller bootstrap (bootstrap.py:263+) stops/recreates the container and uses sudo throughout, so it's safe whether re‑run via SSH (as the iris user) or via GCE startup‑script (as root) on host maintenance reboot.
  • Restart path benefits automatically: restart_controller (vm_lifecycle.py:428-429) calls vm.bootstrap(...), which now picks up the persisted‑script behavior — exactly the day‑one‑image staleness the PR targets.

Minor observations (not blocking)

  1. Worker‑slice side effect, intended? GcpStandaloneWorkerHandle is also constructed for VM‑slice workers (handles.py:495). Workers are normally bootstrapped via the GCE startup‑script set at create time, not via SSH, so the override is a no‑op for the initial path. But restart_worker (_worker_base.py:108-116) → bootstrap() would now rewrite the worker's startup‑script metadata too. Probably what you want (same staleness fix), worth confirming the worker build_worker_bootstrap_script is also safe to re‑run via cold boot (it appears to be: workers.py:418-419 already sets it at creation, so reapplying is the equivalent of "the latest restart").
  2. No test coverage for the new override. A short unit test asserting that set_metadata({"startup-script": script}) is called before super().bootstrap(script) would catch regressions (e.g. the calls being swapped or dropped) — the change is 8 lines but the failure mode (silent loss of the persistence guarantee) is invisible until a host maintenance event months later. Optional, but cheap.
  3. Metadata size: GCE caps per‑value metadata at 256 KiB. Controller scripts are well under that today — just flagging in case the controller script ever grows substantially (e.g. embedding large config blobs).
  4. Comment style: 4‑line block comment is on the long side, but it explains a genuinely non‑obvious why (day‑one image pinning), which matches AGENTS.md's "comments for non‑obvious why" rule. Fine as‑is.
    · branch

@rjpower rjpower merged commit 7ad4e3d into main May 15, 2026
31 checks passed
@rjpower rjpower deleted the iris-controller-bootstrap-persist-metadata branch May 15, 2026 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant