[iris] Replace gcloud CLI with REST API client in CloudGcpService by rjpower · Pull Request #4393 · marin-community/marin

rjpower · 2026-04-03T16:34:41Z

Replace all gcloud subprocess calls in CloudGcpService with direct
REST API calls using httpx + google.auth ADC. This eliminates the
gcloud CLI dependency for resource management and fixes CI failures where
gcloud alpha is not installed.

Changes

- Inline HTTP client into CloudGcpService: Auth (ADC token caching),
  pagination, error mapping, and operation polling are private methods on
  CloudGcpService rather than a separate GCPApi class.
- Enable external IPs on TPU VMs: The gcloud CLI defaults to
  enableExternalIps=true; the REST API does not. Without this, TPU VMs
  had no internet access and startup-scripts couldn't pull Docker images.
- Use locations/- wildcard: Project-wide TPU listing uses a single
  API call (locations/-) instead of iterating all known zones, matching
  gcloud's --zone=- behavior.
- Wait for async operations: VM insert and TPU create poll their
  respective operations before describing the resource.
- Add logging_read to GcpService protocol: Bootstrap log fetching
  goes through the service boundary instead of shelling out to gcloud.
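
The token caching named in the first item can be illustrated with a small sketch. This is not the PR's code: `CachedToken`, its `fetch` callable, and the injected clock are hypothetical names, and the 5-minute margin is taken from the specification comment further down this thread. In real use, `fetch` would wrap a google.auth credentials refresh, which is not shown here.

```python
import time
from typing import Callable, Optional, Tuple

# Assumption: refresh 5 minutes before expiry, as the spec comment describes.
REFRESH_MARGIN_SECONDS = 300.0


class CachedToken:
    """Cache a bearer token and refresh it shortly before it expires.

    `fetch` is a hypothetical callable returning (token, lifetime_seconds).
    """

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock  # monotonic, so wall-clock jumps cannot extend a token
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - REFRESH_MARGIN_SECONDS:
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = now + lifetime
        return self._token


# Usage with a fake clock and a fake fetcher.
fetch_times = []
fake_now = [0.0]


def fake_fetch() -> Tuple[str, float]:
    fetch_times.append(fake_now[0])
    return f"token-{len(fetch_times)}", 3600.0


cache = CachedToken(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()   # fetched at t=0, valid until t=3600
fake_now[0] = 3000.0
second = cache.get()  # still outside the refresh margin, served from cache
fake_now[0] = 3400.0
third = cache.get()   # inside the margin (3400 >= 3300), refreshed
```

Injecting the clock keeps the refresh logic testable without sleeping.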

Test plan

- 154 unit tests pass (service validation, mock HTTP integration, platform bootstrap)
- Cloud smoke test passes (GCP CI workflow)
- TPU VMs boot with external IPs, startup-scripts execute, workers register

🤖 Generated with Claude Code

rjpower · 2026-04-03T16:35:06Z

🤖 Specification

Problem:
CloudGcpService shells out to gcloud CLI via subprocess for all GCP operations.
PR #4379 adds queued-resource operations using gcloud alpha, which requires
the alpha component to be installed. CI environments lack it, causing interactive
install prompts that fail in non-interactive mode (lib/iris/src/iris/cluster/providers/gcp/service.py).
Additionally, _fetch_bootstrap_logs in workers.py uses gcloud logging read via
subprocess, which is another gcloud dependency outside the service boundary.

Approach:
New file api.py: GCPApi class using httpx (direct dep) + google.auth (transitive
via gcsfs, now explicit). Handles ADC auth with token caching (same pattern as
GcpAccessTokenProvider in rpc/auth.py), pagination via nextPageToken, and error
mapping from HTTP status codes to domain exceptions (ResourceNotFoundError,
QuotaExhaustedError, InfraError).
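
A minimal sketch of the nextPageToken pagination described above, assuming a hypothetical `fetch_page` callable that performs one list request and returns the decoded JSON body. The function name and the injected callable are illustrative, not the PR's actual helpers.

```python
from typing import Any, Callable, Dict, Iterator

# Hypothetical signature: fetch_page(query_params) -> decoded JSON body of one page.
FetchPage = Callable[[Dict[str, str]], Dict[str, Any]]


def paginate(fetch_page: FetchPage, items_key: str) -> Iterator[Dict[str, Any]]:
    """Yield every item across pages, following nextPageToken until it is absent."""
    params: Dict[str, str] = {}
    while True:
        body = fetch_page(params)
        yield from body.get(items_key, [])
        token = body.get("nextPageToken")
        if not token:
            return
        # pageToken is the request-side counterpart of nextPageToken.
        params = {**params, "pageToken": token}


# Usage with an in-memory stand-in for the HTTP call.
_PAGES = {
    "": {"nodes": [{"name": "tpu-a"}], "nextPageToken": "t1"},
    "t1": {"nodes": [{"name": "tpu-b"}]},
}


def fake_fetch(params: Dict[str, str]) -> Dict[str, Any]:
    return _PAGES[params.get("pageToken", "")]


names = [node["name"] for node in paginate(fake_fetch, "nodes")]
```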

service.py: CloudGcpService constructor takes optional GCPApi, creates one by
default. All 12 methods rewritten from subprocess.run to self._api.* calls.
vm_update_labels and vm_set_metadata become read-modify-write with fingerprints
(REST API requirement). TPU list with empty zones scans all known zones instead
of using --zone=- (REST API requires real zone in parent path).
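
The read-modify-write pattern for labels might look roughly like the sketch below. The `labels` and `labelFingerprint` body fields match the diff quoted later in this thread; the helper names and the injected `get_instance` / `post_set_labels` callables are assumptions made for illustration.

```python
from typing import Any, Callable, Dict


def build_set_labels_body(
    instance: Dict[str, Any], new_labels: Dict[str, str]
) -> Dict[str, Any]:
    """Merge new labels into the current ones and echo back the fingerprint."""
    merged = {**instance.get("labels", {}), **new_labels}
    return {"labels": merged, "labelFingerprint": instance["labelFingerprint"]}


def update_labels(
    get_instance: Callable[[], Dict[str, Any]],
    post_set_labels: Callable[[Dict[str, Any]], None],
    new_labels: Dict[str, str],
) -> Dict[str, Any]:
    # Read: fetch the current labels together with their fingerprint.
    instance = get_instance()
    # Modify: merge locally, keeping labels we were not asked to change.
    body = build_set_labels_body(instance, new_labels)
    # Write: the server can reject the request if the fingerprint is stale.
    post_set_labels(body)
    return body


# Usage with in-memory stand-ins for the two HTTP calls.
sent = []
body = update_labels(
    get_instance=lambda: {"labels": {"env": "ci"}, "labelFingerprint": "abc123"},
    post_set_labels=sent.append,
    new_labels={"role": "controller"},
)
```

The fingerprint acts as an optimistic-concurrency check: a concurrent label change invalidates it, so a lost update surfaces as an error instead of silently overwriting.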

workers.py: _fetch_bootstrap_logs now calls gcp_service.logging_read() instead of
subprocess. logging_read added to GcpService Protocol; CloudGcpService delegates
to api.logging_list_entries(), InMemoryGcpService returns [].

Key code:
Error mapping in api.py _classify_response: 404 -> ResourceNotFoundError,
429/RESOURCE_EXHAUSTED -> QuotaExhaustedError, everything else -> InfraError.
Delete operations treat 404 as success (idempotent). Token refresh uses
monotonic clock with 5min margin before expiry.
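
The mapping above can be sketched as a pure function over the status code and response body. The exception classes here are local stand-ins with the same names as the domain exceptions (their real hierarchy is not shown in this thread), and `classify` / `classify_delete` are hypothetical names rather than the actual `_classify_response`.

```python
import json
from typing import Optional


# Local stand-ins for the domain exceptions named in the spec.
class ResourceNotFoundError(Exception):
    pass


class QuotaExhaustedError(Exception):
    pass


class InfraError(Exception):
    pass


def classify(status_code: int, body_text: str) -> Optional[Exception]:
    """Return None for success, otherwise the exception the response maps to."""
    if 200 <= status_code < 300:
        return None
    try:
        error = json.loads(body_text).get("error", {})
    except (ValueError, AttributeError):
        error = {}  # non-JSON body, or JSON that is not an object
    if not isinstance(error, dict):
        error = {}
    message = error.get("message") or body_text or f"HTTP {status_code}"
    if status_code == 404:
        return ResourceNotFoundError(message)
    if status_code == 429 or error.get("status") == "RESOURCE_EXHAUSTED":
        return QuotaExhaustedError(message)
    return InfraError(message)


def classify_delete(status_code: int, body_text: str) -> Optional[Exception]:
    """Deletes are idempotent: an already-missing resource counts as success."""
    err = classify(status_code, body_text)
    return None if isinstance(err, ResourceNotFoundError) else err


quota = classify(403, '{"error": {"status": "RESOURCE_EXHAUSTED", "message": "quota"}}')
html = classify(500, "<html>bad gateway</html>")
```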

Tests:
23 new tests in test_gcp_api.py using httpx.MockTransport. Cover URL construction
for all endpoint families (TPU, QR, Compute, Logging), error mapping (404, 429,
500, non-JSON), pagination, auth header injection, token refresh, aggregatedList
flattening, and force-delete params. All 151 existing GCP provider tests pass unchanged.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50ff2d2668


Extract GCPApi class (httpx + google.auth ADC) that handles auth, pagination, and error mapping for TPU v2, Compute v1, and Cloud Logging APIs. Rewrite CloudGcpService to delegate to GCPApi instead of subprocess gcloud calls. This eliminates the gcloud CLI dependency for resource management, fixing CI failures from gcloud alpha not being installed. Add logging_read to the GcpService Protocol so bootstrap log fetching goes through the same boundary.

vm_create and tpu_create called the REST API (which returns async operations) then immediately tried to describe the resource. The old gcloud CLI blocked until operations completed; the REST API does not. This caused workers to fail with "created but could not be described" because the VM/TPU wasn't visible yet.

Add operation polling to GCPApi (_wait_zone_operation for Compute, _wait_tpu_operation for TPU LROs), with instance_insert_wait and tpu_create_wait convenience methods. Update CloudGcpService to use the waiting variants.

Also fix _fetch_bootstrap_logs timestamp filter: Cloud Logging needs RFC3339 timestamps, not "-PT30M" duration literals.

https://claude.ai/code/session_01L4bVGg6j4fw19RiADT1GhM
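
For the timestamp fix mentioned in this commit message, a lookback window can be converted to an absolute RFC3339 timestamp before it goes into the filter. A minimal sketch; `rfc3339` and `since_filter` are hypothetical helper names, and the caller is assumed to pass a timezone-aware `now`.

```python
from datetime import datetime, timedelta, timezone


def rfc3339(ts: datetime) -> str:
    """Format a timezone-aware datetime as an RFC3339 UTC timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def since_filter(lookback: timedelta, now: datetime) -> str:
    """Build a filter clause selecting entries newer than now - lookback."""
    return f'timestamp >= "{rfc3339(now - lookback)}"'


clause = since_filter(
    timedelta(minutes=30),
    now=datetime(2026, 4, 3, 17, 0, 0, tzinfo=timezone.utc),
)
```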

Tests should validate external behavior boundaries, not internal polling mechanics. Keep the service-level tests (vm_create_waits, tpu_create_waits) that verify the real contract: create succeeds even when the underlying API is async. https://claude.ai/code/session_01L4bVGg6j4fw19RiADT1GhM

The separate GCPApi class added indirection without reuse — it was only used by CloudGcpService. Inline all HTTP/auth/pagination/operation-polling logic directly into CloudGcpService and delete api.py. This also fixes the controller startup bottleneck where tpu_list with no zones would scan all 15+ KNOWN_GCP_ZONES sequentially, each requiring a REST API call. The controller's autoscaler list operations were taking minutes, causing the 600s worker timeout to expire before TPUs could even be created.


yonromai

Approved. The REST rewrite is directionally right and the new integration test file passes, but I found two remaining Compute Engine operation-wait regressions worth fixing before merge:

vm_delete() now returns before the delete is durable, which can race the controller-recreate path and accidentally hand back the VM we meant to replace.
vm_update_labels() / vm_set_metadata() now return before the mutation is visible, so controller discovery tags can lag behind start_controller() returning.

Validation run:

uv run --package iris --group dev pytest lib/iris/tests/cluster/providers/gcp/test_cloud_service_integration.py
mocked repro in .codex/pr-review/PR_4393/repro_async_ops.py

Generated with Codex

yonromai · 2026-04-04T00:07:10Z

-            if "not found" not in error.lower():
-                raise _classify_gcloud_error(error)
+        url = self._instance_url(zone, name)
+        resp = self._client.delete(url, headers=self._headers())


instances.delete also returns a zonal operation, but this method returns as soon as the delete request is accepted. That changes controller replacement semantics in start_controller(): we terminate an unhealthy controller and immediately recreate the same fixed VM name, so the follow-up vm_create() can race the still-running delete, hit already exists, and hand back the VM we meant to replace.

I reproduced that with a mock transport against the current code: after vm_delete(), vm_create() returned the old instance (10.0.0.1) while the delete was still in progress.

Recommended fix: capture the delete operation name here and wait for _wait_zone_operation() before returning, so terminate() preserves the old blocking behavior.
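
A rough sketch of what waiting on the operation could look like. This is not the PR's `_wait_zone_operation`: the function name, the `get_operation` callable, and the timeout values are assumptions. It relies only on the documented shape of Compute operations, which report `status` as `PENDING`, `RUNNING`, or `DONE` and carry an `error` field on failure.

```python
import time
from typing import Any, Callable, Dict


class OperationError(Exception):
    """Raised when an operation finishes with an error payload."""


def wait_for_operation(
    get_operation: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the operation reports DONE, raising on error or timeout."""
    deadline = clock() + timeout
    while True:
        op = get_operation()
        if op.get("status") == "DONE":
            if "error" in op:
                raise OperationError(str(op["error"]))
            return op
        if clock() >= deadline:
            raise TimeoutError(f"operation still {op.get('status')} after {timeout}s")
        sleep(interval)


# Usage with a scripted sequence of operation states and no real sleeping.
states = iter([{"status": "PENDING"}, {"status": "RUNNING"}, {"status": "DONE"}])
polls = []


def fake_get() -> Dict[str, Any]:
    op = next(states)
    polls.append(op["status"])
    return op


result = wait_for_operation(fake_get, sleep=lambda _: None)
```

With this shape, a blocking delete would issue the DELETE, read the operation name from the response, and pass a closure that fetches that operation as `get_operation`.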

Generated with Codex

yonromai · 2026-04-04T00:07:10Z

+            headers=self._headers(),
+            json={"labels": current_labels, "labelFingerprint": fingerprint},
+        )
+        self._classify_response(resp)


setLabels returns a zonal operation, but we stop after the initial POST. The previous gcloud compute instances update path only returned once the mutation had landed; with the REST version, callers can observe stale instance state after vm_update_labels() returns. start_controller() is one concrete example because it sets discovery tags right before returning.

I hit the same issue with a mocked pending operation: the new label was still missing immediately after vm_update_labels() because nothing polled the operation to DONE. The same regression applies to vm_set_metadata() below.

Recommended fix: use the returned operation name and _wait_zone_operation() here and in vm_set_metadata() before returning.

Generated with Codex

vm_delete gains a wait parameter (default False) so controller replacement blocks until the old VM is fully gone, preventing a create-after-delete race on the fixed controller name. Worker deletions remain fire-and-forget. vm_update_labels and vm_set_metadata now always poll the returned operation to completion so callers observe consistent state on return.


rjpower added the agent-generated Created by automation/agent label Apr 3, 2026

chatgpt-codex-connector Bot reviewed Apr 3, 2026


rjpower force-pushed the claude/zen-matsumoto branch from 50ff2d2 to 8e48926 Compare April 3, 2026 16:40

rjpower and others added 4 commits April 3, 2026 13:11

rjpower force-pushed the claude/zen-matsumoto branch from b1c1ad2 to ad57024 Compare April 3, 2026 20:36

rjpower added 2 commits April 3, 2026 13:38

[iris] Use locations/- wildcard for project-wide TPU listing

9fe55f2

Matches gcloud's --zone=- behavior: a single API call to list TPUs across all zones instead of iterating each zone individually.
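
As an illustration of the wildcard listing, the sketch below builds the list URL with `-` as the location and recovers each node's zone from its resource name. The helper names are hypothetical; the URL follows the TPU v2 `projects/*/locations/*/nodes` path shape, and the project and node names are made up.

```python
TPU_API = "https://tpu.googleapis.com/v2"


def tpu_list_url(project: str, zone: str = "-") -> str:
    """URL for listing TPU nodes; '-' asks for every location in one call."""
    return f"{TPU_API}/projects/{project}/locations/{zone}/nodes"


def zone_of(node_name: str) -> str:
    """Extract the zone from 'projects/<p>/locations/<zone>/nodes/<id>'."""
    parts = node_name.split("/")
    return parts[parts.index("locations") + 1]


url = tpu_list_url("my-project")
zone = zone_of("projects/my-project/locations/us-central2-b/nodes/worker-0")
```

Because a wildcard request no longer implies a zone, the caller has to read the zone back from each returned resource name.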

[iris] Enable external IPs on TPU VMs created via REST API

01bdfd0

The gcloud CLI defaults to enableExternalIps=true when creating TPU VMs. The REST API does not — TPUs were created without external IPs, so the startup-script couldn't pull Docker images and workers never bootstrapped.
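
A hedged sketch of a create-request body that sets the flag explicitly. The placement under `networkConfig` follows the TPU v2 node resource as I understand it; the `tpu_node_body` helper, its parameters, and the example values are assumptions, not the PR's code.

```python
from typing import Any, Dict


def tpu_node_body(
    accelerator_type: str,
    runtime_version: str,
    startup_script: str,
    enable_external_ips: bool = True,
) -> Dict[str, Any]:
    """Build a TPU node body, setting the default that gcloud applied implicitly."""
    return {
        "acceleratorType": accelerator_type,
        "runtimeVersion": runtime_version,
        # Without external IPs the VM cannot reach the internet to pull images.
        "networkConfig": {"enableExternalIps": enable_external_ips},
        "metadata": {"startup-script": startup_script},
    }


# Example values are placeholders for illustration only.
body = tpu_node_body("v4-8", "tpu-ubuntu2204-base", "#!/bin/bash\necho boot")
```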

rjpower requested a review from yonromai April 3, 2026 23:12

rjpower changed the title ~~[iris] Replace gcloud CLI with GCPApi REST client~~ [iris] Replace gcloud CLI with REST API client in CloudGcpService Apr 3, 2026

Delete lib/iris/tests/cluster/providers/gcp/test_gcp_api.py

fd0f270

yonromai approved these changes Apr 4, 2026


rjpower merged commit 836eb81 into main Apr 6, 2026
40 of 41 checks passed

rjpower deleted the claude/zen-matsumoto branch April 6, 2026 23:45

dlwh mentioned this pull request Apr 15, 2026

[Iris] Fix restart_worker slice loss in cloud smoke #4776

Open

rjpower merged 8 commits into main from claude/zen-matsumoto


3 participants
