
[iris] Replace gcloud CLI with REST API client in CloudGcpService #4393

Merged
rjpower merged 8 commits into main from
claude/zen-matsumoto
Apr 6, 2026


@rjpower rjpower commented Apr 3, 2026

Replace all gcloud subprocess calls in CloudGcpService with direct
REST API calls using httpx + google.auth ADC. This eliminates the
gcloud CLI dependency for resource management and fixes CI failures where
gcloud alpha is not installed.

Changes

  • Inline HTTP client into CloudGcpService: Auth (ADC token caching),
    pagination, error mapping, and operation polling are private methods on
    CloudGcpService rather than a separate GCPApi class.
  • Enable external IPs on TPU VMs: The gcloud CLI defaults to
    enableExternalIps=true; the REST API does not. Without this, TPU VMs
    had no internet access and startup-scripts couldn't pull Docker images.
  • Use locations/- wildcard: Project-wide TPU listing uses a single
    API call (locations/-) instead of iterating all known zones, matching
    gcloud's --zone=- behavior.
  • Wait for async operations: VM insert and TPU create poll their
    respective operations before describing the resource.
  • Add logging_read to GcpService protocol: Bootstrap log fetching
    goes through the service boundary instead of shelling out to gcloud.
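
As a rough illustration of the `locations/-` bullet above, the list URL might be built along these lines. This is only a sketch: the helper names `tpu_nodes_url` and `zone_of` are hypothetical and not taken from the PR, and the base URL assumes the public TPU v2 endpoint.

```python
from typing import Optional

# Assumed public endpoint for the TPU v2 REST API.
TPU_API = "https://tpu.googleapis.com/v2"


def tpu_nodes_url(project: str, zone: Optional[str] = None) -> str:
    """Build the nodes list URL; no zone means the '-' wildcard (all locations)."""
    location = zone if zone else "-"
    return f"{TPU_API}/projects/{project}/locations/{location}/nodes"


def zone_of(node_name: str) -> str:
    """Recover the zone from a full node name returned by a wildcard listing."""
    parts = node_name.split("/")
    return parts[parts.index("locations") + 1]


# One request covers the whole project instead of one request per known zone.
all_zones_url = tpu_nodes_url("my-proj")
one_zone_url = tpu_nodes_url("my-proj", "us-central2-b")
```

Because a wildcard listing mixes zones, each returned node has to carry its own zone, which is why the sketch parses it back out of the resource name.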

Test plan

  • 154 unit tests pass (service validation, mock HTTP integration, platform bootstrap)
  • Cloud smoke test passes (GCP CI workflow)
  • TPU VMs boot with external IPs, startup-scripts execute, workers register

🤖 Generated with Claude Code

@rjpower rjpower added the agent-generated label (Created by automation/agent) Apr 3, 2026

rjpower commented Apr 3, 2026

🤖 Specification

Problem:
CloudGcpService shells out to gcloud CLI via subprocess for all GCP operations.
PR #4379 adds queued-resource operations using gcloud alpha, which requires
the alpha component to be installed. CI environments lack it, causing interactive
install prompts that fail in non-interactive mode (lib/iris/src/iris/cluster/providers/gcp/service.py).
Additionally, _fetch_bootstrap_logs in workers.py uses gcloud logging read via
subprocess, which is another gcloud dependency outside the service boundary.

Approach:
New file api.py: GCPApi class using httpx (direct dep) + google.auth (transitive
via gcsfs, now explicit). Handles ADC auth with token caching (same pattern as
GcpAccessTokenProvider in rpc/auth.py), pagination via nextPageToken, and error
mapping from HTTP status codes to domain exceptions (ResourceNotFoundError,
QuotaExhaustedError, InfraError).
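
The `nextPageToken` pagination mentioned above could look roughly like this. It is a minimal sketch, not the code from api.py: `paginate` and `fake_fetch` are hypothetical names, and the HTTP call is replaced by an injected callable so the example runs without network access.

```python
from typing import Any, Callable, Dict, Iterator

# A callable that performs one list request with the given query params
# and returns the decoded JSON body (stand-in for an httpx GET).
Fetch = Callable[[Dict[str, str]], Dict[str, Any]]


def paginate(fetch_page: Fetch, items_key: str) -> Iterator[Dict[str, Any]]:
    """Yield items from every page, following nextPageToken until it is absent."""
    params: Dict[str, str] = {}
    while True:
        page = fetch_page(dict(params))
        yield from page.get(items_key, [])
        token = page.get("nextPageToken")
        if not token:
            return
        params["pageToken"] = token


# Fake two-page listing keyed by the pageToken that was sent.
PAGES = {
    None: {"nodes": [{"name": "a"}], "nextPageToken": "t1"},
    "t1": {"nodes": [{"name": "b"}]},
}


def fake_fetch(params: Dict[str, str]) -> Dict[str, Any]:
    return PAGES[params.get("pageToken")]


names = [node["name"] for node in paginate(fake_fetch, "nodes")]
```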

service.py: CloudGcpService constructor takes optional GCPApi, creates one by
default. All 12 methods rewritten from subprocess.run to self._api.* calls.
vm_update_labels and vm_set_metadata become read-modify-write with fingerprints
(REST API requirement). TPU list with empty zones scans all known zones instead
of using --zone=- (REST API requires real zone in parent path).
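
The read-modify-write step for labels can be sketched as below. This is an illustration under assumptions: `update_labels` and its two callables are hypothetical stand-ins for the GET and setLabels requests, and only the request body shape (`labels` plus `labelFingerprint`) comes from the diff quoted later in this PR.

```python
from typing import Any, Callable, Dict

Instance = Dict[str, Any]


def update_labels(
    get_instance: Callable[[], Instance],
    set_labels: Callable[[Dict[str, Any]], None],
    new_labels: Dict[str, str],
) -> Dict[str, Any]:
    """Read current labels, merge the new ones, write back with the fingerprint."""
    instance = get_instance()
    merged = {**instance.get("labels", {}), **new_labels}
    body = {"labels": merged, "labelFingerprint": instance["labelFingerprint"]}
    set_labels(body)
    return body


# In-memory stand-ins for the two HTTP calls.
instance = {"labels": {"role": "worker"}, "labelFingerprint": "abc123"}
sent = []
body = update_labels(lambda: instance, sent.append, {"iris-controller": "true"})
```

The fingerprint acts as an optimistic lock: it has to be the value read in the same cycle, so a concurrent writer causes the write to be rejected rather than silently overwritten.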

workers.py: _fetch_bootstrap_logs now calls gcp_service.logging_read() instead of
subprocess. logging_read added to GcpService Protocol; CloudGcpService delegates
to api.logging_list_entries(), InMemoryGcpService returns [].

Key code:
Error mapping in api.py _classify_response: 404 -> ResourceNotFoundError,
429/RESOURCE_EXHAUSTED -> QuotaExhaustedError, everything else -> InfraError.
Delete operations treat 404 as success (idempotent). Token refresh uses
monotonic clock with 5min margin before expiry.
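
A minimal sketch of the mapping described above, assuming the JSON error body carries the canonical status under `error.status`. The exception classes here are local stand-ins with the same names as the domain exceptions, `missing_ok` is a hypothetical parameter name for the idempotent-delete case, and `outcome` is only a demo helper.

```python
import json


class ResourceNotFoundError(Exception):
    """Stand-in for the domain exception of the same name."""


class QuotaExhaustedError(Exception):
    """Stand-in for the domain exception of the same name."""


class InfraError(Exception):
    """Stand-in for the domain exception of the same name."""


def classify_response(status_code: int, text: str, *, missing_ok: bool = False) -> None:
    """Raise a domain exception for a failed response; return None on success."""
    if 200 <= status_code < 300:
        return
    if status_code == 404:
        if missing_ok:  # deletes treat "already gone" as success
            return
        raise ResourceNotFoundError(text)
    api_status = ""
    try:
        payload = json.loads(text)
        error = payload.get("error") if isinstance(payload, dict) else None
        if isinstance(error, dict):
            api_status = str(error.get("status", ""))
    except ValueError:
        pass  # non-JSON body: fall through to the generic error
    if status_code == 429 or api_status == "RESOURCE_EXHAUSTED":
        raise QuotaExhaustedError(text)
    raise InfraError(f"HTTP {status_code}: {text}")


def outcome(status_code: int, text: str, **kwargs: bool) -> str:
    """Demo helper: name of the exception raised, or 'ok'."""
    try:
        classify_response(status_code, text, **kwargs)
    except Exception as exc:
        return type(exc).__name__
    return "ok"
```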

Tests:
23 new tests in test_gcp_api.py using httpx.MockTransport. Cover URL construction
for all endpoint families (TPU, QR, Compute, Logging), error mapping (404, 429,
500, non-JSON), pagination, auth header injection, token refresh, aggregatedList
flattening, and force-delete params. All 151 existing GCP provider tests pass unchanged.
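
One way to make the token-refresh behaviour testable is to inject both the clock and the credential fetch. The sketch below assumes that design; `TokenCache` and `fake_fetch` are hypothetical names, and the fetch callable stands in for whatever google.auth refresh the real code performs. Only the monotonic clock and the 5-minute margin come from the spec above.

```python
import time
from typing import Callable, List, Optional, Tuple

REFRESH_MARGIN_S = 300.0  # refresh 5 minutes before expiry


class TokenCache:
    """Cache a bearer token and refresh it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # returns (token, lifetime in seconds)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - REFRESH_MARGIN_S:
            self._token, lifetime = self._fetch()
            self._expires_at = now + lifetime
        return self._token


# Fake clock and fetcher so the refresh boundary can be exercised directly.
now = [0.0]
calls: List[float] = []


def fake_fetch() -> Tuple[str, float]:
    calls.append(now[0])
    return f"tok{len(calls)}", 3600.0


cache = TokenCache(fake_fetch, clock=lambda: now[0])
first = cache.get()    # initial fetch at t=0, expires at t=3600
now[0] = 3299.0
second = cache.get()   # still outside the 5-minute margin
now[0] = 3300.0
third = cache.get()    # inside the margin, so a refresh happens
```

Using a monotonic clock for the cache keeps the refresh decision immune to wall-clock adjustments on the host.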


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 50ff2d2668

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread lib/iris/src/iris/cluster/providers/gcp/service.py Outdated
Comment thread lib/iris/src/iris/cluster/providers/gcp/service.py
Comment thread lib/iris/src/iris/cluster/providers/gcp/workers.py Outdated
@rjpower rjpower force-pushed the claude/zen-matsumoto branch from 50ff2d2 to 8e48926 April 3, 2026 16:40
rjpower and others added 4 commits April 3, 2026 13:11
Extract GCPApi class (httpx + google.auth ADC) that handles auth, pagination,
and error mapping for TPU v2, Compute v1, and Cloud Logging APIs. Rewrite
CloudGcpService to delegate to GCPApi instead of subprocess gcloud calls.
This eliminates the gcloud CLI dependency for resource management, fixing CI
failures from gcloud alpha not being installed. Add logging_read to the
GcpService Protocol so bootstrap log fetching goes through the same boundary.

vm_create and tpu_create called the REST API (which returns async
operations) then immediately tried to describe the resource. The old
gcloud CLI blocked until operations completed; the REST API does not.
This caused workers to fail with "created but could not be described"
because the VM/TPU wasn't visible yet.

Add operation polling to GCPApi (_wait_zone_operation for Compute,
_wait_tpu_operation for TPU LROs), with instance_insert_wait and
tpu_create_wait convenience methods. Update CloudGcpService to use
the waiting variants.
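
The polling described above might be sketched like this. It is not the PR's implementation: `wait_operation`, `compute_done`, and `tpu_done` are hypothetical names, the timeout and interval defaults are arbitrary, and the completion checks assume Compute operations report `status == "DONE"` while TPU long-running operations report `done: true`.

```python
import time
from typing import Any, Callable, Dict, List

Operation = Dict[str, Any]


def compute_done(op: Operation) -> bool:
    """Compute Engine zone operations finish with status DONE."""
    return op.get("status") == "DONE"


def tpu_done(op: Operation) -> bool:
    """TPU long-running operations finish with done=true."""
    return bool(op.get("done"))


def wait_operation(
    get_op: Callable[[], Operation],
    is_done: Callable[[Operation], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Operation:
    """Poll until the operation completes, fails, or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        op = get_op()
        if is_done(op):
            if op.get("error"):
                raise RuntimeError(f"operation failed: {op['error']}")
            return op
        if clock() >= deadline:
            raise TimeoutError("operation did not complete in time")
        sleep(poll_interval_s)


# Fake operation that needs three polls to finish; sleep is a no-op for the demo.
states = iter([{"status": "PENDING"}, {"status": "RUNNING"}, {"status": "DONE"}])
polls: List[str] = []


def fake_get() -> Operation:
    op = next(states)
    polls.append(op["status"])
    return op


result = wait_operation(fake_get, compute_done, sleep=lambda _s: None)
```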

Also fix _fetch_bootstrap_logs timestamp filter: Cloud Logging needs
RFC3339 timestamps, not "-PT30M" duration literals.
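
For the timestamp fix, a lookback window can be turned into an RFC3339 cutoff before it goes into the filter string. A minimal sketch, assuming a hypothetical helper `since_filter` and an illustrative base filter that is not taken from workers.py:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def since_filter(base_filter: str, lookback: timedelta, now: Optional[datetime] = None) -> str:
    """Append a 'timestamp >= <RFC3339>' clause covering the lookback window."""
    current = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    cutoff = (current - lookback).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'{base_filter} AND timestamp >= "{cutoff}"'


# Fixed "now" so the output is deterministic.
fixed_now = datetime(2026, 4, 3, 16, 40, tzinfo=timezone.utc)
log_filter = since_filter('resource.type="gce_instance"', timedelta(minutes=30), now=fixed_now)
```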

https://claude.ai/code/session_01L4bVGg6j4fw19RiADT1GhM

Tests should validate external behavior boundaries, not internal
polling mechanics. Keep the service-level tests (vm_create_waits,
tpu_create_waits) that verify the real contract: create succeeds
even when the underlying API is async.

https://claude.ai/code/session_01L4bVGg6j4fw19RiADT1GhM

The separate GCPApi class added indirection without reuse — it was only
used by CloudGcpService. Inline all HTTP/auth/pagination/operation-polling
logic directly into CloudGcpService and delete api.py.

This also fixes the controller startup bottleneck where tpu_list with no
zones would scan all 15+ KNOWN_GCP_ZONES sequentially, each requiring a
REST API call. The controller's autoscaler list operations were taking
minutes, causing the 600s worker timeout to expire before TPUs could even
be created.
@rjpower rjpower force-pushed the claude/zen-matsumoto branch from b1c1ad2 to ad57024 April 3, 2026 20:36
rjpower added 2 commits April 3, 2026 13:38
Matches gcloud's --zone=- behavior: a single API call to list TPUs
across all zones instead of iterating each zone individually.

The gcloud CLI defaults to enableExternalIps=true when creating TPU VMs.
The REST API does not — TPUs were created without external IPs, so the
startup-script couldn't pull Docker images and workers never bootstrapped.
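
As an illustration of this fix, the create request body has to opt in to external IPs explicitly. The sketch below assumes the flag lives under `networkConfig` in the TPU v2 node resource; `tpu_node_body` is a hypothetical helper, and the accelerator type and runtime version are example values only.

```python
from typing import Any, Dict


def tpu_node_body(
    accelerator_type: str,
    runtime_version: str,
    *,
    external_ips: bool = True,
) -> Dict[str, Any]:
    """Build a TPU node create body that requests external IPs by default."""
    return {
        "acceleratorType": accelerator_type,
        "runtimeVersion": runtime_version,
        # The REST API does not default this to true the way the gcloud CLI does.
        "networkConfig": {"enableExternalIps": external_ips},
    }


body = tpu_node_body("v4-8", "tpu-ubuntu2204-base")
```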
@rjpower rjpower requested a review from yonromai April 3, 2026 23:12
@rjpower rjpower changed the title from "[iris] Replace gcloud CLI with GCPApi REST client" to "[iris] Replace gcloud CLI with REST API client in CloudGcpService" Apr 3, 2026

@yonromai yonromai left a comment


Approved. The REST rewrite is directionally right and the new integration test file passes, but I found two remaining Compute Engine operation-wait regressions worth fixing before merge:

  • vm_delete() now returns before the delete is durable, which can race the controller-recreate path and accidentally hand back the VM we meant to replace.
  • vm_update_labels() / vm_set_metadata() now return before the mutation is visible, so controller discovery tags can lag behind start_controller() returning.

Validation run:

  • uv run --package iris --group dev pytest lib/iris/tests/cluster/providers/gcp/test_cloud_service_integration.py
  • mocked repro in .codex/pr-review/PR_4393/repro_async_ops.py

Generated with Codex

if "not found" not in error.lower():
    raise _classify_gcloud_error(error)
url = self._instance_url(zone, name)
resp = self._client.delete(url, headers=self._headers())

instances.delete also returns a zonal operation, but this method returns as soon as the delete request is accepted. That changes controller replacement semantics in start_controller(): we terminate an unhealthy controller and immediately recreate the same fixed VM name, so the follow-up vm_create() can race the still-running delete, hit already exists, and hand back the VM we meant to replace.

I reproduced that with a mock transport against the current code: after vm_delete(), vm_create() returned the old instance (10.0.0.1) while the delete was still in progress.

Recommended fix: capture the delete operation name here and wait for _wait_zone_operation() before returning, so terminate() preserves the old blocking behavior.

Generated with Codex

headers=self._headers(),
json={"labels": current_labels, "labelFingerprint": fingerprint},
)
self._classify_response(resp)

setLabels returns a zonal operation, but we stop after the initial POST. The previous gcloud compute instances update path only returned once the mutation had landed; with the REST version, callers can observe stale instance state after vm_update_labels() returns. start_controller() is one concrete example because it sets discovery tags right before returning.

I hit the same issue with a mocked pending operation: the new label was still missing immediately after vm_update_labels() because nothing polled the operation to DONE. The same regression applies to vm_set_metadata() below.

Recommended fix: use the returned operation name and _wait_zone_operation() here and in vm_set_metadata() before returning.

Generated with Codex

vm_delete gains a wait parameter (default False) so controller
replacement blocks until the old VM is fully gone, preventing a
create-after-delete race on the fixed controller name. Worker
deletions remain fire-and-forget. vm_update_labels and
vm_set_metadata now always poll the returned operation to
completion so callers observe consistent state on return.
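
The resulting delete semantics can be sketched as follows. Only the `wait` parameter and its default of False come from the commit message above; the function shape, the injected callables, and the operation names are hypothetical stand-ins for the HTTP DELETE and the operation poller.

```python
from typing import Any, Callable, Dict, List, Tuple

# Stand-in for the HTTP DELETE: returns (status code, decoded operation body).
SendDelete = Callable[[], Tuple[int, Dict[str, Any]]]


def vm_delete(
    send_delete: SendDelete,
    wait_for: Callable[[str], None],
    *,
    wait: bool = False,
) -> None:
    """Delete a VM; optionally block until the delete operation completes."""
    status, payload = send_delete()
    if status == 404:
        return  # already gone: deletes are idempotent
    if status >= 400:
        raise RuntimeError(f"delete failed with HTTP {status}")
    if wait:
        wait_for(payload["name"])  # poll the returned zone operation


waited: List[str] = []

# Worker deletion: fire-and-forget, nothing is polled.
vm_delete(lambda: (200, {"name": "op-1"}), waited.append)
after_worker = list(waited)

# Controller replacement: block so the fixed VM name is free before re-creating.
vm_delete(lambda: (200, {"name": "op-2"}), waited.append, wait=True)

# Missing VM: success without polling, even when wait=True.
vm_delete(lambda: (404, {}), waited.append, wait=True)
```

Keeping the default at fire-and-forget avoids slowing down bulk worker teardown, while the controller path pays the latency only where the create-after-delete race matters.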
@rjpower rjpower merged commit 836eb81 into main Apr 6, 2026
40 of 41 checks passed
@rjpower rjpower deleted the claude/zen-matsumoto branch April 6, 2026 23:45
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
