[iris] Replace gcloud CLI with REST API client in CloudGcpService#4393
Conversation
🤖 Specification

Problem / Approach:
- `service.py`: `CloudGcpService` constructor takes an optional `GCPApi`, creates one by …
- `workers.py`: `_fetch_bootstrap_logs` now calls `gcp_service.logging_read()` instead of …
- Key code: …
- Tests: …
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 50ff2d2668
Force-pushed: 50ff2d2 → 8e48926
Extract GCPApi class (httpx + google.auth ADC) that handles auth, pagination, and error mapping for TPU v2, Compute v1, and Cloud Logging APIs. Rewrite CloudGcpService to delegate to GCPApi instead of subprocess gcloud calls. This eliminates the gcloud CLI dependency for resource management, fixing CI failures from gcloud alpha not being installed. Add logging_read to the GcpService Protocol so bootstrap log fetching goes through the same boundary.
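The ADC token-caching part of the auth boundary can be sketched as below. The credentials shape (`.valid`, `.token`, `.refresh(request)`) matches google.auth's credential objects, which in the real service would come from `google.auth.default()`; the class name here is illustrative, and a stub with the same attributes works without any network access.

```python
class TokenHeaders:
    """Build Authorization headers, refreshing the cached token when stale.

    `credentials` follows the google.auth shape: .valid, .token, and
    .refresh(request). Refresh only happens when the cached token is
    missing or expired, so repeated calls reuse one token.
    """

    def __init__(self, credentials, request=None):
        self._creds = credentials
        self._request = request

    def __call__(self) -> dict:
        if not self._creds.valid:
            self._creds.refresh(self._request)
        return {"Authorization": f"Bearer {self._creds.token}"}
```

Each REST call would then pass `headers=self._headers()` built this way, so token refresh stays invisible to the TPU/Compute/Logging call sites.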
vm_create and tpu_create called the REST API (which returns async operations) then immediately tried to describe the resource. The old gcloud CLI blocked until operations completed; the REST API does not. This caused workers to fail with "created but could not be described" because the VM/TPU wasn't visible yet. Add operation polling to GCPApi (_wait_zone_operation for Compute, _wait_tpu_operation for TPU LROs), with instance_insert_wait and tpu_create_wait convenience methods. Update CloudGcpService to use the waiting variants. Also fix _fetch_bootstrap_logs timestamp filter: Cloud Logging needs RFC3339 timestamps, not "-PT30M" duration literals. https://claude.ai/code/session_01L4bVGg6j4fw19RiADT1GhM
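The timestamp fix can be sketched roughly like this (the helper name is illustrative, not from the PR): Cloud Logging filters compare against RFC3339 timestamps, so the "last 30 minutes" window has to be computed client-side rather than passed as a duration literal.

```python
from datetime import datetime, timedelta, timezone


def freshness_filter(minutes: int = 30) -> str:
    """Build a Cloud Logging timestamp filter for the last N minutes.

    Cloud Logging filter expressions need RFC3339 timestamps, e.g.
    2024-01-01T00:00:00Z; duration literals like "-PT30M" are rejected.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=minutes)
    # RFC3339 with a trailing "Z" instead of "+00:00"
    return f'timestamp >= "{cutoff.strftime("%Y-%m-%dT%H:%M:%SZ")}"'
```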
Tests should validate external behavior boundaries, not internal polling mechanics. Keep the service-level tests (vm_create_waits, tpu_create_waits) that verify the real contract: create succeeds even when the underlying API is async. https://claude.ai/code/session_01L4bVGg6j4fw19RiADT1GhM
The separate GCPApi class added indirection without reuse — it was only used by CloudGcpService. Inline all HTTP/auth/pagination/operation-polling logic directly into CloudGcpService and delete api.py. This also fixes the controller startup bottleneck where tpu_list with no zones would scan all 15+ KNOWN_GCP_ZONES sequentially, each requiring a REST API call. The controller's autoscaler list operations were taking minutes, causing the 600s worker timeout to expire before TPUs could even be created.
Force-pushed: b1c1ad2 → ad57024
Matches gcloud's --zone=- behavior: a single API call to list TPUs across all zones instead of iterating each zone individually.
The gcloud CLI defaults to enableExternalIps=true when creating TPU VMs. The REST API does not — TPUs were created without external IPs, so the startup-script couldn't pull Docker images and workers never bootstrapped.
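A minimal sketch of the create body with the flag set explicitly; the field names follow the TPU v2 Node resource, but the helper itself is illustrative and omits most node fields.

```python
def tpu_node_body(runtime_version: str, accelerator_type: str,
                  startup_script: str) -> dict:
    """Minimal TPU VM create body.

    networkConfig.enableExternalIps must be set explicitly: the REST
    default is False, unlike the gcloud CLI, and without an external IP
    the startup-script cannot pull Docker images.
    """
    return {
        "acceleratorType": accelerator_type,
        "runtimeVersion": runtime_version,
        "networkConfig": {"enableExternalIps": True},
        "metadata": {"startup-script": startup_script},
    }
```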
yonromai left a comment
Approved. The REST rewrite is directionally right and the new integration test file passes, but I found two remaining Compute Engine operation-wait regressions worth fixing before merge:
- `vm_delete()` now returns before the delete is durable, which can race the controller-recreate path and accidentally hand back the VM we meant to replace.
- `vm_update_labels()` / `vm_set_metadata()` now return before the mutation is visible, so controller discovery tags can lag behind `start_controller()` returning.
Validation run:
- `uv run --package iris --group dev pytest lib/iris/tests/cluster/providers/gcp/test_cloud_service_integration.py`
- mocked repro in `.codex/pr-review/PR_4393/repro_async_ops.py`
Generated with Codex
```python
if "not found" not in error.lower():
    raise _classify_gcloud_error(error)
url = self._instance_url(zone, name)
resp = self._client.delete(url, headers=self._headers())
```
`instances.delete` also returns a zonal operation, but this method returns as soon as the delete request is accepted. That changes controller replacement semantics in `start_controller()`: we terminate an unhealthy controller and immediately recreate the same fixed VM name, so the follow-up `vm_create()` can race the still-running delete, hit `already exists`, and hand back the VM we meant to replace.
I reproduced that with a mock transport against the current code: after `vm_delete()`, `vm_create()` returned the old instance (10.0.0.1) while the delete was still in progress.
Recommended fix: capture the delete operation name here and wait for `_wait_zone_operation()` before returning, so `terminate()` preserves the old blocking behavior.
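A stand-alone sketch of that wait, assuming the operation resource follows the Compute zonal-operation shape (a `status` field that reaches `"DONE"`, plus an optional `error`). The fetch function is injected so the poller stays transport-agnostic; in the service it would GET the operation via the same httpx client.

```python
import time


def wait_zone_operation(get_operation, name: str, timeout: float = 300.0,
                        poll_interval: float = 2.0, sleep=time.sleep) -> dict:
    """Poll a Compute zonal operation until its status is "DONE".

    `get_operation(name)` fetches the operation resource as a dict.
    Raises TimeoutError if the deadline passes, or RuntimeError if the
    finished operation carries an error payload.
    """
    deadline = time.monotonic() + timeout
    while True:
        op = get_operation(name)
        if op.get("status") == "DONE":
            if "error" in op:
                raise RuntimeError(f"operation {name} failed: {op['error']}")
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation {name} not DONE after {timeout}s")
        sleep(poll_interval)
```

With this in place, `vm_delete()` would extract the operation `name` from the DELETE response and call the waiter before returning, restoring the old blocking semantics.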
Generated with Codex
```python
    headers=self._headers(),
    json={"labels": current_labels, "labelFingerprint": fingerprint},
)
self._classify_response(resp)
```
`setLabels` returns a zonal operation, but we stop after the initial POST. The previous `gcloud compute instances update` path only returned once the mutation had landed; with the REST version, callers can observe stale instance state after `vm_update_labels()` returns. `start_controller()` is one concrete example because it sets discovery tags right before returning.
I hit the same issue with a mocked pending operation: the new label was still missing immediately after `vm_update_labels()` because nothing polled the operation to DONE. The same regression applies to `vm_set_metadata()` below.
Recommended fix: use the returned operation name and `_wait_zone_operation()` here and in `vm_set_metadata()` before returning.
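The request-body side of that call can be sketched as follows (helper name illustrative). The `labelFingerprint` from the latest instance read must be echoed back, since Compute uses it for optimistic concurrency; the POST then returns a zonal operation whose `name` should be polled to DONE before the method returns.

```python
def set_labels_request(current_labels: dict, fingerprint: str,
                       new_labels: dict) -> dict:
    """Body for Compute instances.setLabels.

    Merges the new labels over the instance's current labels and echoes
    back the labelFingerprint; the API rejects stale fingerprints.
    """
    merged = {**current_labels, **new_labels}
    return {"labels": merged, "labelFingerprint": fingerprint}
```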
Generated with Codex
vm_delete gains a wait parameter (default False) so controller replacement blocks until the old VM is fully gone, preventing a create-after-delete race on the fixed controller name. Worker deletions remain fire-and-forget. vm_update_labels and vm_set_metadata now always poll the returned operation to completion so callers observe consistent state on return.
Replace all `gcloud` subprocess calls in `CloudGcpService` with direct REST API calls using `httpx` + `google.auth` ADC. This eliminates the gcloud CLI dependency for resource management and fixes CI failures where `gcloud alpha` is not installed.

## Changes

- **Inline HTTP client into CloudGcpService**: Auth (ADC token caching), pagination, error mapping, and operation polling are private methods on `CloudGcpService` rather than a separate `GCPApi` class.
- **Enable external IPs on TPU VMs**: The gcloud CLI defaults to `enableExternalIps=true`; the REST API does not. Without this, TPU VMs had no internet access and startup-scripts couldn't pull Docker images.
- **Use `locations/-` wildcard**: Project-wide TPU listing uses a single API call (`locations/-`) instead of iterating all known zones, matching gcloud's `--zone=-` behavior.
- **Wait for async operations**: VM insert and TPU create poll their respective operations before describing the resource.
- **Add `logging_read` to GcpService protocol**: Bootstrap log fetching goes through the service boundary instead of shelling out to gcloud.
Test plan
🤖 Generated with Claude Code