iris: eliminate Platform abstraction, reorganize into Service + Provider layers#3900
iris: eliminate Platform abstraction, reorganize into Service + Provider layers#3900
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
dbcd1c7 to
7731df4
Compare
b064d24 to
6436ae8
Compare
yonromai
left a comment
There was a problem hiding this comment.
Approved with follow-ups. The provider/service split looks coherent overall, and the iris cluster suite plus the local CLI e2e test both passed in this worktree. I did find one workflow-facing regression (cluster start-smoke still calls the removed IrisConfig.platform()) and one smaller CLI diagnostics regression (iris job run --config ... no longer triggers provider debug reports).
Generated with Codex.
| return self.provider_bundle().workers | ||
|
|
||
| return create_platform( | ||
| def provider_bundle(self): |
There was a problem hiding this comment.
P1: Removing IrisConfig.platform() breaks the still-live cluster start-smoke path. cluster_start_smoke() in lib/iris/src/iris/cli/cluster.py still calls iris_config.platform(), and both smoke workflows shell out to that command, so this now fails with AttributeError before the controller starts. Either keep a temporary compatibility shim for platform() in this PR or switch start-smoke over to provider_bundle().controller like the other CLI entry points.
Generated with Codex.
| platform = iris_config.platform() | ||
| ctx.obj["platform"] = platform | ||
| bundle = iris_config.provider_bundle() | ||
| ctx.obj["provider_bundle"] = bundle |
There was a problem hiding this comment.
P2: require_controller_url() now stores provider_bundle on ctx.obj, but iris.cli.job.run still looks up ctx.obj["platform"] before calling debug_report(). That silently drops the provider post-mortem on failed iris job run --config ... invocations. Keeping the old key as an alias, or updating job.py to read bundle.controller, would preserve the existing diagnostics behavior.
Generated with Codex.
…n, and test cleanup
- Extract GcpService/K8sService protocols with DRY_RUN/LOCAL/CLOUD modes
- Reorganize providers into providers/{gcp,k8s,manual}/ structure
- Eliminate Platform abstraction, reorganize into Service + Provider layers
- Replace FakeGcloud/FakeKubectl/mock_kubectl with real service implementations
- Parameterized GCP+K8s test harness, K8s LOCAL mode, e2e migration
- Split and merge test files for better focus (autoscaler, scheduler, transitions)
- Delete dead code: FakeWorkerProvider, LocalPlatform, SubprocessK8s escape hatch
- cluster start-smoke: replace removed iris_config.platform() with provider_bundle().controller (fixes cloud-smoke-test CI failure) - job run: read ctx.obj["provider_bundle"] instead of ctx.obj["platform"] for post-failure debug_report() - Extract _spawn_bootstrap_thread() to deduplicate identical bootstrap thread launching in _create_tpu_slice() and _create_vm_slice() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TPU slice names were constructed from raw scale group names which contain underscores and zone suffixes (e.g. tpu_v5e_16-europe-west4-b), producing names that exceed the 63-char GCE limit and contain invalid characters. The VM slice path already had _build_vm_slice_id for this. Rename _build_vm_slice_id -> _build_gce_resource_name and use it for all slice types (TPU, VM, local) so names are normalized to lowercase alphanumeric + hyphens and truncated to fit within GCE limits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InMemoryGcpService.create_local_slice() was not validating slice names against GCE naming rules, so local tests silently accepted names with underscores or >63 chars that would fail on real GCP. Add validation so local mode has the same fidelity as cloud mode. Add test_smoke_gcp_config_boots_locally: loads the real smoke-gcp.yaml, converts to local mode, and verifies workers join. This catches naming issues from zone expansion without needing real GCP resources. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
055f2a7 to
d7cf593
Compare
PR #3900 moved LocalCluster from iris.cluster.local_cluster to iris.cluster.providers.local.cluster. Update the two test files that still referenced the old path.
## Summary - `scripts/iris/dev_tpu.py` still called `IrisConfig.platform()`, which was removed in #3900 - Replace with `provider_bundle().controller` to match the current `IrisConfig` API Fixes #4394 ## Test plan - [x] `uv run python scripts/iris/dev_tpu.py --help` — CLI loads without import errors - [x] `uv run pytest lib/iris/tests/test_dev_tpu.py` — 4/4 pass - [x] `uv run python scripts/iris/dev_tpu.py --config lib/iris/examples/marin.yaml --tpu-name test-dev-tpu allocate --tpu-type v6e-8` — tunnel established, job submitted to live cluster 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…der layers (#3900) Eliminates the monolithic `Platform` protocol and reorganizes Iris infrastructure code into clean **Service** and **Provider** layers. ### Architecture | Layer | Purpose | Examples | |-------|---------|---------| | **Service** | External service wrapper + fake | `GcpService`/`CloudGcpService`/`InMemoryGcpService`, `K8sService`/`CloudK8sService`/`InMemoryK8sService` | | **Provider** | Iris semantic layer | `ControllerProvider`, `WorkerInfraProvider`, `TaskProvider` | **Provider Hierarchy:** - **GCP**: `GcpControllerProvider` + `GcpWorkerProvider` (backed by `GcpService`) - **K8s**: `K8sControllerProvider` + `K8sTaskProvider` (backed by `K8sService`, no autoscaler/VMs/slices) - **Manual**: `ManualControllerProvider` + `ManualWorkerProvider` - **Local**: `LocalCluster` (uses `GcpWorkerProvider` + `InMemoryGcpService(LOCAL)`) ### Key Changes - **Delete** `platform/` directory entirely (monolithic Platform protocol, GcpPlatform, CoreweavePlatform, etc.) - **Delete** `cluster/k8s/` directory (moved into `providers/k8s/`) - **New** `providers/` package with clean boundaries: - `providers/protocols.py` — `ControllerProvider`, `WorkerInfraProvider` - `providers/types.py` — handle protocols, exceptions (`InfraError`), status types - `providers/factory.py` — `ProviderBundle` + `create_provider_bundle` - `providers/gcp/` — controller, workers, handles, service, fake, bootstrap - `providers/k8s/` — controller, tasks, service, fake, types, constants - `providers/manual/` — controller + worker providers - `providers/local/` — LocalCluster - **Rename** `PlatformError` → `InfraError` across all consumers - **Autoscaler**: `platform: Platform` → `platform: WorkerInfraProvider` - **vm_lifecycle**: `resolve_image` passed as `Callable` to bootstrap - K8s path no longer forced through GCP concepts (no VMs, no slices)
## Summary - `scripts/iris/dev_tpu.py` still called `IrisConfig.platform()`, which was removed in #3900 - Replace with `provider_bundle().controller` to match the current `IrisConfig` API Fixes #4394 ## Test plan - [x] `uv run python scripts/iris/dev_tpu.py --help` — CLI loads without import errors - [x] `uv run pytest lib/iris/tests/test_dev_tpu.py` — 4/4 pass - [x] `uv run python scripts/iris/dev_tpu.py --config lib/iris/examples/marin.yaml --tpu-name test-dev-tpu allocate --tpu-type v6e-8` — tunnel established, job submitted to live cluster 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminates the monolithic
Platformprotocol and reorganizes Iris infrastructure code into clean Service and Provider layers.Architecture
GcpService/CloudGcpService/InMemoryGcpService,K8sService/CloudK8sService/InMemoryK8sServiceControllerProvider,WorkerInfraProvider,TaskProviderProvider Hierarchy:
GcpControllerProvider+GcpWorkerProvider(backed byGcpService)K8sControllerProvider+K8sTaskProvider(backed byK8sService, no autoscaler/VMs/slices)ManualControllerProvider+ManualWorkerProviderLocalCluster(usesGcpWorkerProvider+InMemoryGcpService(LOCAL))Key Changes
platform/directory entirely (monolithic Platform protocol, GcpPlatform, CoreweavePlatform, etc.)cluster/k8s/directory (moved intoproviders/k8s/)providers/package with clean boundaries:providers/protocols.py—ControllerProvider,WorkerInfraProviderproviders/types.py— handle protocols, exceptions (InfraError), status typesproviders/factory.py—ProviderBundle+create_provider_bundleproviders/gcp/— controller, workers, handles, service, fake, bootstrapproviders/k8s/— controller, tasks, service, fake, types, constantsproviders/manual/— controller + worker providersproviders/local/— LocalClusterPlatformError→InfraErroracross all consumersplatform: Platform→platform: WorkerInfraProviderresolve_imagepassed asCallableto bootstrap