iris: eliminate Platform abstraction, reorganize into Service + Provider layers by rjpower · Pull Request #3900 · marin-community/marin

rjpower · 2026-03-20T03:46:08Z

Eliminates the monolithic Platform protocol and reorganizes Iris infrastructure code into clean Service and Provider layers.

Architecture

Layer	Purpose	Examples
Service	External service wrapper + fake	`GcpService`/`CloudGcpService`/`InMemoryGcpService`, `K8sService`/`CloudK8sService`/`InMemoryK8sService`
Provider	Iris semantic layer	`ControllerProvider`, `WorkerInfraProvider`, `TaskProvider`

Provider Hierarchy:

GCP: GcpControllerProvider + GcpWorkerProvider (backed by GcpService)
K8s: K8sControllerProvider + K8sTaskProvider (backed by K8sService, no autoscaler/VMs/slices)
Manual: ManualControllerProvider + ManualWorkerProvider
Local: LocalCluster (uses GcpWorkerProvider + InMemoryGcpService(LOCAL))

Key Changes

Delete platform/ directory entirely (monolithic Platform protocol, GcpPlatform, CoreweavePlatform, etc.)
Delete cluster/k8s/ directory (moved into providers/k8s/)
New providers/ package with clean boundaries:
- providers/protocols.py — ControllerProvider, WorkerInfraProvider
- providers/types.py — handle protocols, exceptions (InfraError), status types
- providers/factory.py — ProviderBundle + create_provider_bundle
- providers/gcp/ — controller, workers, handles, service, fake, bootstrap
- providers/k8s/ — controller, tasks, service, fake, types, constants
- providers/manual/ — controller + worker providers
- providers/local/ — LocalCluster
Rename PlatformError → InfraError across all consumers
Autoscaler: platform: Platform → platform: WorkerInfraProvider
vm_lifecycle: resolve_image passed as Callable to bootstrap
K8s path no longer forced through GCP concepts (no VMs, no slices)

chatgpt-codex-connector · 2026-03-20T03:46:14Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Repo admins can enable using credits for code reviews in their settings.

yonromai

Approved with follow-ups. The provider/service split looks coherent overall, and the iris cluster suite plus the local CLI e2e test both passed in this worktree. I did find one workflow-facing regression (cluster start-smoke still calls the removed IrisConfig.platform()) and one smaller CLI diagnostics regression (iris job run --config ... no longer triggers provider debug reports).

Generated with Codex.

yonromai · 2026-03-23T17:36:05Z

+        return self.provider_bundle().workers

-        return create_platform(
+    def provider_bundle(self):


P1: Removing IrisConfig.platform() breaks the still-live cluster start-smoke path. cluster_start_smoke() in lib/iris/src/iris/cli/cluster.py still calls iris_config.platform(), and both smoke workflows shell out to that command, so this now fails with AttributeError before the controller starts. Either keep a temporary compatibility shim for platform() in this PR or switch start-smoke over to provider_bundle().controller like the other CLI entry points.

Generated with Codex.

yonromai · 2026-03-23T17:36:05Z

-        platform = iris_config.platform()
-        ctx.obj["platform"] = platform
+        bundle = iris_config.provider_bundle()
+        ctx.obj["provider_bundle"] = bundle


P2: require_controller_url() now stores provider_bundle on ctx.obj, but iris.cli.job.run still looks up ctx.obj["platform"] before calling debug_report(). That silently drops the provider post-mortem on failed iris job run --config ... invocations. Keeping the old key as an alias, or updating job.py to read bundle.controller, would preserve the existing diagnostics behavior.

Generated with Codex.

…n, and test cleanup - Extract GcpService/K8sService protocols with DRY_RUN/LOCAL/CLOUD modes - Reorganize providers into providers/{gcp,k8s,manual}/ structure - Eliminate Platform abstraction, reorganize into Service + Provider layers - Replace FakeGcloud/FakeKubectl/mock_kubectl with real service implementations - Parameterized GCP+K8s test harness, K8s LOCAL mode, e2e migration - Split and merge test files for better focus (autoscaler, scheduler, transitions) - Delete dead code: FakeWorkerProvider, LocalPlatform, SubprocessK8s escape hatch

- cluster start-smoke: replace removed iris_config.platform() with provider_bundle().controller (fixes cloud-smoke-test CI failure) - job run: read ctx.obj["provider_bundle"] instead of ctx.obj["platform"] for post-failure debug_report() - Extract _spawn_bootstrap_thread() to deduplicate identical bootstrap thread launching in _create_tpu_slice() and _create_vm_slice() Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

TPU slice names were constructed from raw scale group names which contain underscores and zone suffixes (e.g. tpu_v5e_16-europe-west4-b), producing names that exceed the 63-char GCE limit and contain invalid characters. The VM slice path already had _build_vm_slice_id for this. Rename _build_vm_slice_id -> _build_gce_resource_name and use it for all slice types (TPU, VM, local) so names are normalized to lowercase alphanumeric + hyphens and truncated to fit within GCE limits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

InMemoryGcpService.create_local_slice() was not validating slice names against GCE naming rules, so local tests silently accepted names with underscores or >63 chars that would fail on real GCP. Add validation so local mode has the same fidelity as cloud mode. Add test_smoke_gcp_config_boots_locally: loads the real smoke-gcp.yaml, converts to local mode, and verifies workers join. This catches naming issues from zone expansion without needing real GCP resources. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR #3900 moved LocalCluster from iris.cluster.local_cluster to iris.cluster.providers.local.cluster. Update the two test files that still referenced the old path.

## Summary - `scripts/iris/dev_tpu.py` still called `IrisConfig.platform()`, which was removed in #3900 - Replace with `provider_bundle().controller` to match the current `IrisConfig` API Fixes #4394 ## Test plan - [x] `uv run python scripts/iris/dev_tpu.py --help` — CLI loads without import errors - [x] `uv run pytest lib/iris/tests/test_dev_tpu.py` — 4/4 pass - [x] `uv run python scripts/iris/dev_tpu.py --config lib/iris/examples/marin.yaml --tpu-name test-dev-tpu allocate --tpu-type v6e-8` — tunnel established, job submitted to live cluster 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…der layers (#3900) Eliminates the monolithic `Platform` protocol and reorganizes Iris infrastructure code into clean **Service** and **Provider** layers. ### Architecture | Layer | Purpose | Examples | |-------|---------|---------| | **Service** | External service wrapper + fake | `GcpService`/`CloudGcpService`/`InMemoryGcpService`, `K8sService`/`CloudK8sService`/`InMemoryK8sService` | | **Provider** | Iris semantic layer | `ControllerProvider`, `WorkerInfraProvider`, `TaskProvider` | **Provider Hierarchy:** - **GCP**: `GcpControllerProvider` + `GcpWorkerProvider` (backed by `GcpService`) - **K8s**: `K8sControllerProvider` + `K8sTaskProvider` (backed by `K8sService`, no autoscaler/VMs/slices) - **Manual**: `ManualControllerProvider` + `ManualWorkerProvider` - **Local**: `LocalCluster` (uses `GcpWorkerProvider` + `InMemoryGcpService(LOCAL)`) ### Key Changes - **Delete** `platform/` directory entirely (monolithic Platform protocol, GcpPlatform, CoreweavePlatform, etc.) - **Delete** `cluster/k8s/` directory (moved into `providers/k8s/`) - **New** `providers/` package with clean boundaries: - `providers/protocols.py` — `ControllerProvider`, `WorkerInfraProvider` - `providers/types.py` — handle protocols, exceptions (`InfraError`), status types - `providers/factory.py` — `ProviderBundle` + `create_provider_bundle` - `providers/gcp/` — controller, workers, handles, service, fake, bootstrap - `providers/k8s/` — controller, tasks, service, fake, types, constants - `providers/manual/` — controller + worker providers - `providers/local/` — LocalCluster - **Rename** `PlatformError` → `InfraError` across all consumers - **Autoscaler**: `platform: Platform` → `platform: WorkerInfraProvider` - **vm_lifecycle**: `resolve_image` passed as `Callable` to bootstrap - K8s path no longer forced through GCP concepts (no VMs, no slices)

## Summary - `scripts/iris/dev_tpu.py` still called `IrisConfig.platform()`, which was removed in #3900 - Replace with `provider_bundle().controller` to match the current `IrisConfig` API Fixes #4394 ## Test plan - [x] `uv run python scripts/iris/dev_tpu.py --help` — CLI loads without import errors - [x] `uv run pytest lib/iris/tests/test_dev_tpu.py` — 4/4 pass - [x] `uv run python scripts/iris/dev_tpu.py --config lib/iris/examples/marin.yaml --tpu-name test-dev-tpu allocate --tpu-type v6e-8` — tunnel established, job submitted to live cluster 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

rjpower added the agent-generated Created by automation/agent label Mar 20, 2026

marin-community deleted a comment from claude Bot Mar 20, 2026

rjpower added a commit that referenced this pull request Mar 20, 2026

wip: take over #3900 [Ixt1Hwre]

d09269a

rjpower force-pushed the iris-test-refactor branch from dbcd1c7 to 7731df4 Compare March 20, 2026 22:35

rjpower changed the title ~~iris: refactor K8s and GCP service layers with clean boundaries~~ iris: Service/Provider refactor with ControllerLifecycle extraction Mar 20, 2026

rjpower changed the title ~~iris: Service/Provider refactor with ControllerLifecycle extraction~~ iris: eliminate Platform abstraction, reorganize into Service + Provider layers Mar 21, 2026

rjpower mentioned this pull request Mar 21, 2026

iris: add Kind-based integration tests for K8s scheduling correctness #3940

Open

rjpower force-pushed the iris-test-refactor branch from b064d24 to 6436ae8 Compare March 23, 2026 17:13

rjpower requested a review from yonromai March 23, 2026 17:13

yonromai approved these changes Mar 23, 2026

View reviewed changes

rjpower mentioned this pull request Mar 23, 2026

iris: decompose K8sControllerProvider into focused modules #3998

Closed

2 tasks

rjpower and others added 5 commits March 23, 2026 14:49

trigger CI

18ada77

rjpower force-pushed the iris-test-refactor branch from 055f2a7 to d7cf593 Compare March 23, 2026 21:50

rjpower merged commit 1e1d2db into main Mar 23, 2026
22 of 23 checks passed

rjpower deleted the iris-test-refactor branch March 23, 2026 22:00

This was referenced Apr 3, 2026

scripts/iris/dev_tpu.py still calls removed IrisConfig.platform() #4394

Closed

[iris] Fix dev_tpu.py to use provider_bundle API #4395

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

iris: eliminate Platform abstraction, reorganize into Service + Provider layers#3900

iris: eliminate Platform abstraction, reorganize into Service + Provider layers#3900
rjpower merged 5 commits intomainfrom
iris-test-refactor

rjpower commented Mar 20, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot commented Mar 20, 2026

Uh oh!

yonromai left a comment

Uh oh!

yonromai Mar 23, 2026

Uh oh!

yonromai Mar 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rjpower commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Architecture

Key Changes

Uh oh!

chatgpt-codex-connector Bot commented Mar 20, 2026

Uh oh!

yonromai left a comment

Choose a reason for hiding this comment

Uh oh!

yonromai Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

yonromai Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rjpower commented Mar 20, 2026 •

edited

Loading