iris: eliminate Platform abstraction, reorganize into Service + Provider layers #3900

Merged
rjpower merged 5 commits into main from iris-test-refactor
Mar 23, 2026
@rjpower rjpower commented Mar 20, 2026

Eliminates the monolithic Platform protocol and reorganizes Iris infrastructure code into clean Service and Provider layers.

Architecture

| Layer | Purpose | Examples |
|-------|---------|----------|
| Service | External service wrapper + fake | GcpService/CloudGcpService/InMemoryGcpService, K8sService/CloudK8sService/InMemoryK8sService |
| Provider | Iris semantic layer | ControllerProvider, WorkerInfraProvider, TaskProvider |

Provider Hierarchy:

  • GCP: GcpControllerProvider + GcpWorkerProvider (backed by GcpService)
  • K8s: K8sControllerProvider + K8sTaskProvider (backed by K8sService, no autoscaler/VMs/slices)
  • Manual: ManualControllerProvider + ManualWorkerProvider
  • Local: LocalCluster (uses GcpWorkerProvider + InMemoryGcpService(LOCAL))
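The layering described above can be illustrated with a minimal sketch. The class names mirror those in the description, but the method names (`create_vm`, `list_vms`, `create_slice`) and their signatures are assumptions made for illustration, not the actual Iris API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class GcpService(Protocol):
    """Service layer: thin wrapper over an external API (hypothetical surface)."""

    def create_vm(self, name: str, zone: str) -> None: ...

    def list_vms(self) -> list[str]: ...


@dataclass
class InMemoryGcpService:
    """Fake service: records state in a dict instead of calling the cloud."""

    vms: dict[str, str] = field(default_factory=dict)

    def create_vm(self, name: str, zone: str) -> None:
        self.vms[name] = zone

    def list_vms(self) -> list[str]:
        return sorted(self.vms)


@dataclass
class GcpWorkerProvider:
    """Provider layer: Iris-level semantics expressed on top of a service."""

    service: GcpService
    zone: str

    def create_slice(self, group: str, count: int) -> list[str]:
        # Hypothetical naming scheme: one VM per index within the group.
        names = [f"{group}-{i}" for i in range(count)]
        for name in names:
            self.service.create_vm(name, self.zone)
        return names
```

Because the provider depends only on the service protocol, a test can pass `InMemoryGcpService()` where production code would pass a cloud-backed implementation.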

Key Changes

  • Delete platform/ directory entirely (monolithic Platform protocol, GcpPlatform, CoreweavePlatform, etc.)
  • Delete cluster/k8s/ directory (moved into providers/k8s/)
  • New providers/ package with clean boundaries:
    • providers/protocols.py — ControllerProvider, WorkerInfraProvider
    • providers/types.py — handle protocols, exceptions (InfraError), status types
    • providers/factory.py — ProviderBundle + create_provider_bundle
    • providers/gcp/ — controller, workers, handles, service, fake, bootstrap
    • providers/k8s/ — controller, tasks, service, fake, types, constants
    • providers/manual/ — controller + worker providers
    • providers/local/ — LocalCluster
  • Rename PlatformError → InfraError across all consumers
  • Autoscaler: platform: Platform → platform: WorkerInfraProvider
  • vm_lifecycle: resolve_image passed as Callable to bootstrap
  • K8s path no longer forced through GCP concepts (no VMs, no slices)
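As a rough sketch of how a `ProviderBundle` and `create_provider_bundle` might fit together: the `controller` and `workers` attributes appear elsewhere in this PR, but the `kind` string argument, the registry dict, and `workers=None` for the K8s path are assumptions for illustration only. Placeholder strings stand in for real provider objects.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class InfraError(Exception):
    """Stand-in for the exception renamed from PlatformError."""


@dataclass(frozen=True)
class ProviderBundle:
    controller: object
    # Assumption: backends without VM/slice management carry no worker provider.
    workers: Optional[object]


# Hypothetical registry keyed by backend kind.
_FACTORIES: dict[str, Callable[[], ProviderBundle]] = {
    "gcp": lambda: ProviderBundle(controller="gcp-controller", workers="gcp-workers"),
    "k8s": lambda: ProviderBundle(controller="k8s-controller", workers=None),
}


def create_provider_bundle(kind: str) -> ProviderBundle:
    """Build the bundle for one backend, raising InfraError for unknown kinds."""
    try:
        return _FACTORIES[kind]()
    except KeyError:
        raise InfraError(f"unknown provider kind: {kind!r}") from None
```

Callers then pick the piece they need, for example `create_provider_bundle("gcp").controller`, rather than going through a single monolithic object.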

@rjpower rjpower added the agent-generated Created by automation/agent label Mar 20, 2026
@marin-community marin-community deleted a comment from claude Bot Mar 20, 2026
rjpower added a commit that referenced this pull request Mar 20, 2026
@rjpower rjpower force-pushed the iris-test-refactor branch from dbcd1c7 to 7731df4 Compare March 20, 2026 22:35
@rjpower rjpower changed the title iris: refactor K8s and GCP service layers with clean boundaries iris: Service/Provider refactor with ControllerLifecycle extraction Mar 20, 2026
@rjpower rjpower changed the title iris: Service/Provider refactor with ControllerLifecycle extraction iris: eliminate Platform abstraction, reorganize into Service + Provider layers Mar 21, 2026
@rjpower rjpower force-pushed the iris-test-refactor branch from b064d24 to 6436ae8 Compare March 23, 2026 17:13
@rjpower rjpower requested a review from yonromai March 23, 2026 17:13

@yonromai yonromai left a comment

Approved with follow-ups. The provider/service split looks coherent overall, and the iris cluster suite plus the local CLI e2e test both passed in this worktree. I did find one workflow-facing regression (cluster start-smoke still calls the removed IrisConfig.platform()) and one smaller CLI diagnostics regression (iris job run --config ... no longer triggers provider debug reports).

Generated with Codex.

return self.provider_bundle().workers

return create_platform(
def provider_bundle(self):

P1: Removing IrisConfig.platform() breaks the still-live cluster start-smoke path. cluster_start_smoke() in lib/iris/src/iris/cli/cluster.py still calls iris_config.platform(), and both smoke workflows shell out to that command, so this now fails with AttributeError before the controller starts. Either keep a temporary compatibility shim for platform() in this PR or switch start-smoke over to provider_bundle().controller like the other CLI entry points.

Generated with Codex.
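The first option in the comment above, a temporary compatibility shim, might look roughly like the following. `IrisConfig` here is a toy stand-in with placeholder return values, and the deprecation warning is an assumption; only the `platform()` and `provider_bundle().controller` names come from the review.

```python
import warnings


class _Bundle:
    """Toy bundle holding placeholder provider values."""

    def __init__(self, controller, workers):
        self.controller = controller
        self.workers = workers


class IrisConfig:
    """Toy stand-in for the real config class."""

    def provider_bundle(self):
        return _Bundle(controller="controller-provider", workers="worker-provider")

    def platform(self):
        # Shim: keep the old entry point alive while callers migrate.
        warnings.warn(
            "IrisConfig.platform() is deprecated; use provider_bundle().controller",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.provider_bundle().controller
```

A shim like this keeps older callers working for a transition period, at the cost of leaving the removed name in the public surface a little longer.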

platform = iris_config.platform()
ctx.obj["platform"] = platform
bundle = iris_config.provider_bundle()
ctx.obj["provider_bundle"] = bundle

P2: require_controller_url() now stores provider_bundle on ctx.obj, but iris.cli.job.run still looks up ctx.obj["platform"] before calling debug_report(). That silently drops the provider post-mortem on failed iris job run --config ... invocations. Keeping the old key as an alias, or updating job.py to read bundle.controller, would preserve the existing diagnostics behavior.

Generated with Codex.
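One way to preserve the diagnostics described above is a small lookup helper that prefers the new key and falls back to the old one. The helper name `controller_for_debug_report` is hypothetical, and a plain dict stands in for `ctx.obj`.

```python
from types import SimpleNamespace
from typing import Any, Optional


def controller_for_debug_report(obj: dict[str, Any]) -> Optional[Any]:
    """Return the provider to use for a post-failure report, if any."""
    bundle = obj.get("provider_bundle")
    if bundle is not None:
        return bundle.controller
    # Fall back to the legacy key so older code paths keep their diagnostics.
    return obj.get("platform")


# Usage with a stand-in for ctx.obj:
ctx_obj = {"provider_bundle": SimpleNamespace(controller="controller-provider")}
provider = controller_for_debug_report(ctx_obj)
```

Returning `None` when neither key is present lets the caller skip the report explicitly rather than failing on a missing key.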

rjpower and others added 5 commits March 23, 2026 14:49
…n, and test cleanup

- Extract GcpService/K8sService protocols with DRY_RUN/LOCAL/CLOUD modes
- Reorganize providers into providers/{gcp,k8s,manual}/ structure
- Eliminate Platform abstraction, reorganize into Service + Provider layers
- Replace FakeGcloud/FakeKubectl/mock_kubectl with real service implementations
- Parameterized GCP+K8s test harness, K8s LOCAL mode, e2e migration
- Split and merge test files for better focus (autoscaler, scheduler, transitions)
- Delete dead code: FakeWorkerProvider, LocalPlatform, SubprocessK8s escape hatch
- cluster start-smoke: replace removed iris_config.platform() with
  provider_bundle().controller (fixes cloud-smoke-test CI failure)
- job run: read ctx.obj["provider_bundle"] instead of ctx.obj["platform"]
  for post-failure debug_report()
- Extract _spawn_bootstrap_thread() to deduplicate identical bootstrap
  thread launching in _create_tpu_slice() and _create_vm_slice()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TPU slice names were constructed from raw scale group names which
contain underscores and zone suffixes (e.g. tpu_v5e_16-europe-west4-b),
producing names that exceed the 63-char GCE limit and contain invalid
characters. The VM slice path already had _build_vm_slice_id for this.

Rename _build_vm_slice_id -> _build_gce_resource_name and use it for
all slice types (TPU, VM, local) so names are normalized to lowercase
alphanumeric + hyphens and truncated to fit within GCE limits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
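The normalization this commit describes can be sketched as follows. This is not the actual `_build_gce_resource_name`; the function name `normalize_gce_name` and its exact rules are assumptions based on the commit message (lowercase alphanumerics plus hyphens, at most 63 characters).

```python
import re

GCE_NAME_MAX_LEN = 63


def normalize_gce_name(raw: str, max_len: int = GCE_NAME_MAX_LEN) -> str:
    """Map an arbitrary string onto lowercase letters, digits, and hyphens."""
    # Replace each run of disallowed characters with a single hyphen.
    name = re.sub(r"[^a-z0-9-]+", "-", raw.lower())
    # Collapse repeated hyphens, then trim hyphens at the edges.
    name = re.sub(r"-{2,}", "-", name).strip("-")
    # Truncate, and make sure truncation did not leave a trailing hyphen.
    name = name[:max_len].rstrip("-")
    if not name or not name[0].isalpha():
        raise ValueError(f"cannot derive a valid resource name from {raw!r}")
    return name


example = normalize_gce_name("tpu_v5e_16-europe-west4-b")
```

Plain truncation can make two long inputs collide on the same name; a real helper may need a uniqueness suffix, which this sketch leaves out.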
InMemoryGcpService.create_local_slice() was not validating slice names
against GCE naming rules, so local tests silently accepted names with
underscores or >63 chars that would fail on real GCP. Add validation so
local mode has the same fidelity as cloud mode.

Add test_smoke_gcp_config_boots_locally: loads the real smoke-gcp.yaml,
converts to local mode, and verifies workers join. This catches naming
issues from zone expansion without needing real GCP resources.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
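A fake that enforces the same naming rules as the real service might validate names along these lines. `InMemorySliceStore` and `validate_gce_name` are hypothetical names, and the regular expression reflects the usual GCE convention (a leading lowercase letter, then lowercase letters, digits, or hyphens, no trailing hyphen, at most 63 characters) rather than the code in this PR.

```python
import re

# First char: lowercase letter; up to 61 middle chars; last char: letter or digit.
_GCE_NAME_RE = re.compile(r"[a-z]([-a-z0-9]{0,61}[a-z0-9])?")


def validate_gce_name(name: str) -> None:
    """Raise ValueError if the name would be rejected by the naming rules."""
    if not _GCE_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid GCE resource name: {name!r}")


class InMemorySliceStore:
    """Toy fake that rejects names the real service would reject."""

    def __init__(self) -> None:
        self.slices: set[str] = set()

    def create_local_slice(self, name: str) -> None:
        validate_gce_name(name)
        self.slices.add(name)
```

Validating in the fake means a local test fails on a bad name at creation time, instead of the problem first surfacing against real cloud resources.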
@rjpower rjpower force-pushed the iris-test-refactor branch from 055f2a7 to d7cf593 Compare March 23, 2026 21:50
@rjpower rjpower merged commit 1e1d2db into main Mar 23, 2026
22 of 23 checks passed
@rjpower rjpower deleted the iris-test-refactor branch March 23, 2026 22:00
rjpower added a commit that referenced this pull request Mar 24, 2026
PR #3900 moved LocalCluster from iris.cluster.local_cluster to
iris.cluster.providers.local.cluster. Update the two test files that
still referenced the old path.
yonromai added a commit that referenced this pull request Apr 3, 2026
## Summary

- `scripts/iris/dev_tpu.py` still called `IrisConfig.platform()`, which
was removed in #3900
- Replace with `provider_bundle().controller` to match the current
`IrisConfig` API

Fixes #4394

## Test plan

- [x] `uv run python scripts/iris/dev_tpu.py --help` — CLI loads without
import errors
- [x] `uv run pytest lib/iris/tests/test_dev_tpu.py` — 4/4 pass
- [x] `uv run python scripts/iris/dev_tpu.py --config
lib/iris/examples/marin.yaml --tpu-name test-dev-tpu allocate --tpu-type
v6e-8` — tunnel established, job submitted to live cluster

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Romain Yon <1596570+yonromai@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
…der layers (#3900)
Helw150 pushed a commit that referenced this pull request Apr 8, 2026