Conversation
Analyze controller startup flow, scheduling loop, worker sync, autoscaler, and checkpoint systems. Identify all side-effectful operations that need gating for a --dry-run flag.
New workflow runs the Iris integration test suite against a warm CoreWeave cluster on every PR touching lib/iris/. Unlike the canary runs, nodepools persist between runs for fast startup — only controller and worker pods are reset via `cluster controller restart`. Files: - .github/workflows/iris-coreweave-ci.yaml: PR-triggered workflow - lib/iris/examples/coreweave-ci.yaml: cluster config with min=max=1 slices
|
Claude finished @rjpower's task in 1m 4s —— View job Review: PR #4174 — Add CoreWeave CI workflow for Iris PRsAdds a GitHub Actions workflow to run the Iris integration test suite against a persistent CoreWeave cluster, plus a cluster config and a dry-run analysis doc. The workflow and cluster config are well-structured and consistent with the existing CW smoke workflow patterns. Findings:
Summary: The workflow is solid and follows established patterns from the existing CW smoke workflow. The main actionable item is removing or relocating |
Delete iris-controller-dry-run-analysis.md — research artifact that should not be in the PR.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87830a3191
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| uv run pytest tests/integration/iris/ \ | ||
| --controller-url "$IRIS_CONTROLLER_URL" \ | ||
| -v --tb=short --timeout=600 \ |
There was a problem hiding this comment.
Install test deps before running integration pytest
This step invokes uv run pytest ... --timeout=600 from the repo root, but the workflow never installs dev/test dependencies for the root workspace (unlike .github/workflows/iris-integration.yaml, which runs uv sync ... --group dev --extra=cpu --extra=dedup first). In this configuration, required pytest plugins/deps (notably pytest-timeout for --timeout) may be missing, so the job can fail before executing the integration suite.
Useful? React with 👍 / 👎.
| github.event_name == 'pull_request' || | ||
| github.event_name == 'workflow_dispatch' || |
There was a problem hiding this comment.
Restrict secret-dependent path for fork pull requests
The job runs unconditionally for pull_request events, but later steps require repo secrets (CW_KUBECONFIG, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY). For PRs opened from forks, GitHub does not expose these secrets, so this path will fail systematically for external contributors; add a same-repo guard (as used in other workflows) or limit this trigger to trusted/manual entry points.
Useful? React with 👍 / 👎.
`cluster controller restart` first tries to tunnel to the existing controller, which hangs for 90s when the namespace doesn't exist (cold start). On K8s, `restart_controller()` just delegates to `start_controller()` anyway, so `cluster start` achieves the same result without the tunnel overhead.
- Label selector used `iris.managed` but Labels(iris-ci) generates `iris-iris-ci-managed` — pod reset was silently matching nothing. - Add fork guard to skip PR runs from forks (secrets would be empty). - Add --grace-period=0 to pod deletion for faster CI cleanup. - Fix misleading startup_grace_period comment.
The `iris cluster start` command uses `docker buildx build --push` with registry cache, which requires the buildx driver. GitHub Actions' default docker driver doesn't support cache export. Adding setup-buildx-action@v3 matches what docker-images.yaml does. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
🤖 Fixed the docker buildx error: added The |
- Restrict pull_request trigger to types: [opened, synchronize] to avoid unnecessary runs on label/assignment changes - Add pull-requests: read permission for issue_comment trigger - Detect port-forward process death during health check loop - Use pkill fallback in cleanup step for robustness when PF_PID is empty Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Iris controller exposes /health (not /healthz). The wrong endpoint caused the health poll to loop for 5 minutes then fail.
|
🤖 Root cause of the CI stall: the health check was polling Verified by connecting to the running CoreWeave cluster:
|
|
@claude the Coreweave CI test is failing in the port allocation test: tests/integration/iris/test_iris_integration.py::test_port_allocation FAILED [ 33%] =================================== FAILURES =================================== -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
let's just remove that test -- in practice we're using host networking in a VPC so we don't use ports. |
|
Claude finished @rjpower's task in 1m 17s —— View job Removed
|
Port allocation is not used in practice since we use host networking in a VPC. The test was failing on CoreWeave CI. Co-authored-by: Russell Power <rjpower@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@claude it looks like exec_in_container isn't implemented correctly for the k8s provider: Iris - CoreWeave CI / cw-ci-test (pull_request) tests/integration/iris/test_iris_integration.py::test_profile_running_task SKIPPED [ 64%] =================================== FAILURES =================================== can you fix? also why do i see a "local cluster" here? we should be directly using the CW via the @coreweave-ci.yaml configuration |
|
Claude finished @rjpower's task in 7m 29s —— View job Fixed
|
The controller's exec_in_container RPC assumed worker daemons exist, but the K8s provider uses a direct provider model (no workers). Add exec_in_container to K8sTaskProvider using kubectl exec, and update the controller service to fall back to the direct provider when no worker_id is set — matching the existing pattern used by profile_task. Co-authored-by: Russell Power <rjpower@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test was using a local tmpdir as MARIN_PREFIX, but Zephyr coordinator jobs run on the CW cluster and can't access the CI runner's filesystem. Switch to s3://marin-na/temp/ci/<run-id>, upload synth data to S3, and add R2 credentials to the CI test step.
Local CI runs don't have R2 credentials, so fall back to a local tmpdir when MARIN_CI_S3_PREFIX is not set. The CW workflow sets it.
The marin-on-iris integration test was running executor_main in-process on the CI runner. Child Zephyr jobs on remote pods had no S3 credentials because Iris env propagation only works parent→child, and there was no parent Iris job. Submit the whole executor as an Iris job with S3 env vars so Zephyr coordinator/workers inherit them automatically. Also enable live log output (-s --log-cli-level=INFO) for the local iris integration test workflow.
The local integration test runs in-process (local filesystem is accessible), while the CoreWeave test submits the executor as an Iris job so child Zephyr jobs inherit S3 credentials via auto-propagation.
The HuggingFace datasets library injects requote_redirect_url=False into client_kwargs via its CompressionFilesystem. This aiohttp kwarg gets forwarded to aiobotocore.session.create_client() which rejects it, breaking all S3-backed reads (e.g. on R2/CoreWeave CI). Use fsspec open_url directly for jsonl and pyarrow for parquet — no need for the datasets library here.
fs.makedirs("/tmp/...") on an S3FileSystem tries to CreateBucket,
which fails with AccessDenied on R2. The /tmp path is always local.
The classifier code (train + inference) uses patterns that don't work in distributed S3 environments: fs.makedirs on local /tmp paths via S3FileSystem, HF datasets injecting aiohttp kwargs into botocore, and single-node FileLock assumptions. Remove these steps from the integration test so the CW CI can pass. See #4183 for the tracking issue.
Replace the pytest wrapper with a standalone script that streams logs in real time. The test takes ~10 minutes and pytest swallows all output until completion, making failures hard to diagnose. The script runs as its own workflow step with `stream_logs=True` on the Iris job handle, so executor and child job output appears immediately. Inlines create_steps rather than importing from tests.integration_test to avoid sys.path hacks and to allow independent evolution of the pipeline steps for this test.
Delete the ephemeral smoke-cw workflow, dev-restart workflow, multihost canary config and script, and smoke-cw cluster config. All replaced by the always-on coreweave-ci cluster added in #4174. Rewrite run_smoke_cw.sh as a thin kubectl port-forward wrapper over pytest tests/integration/iris/. Deleted: iris-cloud-smoke-cw.yaml, iris-coreweave-dev-restart.yaml, smoke-cw.yaml, coreweave-canary-multihost.yaml, coreweave_multihost.py.
- New workflow `.github/workflows/iris-coreweave-ci.yaml` runs the Iris integration test suite (`tests/integration/iris/`) against a persistent CoreWeave cluster on every PR touching `lib/iris/**` - New cluster config `lib/iris/examples/coreweave-ci.yaml` with CPU and GPU×8 scale groups pinned at min=max=1 slice for warm starts - Unlike the existing CW smoke/canary workflows, nodepools are **not** torn down between runs — only controller and worker pods are reset via `cluster controller restart` - Concurrency limit of 1 (shared warm cluster); triggered by PR paths, `/iris-ci-cw` comment, or manual dispatch
Delete the ephemeral smoke-cw workflow, dev-restart workflow, multihost canary config and script, and smoke-cw cluster config. All replaced by the always-on coreweave-ci cluster added in #4174. Rewrite run_smoke_cw.sh as a thin kubectl port-forward wrapper over pytest tests/integration/iris/. Deleted: iris-cloud-smoke-cw.yaml, iris-coreweave-dev-restart.yaml, smoke-cw.yaml, coreweave-canary-multihost.yaml, coreweave_multihost.py.
Summary
.github/workflows/iris-coreweave-ci.yamlruns the Iris integration test suite (tests/integration/iris/) against a persistent CoreWeave cluster on every PR touchinglib/iris/**lib/iris/examples/coreweave-ci.yamlwith CPU and GPU×8 scale groups pinned at min=max=1 slice for warm startscluster controller restart/iris-ci-cwcomment, or manual dispatchDesign
The workflow:
iris cluster controller restartwhich rebuilds images and restarts the controller (or starts fresh on first run)kubectl port-forwardtunnel to the controller serviceiris-integration.yamlagainst the real CW clusterTest plan
lib/iris//iris-ci-cwcomment trigger works🤖 Generated with Claude Code