Add CoreWeave CI workflow for Iris PRs #4174
Merged
19 commits
1a6bc2e research: iris controller dry-run mode analysis (rjpower)
5710dc5 Add CoreWeave CI workflow for Iris PRs (rjpower)
87830a3 Simplify: fix health check timeout, remove dead env var, trim comments (rjpower)
4c66d88 Remove analysis document from PR (rjpower)
b6e9cb3 Use `cluster start` instead of `cluster controller restart` for CI (rjpower)
a6f0c99 Fix managed label selector, add fork guard, cleanup (rjpower)
ce841e8 Add docker/setup-buildx-action to fix image build in CI (rjpower)
42c95f4 fix: address PR review feedback on coreweave CI workflow (rjpower)
6e45031 fix: use /health not /healthz for controller health check (rjpower)
ecdb5af Remove test_port_allocation from integration tests (github-actions[bot])
2fc1593 Fix exec_in_container for K8s direct provider (github-actions[bot])
ab18b7c fix: use S3 prefix for marin-on-iris test so remote pods can access data (rjpower)
eb2f779 fix: gate S3 usage on MARIN_CI_S3_PREFIX env var (rjpower)
e03ff4f fix: submit executor as Iris job so child jobs inherit S3 env vars (rjpower)
78c6c61 fix: only submit executor as Iris job on remote clusters (rjpower)
1ecd513 fix: replace datasets.load_dataset with fsspec in read_dataset_streaming (rjpower)
49201af fix: use os.makedirs for local /tmp path in classifier model download (rjpower)
c54fb0d fix: remove fasttext classifier steps from integration test (rjpower)
f586ed8 refactor: run marin-on-iris test as standalone script for streaming logs (rjpower)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,202 @@ | ||
| name: Iris - CoreWeave CI | ||
| | ||
| on: | ||
| pull_request: | ||
| types: [opened, synchronize] | ||
| paths: | ||
| - "lib/iris/**" | ||
| issue_comment: | ||
| types: [created] | ||
| workflow_dispatch: | ||
| | ||
| permissions: | ||
| contents: read | ||
| packages: write | ||
| pull-requests: read # needed for issue_comment to access PR metadata | ||
| statuses: write # post commit status from issue_comment trigger | ||
| | ||
| # Single concurrency group — only one CW CI run at a time across all PRs. | ||
| # The warm cluster is shared; concurrent runs would conflict. | ||
| concurrency: | ||
| group: iris-coreweave-ci | ||
| cancel-in-progress: false | ||
| | ||
| jobs: | ||
| cw-ci-test: | ||
| if: >- | ||
| (github.event_name == 'pull_request' && github.event.pull_request.head.repo.full_name == github.repository) || | ||
| github.event_name == 'workflow_dispatch' || | ||
| ( | ||
| github.event_name == 'issue_comment' && | ||
| github.event.issue.pull_request && | ||
| contains(github.event.comment.body, '/iris-ci-cw') && | ||
| ( | ||
| github.event.comment.author_association == 'MEMBER' || | ||
| github.event.comment.author_association == 'COLLABORATOR' || | ||
| github.event.comment.author_association == 'OWNER' | ||
| ) | ||
| ) | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 60 | ||
| env: | ||
| IRIS_NAMESPACE: iris-ci | ||
| # Must match Labels(label_prefix).iris_managed from the cluster config | ||
| IRIS_MANAGED_LABEL: iris-iris-ci-managed | ||
| steps: | ||
| - name: Checkout code | ||
| uses: actions/checkout@v4 | ||
| with: | ||
| ref: ${{ github.event_name == 'issue_comment' && format('refs/pull/{0}/head', github.event.issue.number) || '' }} | ||
| | ||
| - name: Set commit status to pending | ||
| if: github.event_name == 'issue_comment' | ||
| env: | ||
| GH_TOKEN: ${{ github.token }} | ||
| run: | | ||
| sha=$(git rev-parse HEAD) | ||
| gh api repos/${{ github.repository }}/statuses/"$sha" \ | ||
| -f state=pending \ | ||
| -f context="Iris CoreWeave CI" \ | ||
| -f target_url="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" || true | ||
| | ||
| - name: Set up Python 3.12 | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.12" | ||
| | ||
| - name: Install uv | ||
| uses: astral-sh/setup-uv@v7 | ||
| with: | ||
| enable-cache: true | ||
| cache-dependency-glob: "lib/iris/pyproject.toml" | ||
| | ||
| - name: Write kubeconfig | ||
| run: | | ||
| mkdir -p ~/.kube | ||
| echo "${{ secrets.CW_KUBECONFIG }}" > ~/.kube/coreweave-iris | ||
| chmod 600 ~/.kube/coreweave-iris | ||
| | ||
| - name: Log in to GitHub Container Registry | ||
| uses: docker/login-action@v3 | ||
| with: | ||
| registry: ghcr.io | ||
| username: ${{ github.actor }} | ||
| password: ${{ secrets.GITHUB_TOKEN }} | ||
| | ||
| - name: Set up Docker Buildx | ||
| uses: docker/setup-buildx-action@v3 | ||
| | ||
| # Delete stale worker pods so the autoscaler recreates them with fresh images. | ||
| # Nodepools (and their underlying nodes) survive — this is the "warm start". | ||
| - name: Reset worker pods | ||
| run: | | ||
| export KUBECONFIG=~/.kube/coreweave-iris | ||
| kubectl delete pods -n "$IRIS_NAMESPACE" -l "$IRIS_MANAGED_LABEL=true" --grace-period=0 --ignore-not-found || true | ||
| | ||
| # Rebuild images and (re)start the controller. `cluster start` is fully | ||
| # idempotent on K8s: it applies namespace/RBAC/ConfigMap/Deployment/Service | ||
| # and triggers a rollout restart, so both cold starts and warm restarts | ||
| # work without needing to tunnel to an existing controller first. | ||
| - name: Start controller | ||
| env: | ||
| R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }} | ||
| R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }} | ||
| run: | | ||
| cd lib/iris && uv run --group dev iris -v \ | ||
| --config=examples/coreweave-ci.yaml \ | ||
| cluster start | ||
| | ||
| - name: Run integration tests | ||
| env: | ||
| WANDB_MODE: disabled | ||
| WANDB_API_KEY: "" | ||
| JAX_TRACEBACK_FILTERING: off | ||
| # When set, the marin-on-iris test uploads fixtures and writes | ||
| # intermediate data to S3 (R2) so remote Zephyr pods can access them. | ||
| MARIN_CI_S3_PREFIX: s3://marin-na/temp/ci | ||
| AWS_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }} | ||
| AWS_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }} | ||
| AWS_ENDPOINT_URL: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com | ||
| FSSPEC_S3: '{"endpoint_url": "https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com"}' | ||
| run: | | ||
| export KUBECONFIG=~/.kube/coreweave-iris | ||
| kubectl port-forward -n "$IRIS_NAMESPACE" svc/iris-ci-controller-svc 10000:10000 & | ||
| PF_PID=$! | ||
| echo "PF_PID=$PF_PID" >> "$GITHUB_ENV" | ||
| | ||
| IRIS_CONTROLLER_URL="http://localhost:10000" | ||
| | ||
| # Controller deployment is already confirmed ready by `cluster start`; | ||
| # this just waits for the port-forward to be usable. | ||
| HEALTHY=false | ||
| for i in $(seq 1 60); do | ||
| if ! kill -0 "$PF_PID" 2>/dev/null; then | ||
| echo "port-forward process died unexpectedly" | ||
| exit 1 | ||
| fi | ||
| if curl -sf "$IRIS_CONTROLLER_URL/health" > /dev/null 2>&1; then | ||
| HEALTHY=true | ||
| break | ||
| fi | ||
| sleep 5 | ||
| done | ||
| if [ "$HEALTHY" != "true" ]; then | ||
| echo "Controller did not become healthy within timeout" | ||
| exit 1 | ||
| fi | ||
| | ||
| uv run pytest tests/integration/iris/ \ | ||
| --controller-url "$IRIS_CONTROLLER_URL" \ | ||
| -v --tb=short --timeout=600 \ | ||
| -o "addopts=" \ | ||
| -x | ||
| | ||
| - name: Run full integration pipeline | ||
| env: | ||
| WANDB_MODE: disabled | ||
| WANDB_API_KEY: "" | ||
| JAX_TRACEBACK_FILTERING: off | ||
| MARIN_CI_S3_PREFIX: s3://marin-na/temp/ci | ||
| AWS_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }} | ||
| AWS_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }} | ||
| AWS_ENDPOINT_URL: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com | ||
| FSSPEC_S3: '{"endpoint_url": "https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com"}' | ||
| run: | | ||
| IRIS_CONTROLLER_URL="http://localhost:10000" | ||
| timeout 600 uv run tests/integration/iris/run_iris_full_integration.py \ | ||
| --controller-url "$IRIS_CONTROLLER_URL" | ||
| | ||
| - name: Stop port-forward | ||
| if: always() | ||
| run: | | ||
| [ -n "$PF_PID" ] && kill "$PF_PID" 2>/dev/null || true | ||
| pkill -f "kubectl port-forward.*$IRIS_NAMESPACE" 2>/dev/null || true | ||
| | ||
| - name: Capture failure diagnostics | ||
| if: failure() | ||
| run: | | ||
| export KUBECONFIG=~/.kube/coreweave-iris | ||
| echo "=== Controller logs ===" | ||
| kubectl -n "$IRIS_NAMESPACE" logs -l app=iris-controller --tail=500 || true | ||
| echo "=== Controller pod describe ===" | ||
| kubectl -n "$IRIS_NAMESPACE" describe pod -l app=iris-controller || true | ||
| echo "=== Worker pods ===" | ||
| kubectl -n "$IRIS_NAMESPACE" get pods -l "$IRIS_MANAGED_LABEL=true" || true | ||
| echo "=== Warning events ===" | ||
| kubectl -n "$IRIS_NAMESPACE" get events --sort-by='.lastTimestamp' --field-selector type!=Normal || true | ||
| | ||
| - name: Set commit status to result | ||
| if: always() && github.event_name == 'issue_comment' | ||
| env: | ||
| GH_TOKEN: ${{ github.token }} | ||
| run: | | ||
| sha=$(git rev-parse HEAD) | ||
| if [ "${{ job.status }}" = "success" ]; then | ||
| state=success | ||
| else | ||
| state=failure | ||
| fi | ||
| gh api repos/${{ github.repository }}/statuses/"$sha" \ | ||
| -f state="$state" \ | ||
| -f context="Iris CoreWeave CI" \ | ||
| -f target_url="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" | ||
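
The port-forward wait loop in the "Run integration tests" step is a bounded retry. As a minimal sketch of that pattern (not part of this PR), it could be factored into a helper; the name `wait_until` and its argument order are hypothetical.

```shell
# Hypothetical helper (not in the PR): retry a command until it succeeds
# or the attempt budget is exhausted. POSIX sh compatible; the variables
# below are global because POSIX sh has no `local`.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Discard the command's output; only its exit status matters.
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For instance, `wait_until 60 5 curl -sf "$IRIS_CONTROLLER_URL/health"` would mirror the 60 attempts at 5-second intervals used in the workflow. The original loop also checks on every iteration that the port-forward process is still alive (`kill -0 "$PF_PID"`), which this sketch does not do.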
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,91 @@ | ||
| # Persistent CoreWeave CI cluster. Both scale groups are pinned at min=max=1 | ||
| # so nodes stay warm between runs — only controller and worker pods are reset. | ||
| | ||
| platform: | ||
| label_prefix: iris-ci | ||
| coreweave: | ||
| region: US-WEST-04A | ||
| namespace: iris-ci | ||
| kubeconfig_path: ~/.kube/coreweave-iris | ||
| object_storage_endpoint: https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com | ||
| | ||
| storage: | ||
| remote_state_dir: s3://marin-na/iris/state/ci | ||
| | ||
| kubernetes_provider: | ||
| namespace: iris-ci | ||
| default_image: ghcr.io/marin-community/iris-task:latest | ||
| host_network: true | ||
| cache_dir: /mnt/local/iris-cache | ||
| controller_address: http://iris-ci-controller-svc.iris-ci.svc.cluster.local:10000 | ||
| | ||
| controller: | ||
| image: ghcr.io/marin-community/iris-controller:latest | ||
| coreweave: | ||
| port: 10000 | ||
| service_name: iris-ci-controller-svc | ||
| scale_group: cpu-erapids | ||
| | ||
| defaults: | ||
| autoscaler: | ||
| evaluation_interval: | ||
| milliseconds: 10000 | ||
| scale_up_delay: | ||
| milliseconds: 60000 | ||
| scale_down_delay: | ||
| milliseconds: 300000 | ||
| startup_grace_period: | ||
| milliseconds: 1200000 # 20 min — nodes are pinned warm so this rarely fires | ||
| task_env: | ||
| MARIN_PREFIX: s3://marin-na/marin | ||
| worker: | ||
| docker_image: ghcr.io/marin-community/iris-worker:latest | ||
| port: 10001 | ||
| cache_dir: /mnt/local/iris-cache | ||
| runtime: kubernetes | ||
| default_task_image: ghcr.io/marin-community/iris-task:latest | ||
| | ||
| scale_groups: | ||
| cpu-erapids: | ||
| num_vms: 1 | ||
| resources: | ||
| cpu: 64 | ||
| ram: 256GB | ||
| disk: 1TB | ||
| device_type: cpu | ||
| preemptible: false | ||
| worker: | ||
| attributes: | ||
| region: US-WEST-04A | ||
| pool: cpu-erapids | ||
| min_slices: 1 | ||
| max_slices: 1 | ||
| priority: 50 | ||
| slice_template: | ||
| num_vms: 1 | ||
| coreweave: | ||
| region: US-WEST-04A | ||
| instance_type: cd-gp-i64-erapids | ||
| | ||
| h100-8x: | ||
| num_vms: 1 | ||
| resources: | ||
| cpu: 128 | ||
| ram: 2048GB | ||
| disk: 1TB | ||
| device_type: gpu | ||
| device_variant: H100 | ||
| device_count: 8 | ||
| preemptible: false | ||
| worker: | ||
| attributes: | ||
| region: US-WEST-04A | ||
| pool: h100-8x | ||
| min_slices: 1 | ||
| max_slices: 1 | ||
| priority: 100 | ||
| slice_template: | ||
| num_vms: 1 | ||
| coreweave: | ||
| region: US-WEST-04A | ||
| instance_type: gd-8xh100ib-i128 |
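
The workflow comment says `IRIS_MANAGED_LABEL` must match `Labels(label_prefix).iris_managed` from this cluster config. A small drift check could catch a mismatch before any pods are deleted. The sketch below is not part of the PR: the function name is hypothetical, and the `iris-<label_prefix>-managed` form is an assumption inferred only from the two values visible here (`iris-ci` and `iris-iris-ci-managed`), not confirmed against the Iris source.

```shell
# Hypothetical check (not in the PR).
# ASSUMPTION: the managed label has the form "iris-<label_prefix>-managed".
# Usage: expected_managed_label <path-to-cluster-config.yaml>
expected_managed_label() {
  # Take the first `label_prefix:` value; this simple parse does not handle
  # trailing comments or quoted values.
  prefix=$(sed -n 's/^[[:space:]]*label_prefix:[[:space:]]*//p' "$1" | head -n 1)
  [ -n "$prefix" ] || return 1
  printf 'iris-%s-managed\n' "$prefix"
}
```

A workflow step could then compare the result with `$IRIS_MANAGED_LABEL` and exit non-zero when they differ.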
This step invokes `uv run pytest ... --timeout=600` from the repo root, but the workflow never installs dev/test dependencies for the root workspace (unlike `.github/workflows/iris-integration.yaml`, which runs `uv sync ... --group dev --extra=cpu --extra=dedup` first). In this configuration, required pytest plugins/deps (notably `pytest-timeout` for `--timeout`) may be missing, so the job can fail before executing the integration suite.
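
One way to surface the problem described above early is a preflight import check before the test step. This is a minimal sketch rather than the fix adopted in the PR: `require_module` is a hypothetical helper, and the only external fact it relies on is that the `pytest-timeout` package is imported as `pytest_timeout`.

```shell
# Hypothetical preflight (not in the PR): fail fast if a Python module
# cannot be imported by the given interpreter command.
# Usage: require_module <module> [interpreter command...]
require_module() {
  module="$1"
  shift
  # Default to python3 when no interpreter command is given.
  [ "$#" -gt 0 ] || set -- python3
  if ! "$@" -c "import ${module}" > /dev/null 2>&1; then
    echo "missing Python module: ${module}" >&2
    return 1
  fi
}
```

In the workflow this might be called as `require_module pytest_timeout uv run python` from the repo root, so that a missing plugin produces a clear message instead of a pytest argument error.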