iris, fray: reject TPU requests whose chip count doesn't match VM shape by rjpower · Pull Request #4791 · marin-community/marin

rjpower · 2026-04-15T20:20:15Z

Summary

A TPU VM is the atomic scheduling unit, but neither the Fray client nor the Iris controller rejected requests where the per-replica chip count differed from the variant's chips_per_vm. The concrete failure mode from the user report:

I keep getting scheduled to nodes that already have something running on them... This particular job only needs 4 chips and says so in the request, but I included v6e-8 as a possible TPU type to claim.

A job submitted with device_alternatives=[\"v6e-4\", \"v6e-8\"] passed the old vm_count-only check (both are vm_count=1), reserved 4 chips per replica against the primary, and then landed on a v6e-8 worker that advertises 8 chips. The scheduler saw 4 free chips and co-scheduled a second 4-chip job onto the same indivisible VM — two tenants colliding on one JAX host.

The diagram:

with_tpu([\"v6e-4\", \"v6e-8\"])  ← old check passes (vm_count both = 1)
  primary = v6e-4 → reserve chips_per_vm = 4
  ↓
lands on v6e-8 worker (advertises 8 chips, 1 VM)
  reserved=4, free=4 ✘ scheduler thinks VM is half-free
  → second 4-chip job co-scheduled onto the same VM → collision

Tighten validation at both ends:

fray (ResourceConfig.with_tpu): candidates must share both vm_count and chips_per_vm. [v4-8, v5p-8] (both 1×4) still works; [v6e-4, v6e-8] (1×4 vs 1×8) now fails fast with a clear message.
iris (launch_job ingestion): new validate_tpu_request() helper runs right after constraint injection and rejects chip-count / VM-shape mismatches with INVALID_ARGUMENT, so older or hand-rolled clients can't bypass the fray-side check.

Test plan

uv run pytest lib/fray/tests/test_v2_iris.py (24 pass; new cases for v6e-4 + v6e-8 rejection and v4-8 + v5p-8 acceptance)
uv run pytest lib/iris/tests/cluster/controller/test_service.py lib/iris/tests/cluster/test_constraints.py (82 pass; new cases for chip-count mismatch, mixed-shape rejection, matched-shape acceptance)
./infra/pre-commit.py --all-files --fix

Closes the user report from Michael Ryan re: v6e-8 co-scheduling collision.

A TPU VM is the atomic scheduling unit, but neither the Fray client nor the Iris controller rejected requests where the per-replica chip count differed from the variant's chips_per_vm. The concrete failure mode: a job submitted with device_alternatives=["v6e-4", "v6e-8"] (both have vm_count=1, so the old check passed) would reserve 4 chips per replica but land on a v6e-8 worker that advertises 8 chips. The scheduler then saw 4 free chips and co-scheduled a second 4-chip job onto the same indivisible VM, causing two tenants to collide on one JAX host. Tighten validation at both ends: - fray: `ResourceConfig.with_tpu()` now requires candidates to share both vm_count and chips_per_vm. Mixing v4-8 and v5p-8 (both 1x4) still works; mixing v6e-4 and v6e-8 (1x4 vs 1x8) now fails fast. - iris: `launch_job` runs a new `validate_tpu_request()` check at RPC ingestion that rejects chip-count/VM-shape mismatches with INVALID_ARGUMENT, so older or hand-rolled clients can't bypass the fray-side check.

claude · 2026-04-15T20:20:34Z

Claude finished @rjpower's task in 2m 37s —— View job

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Notes from the review:

Validation logic in validate_tpu_request correctly mirrors the tpu_device default (count == chips_per_vm), so the per-replica chip-count check at lib/iris/src/iris/cluster/constraints.py:761 matches the proto contract.
The local from iris.cluster.types import get_tpu_topology at line 740 is justified — iris.cluster.types already imports Constraint from iris.cluster.constraints, so a top-level import would create a cycle (matches existing precedent at line 1140).
Defense-in-depth at both fray (with_tpu) and iris (launch_job) is appropriate; older or hand-rolled clients can't bypass the check.
Test coverage covers the three meaningful axes: chip-count mismatch, mixed-shape alternatives rejection, and matched-shape (cross-generation) acceptance.
Check PR review-ability
Identify relevant AGENTS.md files
Summarize PR changes
Run review (bugs + AGENTS.md compliance)
Validate findings
Post review
• Branch

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 027be1e593

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

An explicit `device-variant` constraint is authoritative for scheduling (it replaces the auto-generated constraint from the primary variant), so checking chips_per_vm only against the primary missed the case where a request with primary v6e-4 and `device-variant EQ v6e-8` would still schedule onto a v6e-8 VM while reserving only 4 of its 8 chips. Loop the chip-count check over every effective candidate, and add a regression test for the primary-v6e-4 / constraint-v6e-8 override case.

rjpower · 2026-04-15T20:43:16Z

🤖 Re Codex P1 on constraints.py:755 — good catch, fixed in f987eca. The chip-count check now loops over every effective candidate from the device-variant constraint (falling back to the primary when unconstrained), so a request with primary v6e-4 + device-variant EQ v6e-8 is rejected. Regression test added: test_launch_job_rejects_variant_override_with_smaller_primary.

XenonMolecule

Looks great! Thanks Russell!

rjpower added the agent-generated Created by automation/agent label Apr 15, 2026

chatgpt-codex-connector bot reviewed Apr 15, 2026

View reviewed changes

Comment thread lib/iris/src/iris/cluster/constraints.py Outdated

rjpower requested a review from XenonMolecule April 15, 2026 21:11

rjpower enabled auto-merge (squash) April 15, 2026 21:19

XenonMolecule approved these changes Apr 15, 2026

View reviewed changes

rjpower merged commit 31e8ee3 into main Apr 15, 2026
39 of 40 checks passed

rjpower deleted the claude/friendly-varahamihira branch April 15, 2026 23:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

iris, fray: reject TPU requests whose chip count doesn't match VM shape#4791

iris, fray: reject TPU requests whose chip count doesn't match VM shape#4791
rjpower merged 2 commits intomainfrom
claude/friendly-varahamihira

rjpower commented Apr 15, 2026

Uh oh!

claude bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

rjpower commented Apr 15, 2026

Uh oh!

XenonMolecule left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rjpower commented Apr 15, 2026

Summary

Test plan

Uh oh!

claude bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

rjpower commented Apr 15, 2026

Uh oh!

XenonMolecule left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude bot commented Apr 15, 2026 •

edited

Loading