iris, fray: reject TPU requests whose chip count doesn't match VM shape#4791
A TPU VM is the atomic scheduling unit, but neither the Fray client nor the Iris controller rejected requests where the per-replica chip count differed from the variant's `chips_per_vm`. The concrete failure mode: a job submitted with `device_alternatives=["v6e-4", "v6e-8"]` (both have vm_count=1, so the old check passed) would reserve 4 chips per replica but land on a v6e-8 worker that advertises 8 chips. The scheduler then saw 4 free chips and co-scheduled a second 4-chip job onto the same indivisible VM, causing two tenants to collide on one JAX host.

Tighten validation at both ends:

- fray: `ResourceConfig.with_tpu()` now requires candidates to share both `vm_count` and `chips_per_vm`. Mixing v4-8 and v5p-8 (both 1x4) still works; mixing v6e-4 and v6e-8 (1x4 vs 1x8) now fails fast.
- iris: `launch_job` runs a new `validate_tpu_request()` check at RPC ingestion that rejects chip-count/VM-shape mismatches with INVALID_ARGUMENT, so older or hand-rolled clients can't bypass the fray-side check.
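The client-side rule can be sketched roughly as follows. This is an illustration only, not the code in this PR: `TpuShape`, `TPU_SHAPES`, and `check_alternatives_share_shape` are hypothetical names, and the table holds only the four shapes quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TpuShape:
    vm_count: int
    chips_per_vm: int


# Hypothetical lookup table; the shapes are the ones quoted in this PR description.
TPU_SHAPES = {
    "v4-8": TpuShape(vm_count=1, chips_per_vm=4),
    "v5p-8": TpuShape(vm_count=1, chips_per_vm=4),
    "v6e-4": TpuShape(vm_count=1, chips_per_vm=4),
    "v6e-8": TpuShape(vm_count=1, chips_per_vm=8),
}


def check_alternatives_share_shape(variants: list[str]) -> TpuShape:
    """Return the shape shared by all candidates, or raise ValueError."""
    if not variants:
        raise ValueError("at least one TPU variant is required")
    primary = TPU_SHAPES[variants[0]]
    # Compare both fields: vm_count alone lets v6e-4 and v6e-8 through.
    mismatched = [v for v in variants[1:] if TPU_SHAPES[v] != primary]
    if mismatched:
        raise ValueError(
            f"TPU alternatives must share vm_count and chips_per_vm; "
            f"{variants[0]} is {primary.vm_count}x{primary.chips_per_vm}, "
            f"but {mismatched} differ"
        )
    return primary


# Matching shapes are accepted; mixed shapes raise before any job is submitted.
check_alternatives_share_shape(["v4-8", "v5p-8"])
```

Comparing the whole shape rather than `vm_count` alone is what separates the accepted `[v4-8, v5p-8]` case from the rejected `[v6e-4, v6e-8]` case.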
Claude finished @rjpower's task in 2m 37s.

Code review: No issues found. Checked for bugs and CLAUDE.md compliance. Notes from the review:
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 027be1e593
An explicit `device-variant` constraint is authoritative for scheduling (it replaces the auto-generated constraint from the primary variant), so checking chips_per_vm only against the primary missed the case where a request with primary v6e-4 and `device-variant EQ v6e-8` would still schedule onto a v6e-8 VM while reserving only 4 of its 8 chips. Loop the chip-count check over every effective candidate, and add a regression test for the primary-v6e-4 / constraint-v6e-8 override case.
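A minimal sketch of the override case, assuming a simplified model in which an explicit `device-variant` constraint fully replaces the primary variant as the candidate set. The names `CHIPS_PER_VM`, `effective_candidates`, and `check_chip_count` are hypothetical and do not reflect the actual signature of `validate_tpu_request()`; `ValueError` stands in for an INVALID_ARGUMENT response.

```python
from collections.abc import Sequence

# Hypothetical chips-per-VM table; values are the ones discussed in this PR.
CHIPS_PER_VM = {"v6e-4": 4, "v6e-8": 8}


def effective_candidates(primary: str, constraint_variants: Sequence[str]) -> list[str]:
    """Variants the job can actually land on.

    An explicit device-variant constraint is treated as authoritative and
    replaces the candidate derived from the primary variant.
    """
    return list(constraint_variants) if constraint_variants else [primary]


def check_chip_count(
    chips_per_replica: int,
    primary: str,
    constraint_variants: Sequence[str] = (),
) -> None:
    """Raise ValueError unless every effective candidate matches the request."""
    for variant in effective_candidates(primary, constraint_variants):
        expected = CHIPS_PER_VM[variant]
        if chips_per_replica != expected:
            raise ValueError(
                f"request reserves {chips_per_replica} chips per replica, "
                f"but {variant} VMs have {expected} chips"
            )


# Passes: the primary is the only candidate and its chip count matches.
check_chip_count(4, "v6e-4")
```

Checking only `primary` would accept `check_chip_count(4, "v6e-4", ["v6e-8"])`, which is the regression case described above; looping over the effective candidates rejects it.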
🤖 Re Codex P1 on constraints.py:755 — good catch, fixed in f987eca. The chip-count check now loops over every effective candidate from the |
Summary
A TPU VM is the atomic scheduling unit, but neither the Fray client nor the Iris controller rejected requests where the per-replica chip count differed from the variant's `chips_per_vm`. The concrete failure mode from the user report:

A job submitted with `device_alternatives=["v6e-4", "v6e-8"]` passed the old `vm_count`-only check (both are vm_count=1), reserved 4 chips per replica against the primary, and then landed on a v6e-8 worker that advertises 8 chips. The scheduler saw 4 free chips and co-scheduled a second 4-chip job onto the same indivisible VM: two tenants colliding on one JAX host.

The diagram:

Tighten validation at both ends:

- fray (`ResourceConfig.with_tpu`): candidates must share both `vm_count` and `chips_per_vm`. `[v4-8, v5p-8]` (both 1×4) still works; `[v6e-4, v6e-8]` (1×4 vs 1×8) now fails fast with a clear message.
- iris (`launch_job` ingestion): new `validate_tpu_request()` helper runs right after constraint injection and rejects chip-count / VM-shape mismatches with `INVALID_ARGUMENT`, so older or hand-rolled clients can't bypass the fray-side check.

Test plan

- `uv run pytest lib/fray/tests/test_v2_iris.py` (24 pass; new cases for `v6e-4 + v6e-8` rejection and `v4-8 + v5p-8` acceptance)
- `uv run pytest lib/iris/tests/cluster/controller/test_service.py lib/iris/tests/cluster/test_constraints.py` (82 pass; new cases for chip-count mismatch, mixed-shape rejection, matched-shape acceptance)
- `./infra/pre-commit.py --all-files --fix`

Closes the user report from Michael Ryan re: `v6e-8` co-scheduling collision.