Skip to content

iris, fray: reject TPU requests whose chip count doesn't match VM shape#4791

Merged
rjpower merged 2 commits intomainfrom
claude/friendly-varahamihira
Apr 15, 2026
Merged

iris, fray: reject TPU requests whose chip count doesn't match VM shape#4791
rjpower merged 2 commits intomainfrom
claude/friendly-varahamihira

Conversation

@rjpower
Copy link
Copy Markdown
Collaborator

@rjpower rjpower commented Apr 15, 2026

Summary

A TPU VM is the atomic scheduling unit, but neither the Fray client nor the Iris controller rejected requests where the per-replica chip count differed from the variant's chips_per_vm. The concrete failure mode from the user report:

I keep getting scheduled to nodes that already have something running on them... This particular job only needs 4 chips and says so in the request, but I included v6e-8 as a possible TPU type to claim.

A job submitted with device_alternatives=[\"v6e-4\", \"v6e-8\"] passed the old vm_count-only check (both are vm_count=1), reserved 4 chips per replica against the primary, and then landed on a v6e-8 worker that advertises 8 chips. The scheduler saw 4 free chips and co-scheduled a second 4-chip job onto the same indivisible VM — two tenants colliding on one JAX host.

The diagram:

with_tpu([\"v6e-4\", \"v6e-8\"])  ← old check passes (vm_count both = 1)
  primary = v6e-4 → reserve chips_per_vm = 4
  ↓
lands on v6e-8 worker (advertises 8 chips, 1 VM)
  reserved=4, free=4 ✘ scheduler thinks VM is half-free
  → second 4-chip job co-scheduled onto the same VM → collision

Tighten validation at both ends:

  • fray (ResourceConfig.with_tpu): candidates must share both vm_count and chips_per_vm. [v4-8, v5p-8] (both 1×4) still works; [v6e-4, v6e-8] (1×4 vs 1×8) now fails fast with a clear message.
  • iris (launch_job ingestion): new validate_tpu_request() helper runs right after constraint injection and rejects chip-count / VM-shape mismatches with INVALID_ARGUMENT, so older or hand-rolled clients can't bypass the fray-side check.

Test plan

  • uv run pytest lib/fray/tests/test_v2_iris.py (24 pass; new cases for v6e-4 + v6e-8 rejection and v4-8 + v5p-8 acceptance)
  • uv run pytest lib/iris/tests/cluster/controller/test_service.py lib/iris/tests/cluster/test_constraints.py (82 pass; new cases for chip-count mismatch, mixed-shape rejection, matched-shape acceptance)
  • ./infra/pre-commit.py --all-files --fix

Closes the user report from Michael Ryan re: v6e-8 co-scheduling collision.

A TPU VM is the atomic scheduling unit, but neither the Fray client nor
the Iris controller rejected requests where the per-replica chip count
differed from the variant's chips_per_vm. The concrete failure mode:
a job submitted with device_alternatives=["v6e-4", "v6e-8"] (both have
vm_count=1, so the old check passed) would reserve 4 chips per replica
but land on a v6e-8 worker that advertises 8 chips. The scheduler then
saw 4 free chips and co-scheduled a second 4-chip job onto the same
indivisible VM, causing two tenants to collide on one JAX host.

Tighten validation at both ends:

- fray: `ResourceConfig.with_tpu()` now requires candidates to share
  both vm_count and chips_per_vm. Mixing v4-8 and v5p-8 (both 1x4)
  still works; mixing v6e-4 and v6e-8 (1x4 vs 1x8) now fails fast.

- iris: `launch_job` runs a new `validate_tpu_request()` check at RPC
  ingestion that rejects chip-count/VM-shape mismatches with
  INVALID_ARGUMENT, so older or hand-rolled clients can't bypass the
  fray-side check.
@rjpower rjpower added the agent-generated Created by automation/agent label Apr 15, 2026
@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude finished @rjpower's task in 2m 37s —— View job


Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Notes from the review:

  • Validation logic in validate_tpu_request correctly mirrors the tpu_device default (count == chips_per_vm), so the per-replica chip-count check at lib/iris/src/iris/cluster/constraints.py:761 matches the proto contract.

  • The local from iris.cluster.types import get_tpu_topology at line 740 is justified — iris.cluster.types already imports Constraint from iris.cluster.constraints, so a top-level import would create a cycle (matches existing precedent at line 1140).

  • Defense-in-depth at both fray (with_tpu) and iris (launch_job) is appropriate; older or hand-rolled clients can't bypass the check.

  • Test coverage covers the three meaningful axes: chip-count mismatch, mixed-shape alternatives rejection, and matched-shape (cross-generation) acceptance.

  • Check PR review-ability

  • Identify relevant AGENTS.md files

  • Summarize PR changes

  • Run review (bugs + AGENTS.md compliance)

  • Validate findings

  • Post review
    Branch

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 027be1e593

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread lib/iris/src/iris/cluster/constraints.py Outdated
An explicit `device-variant` constraint is authoritative for scheduling
(it replaces the auto-generated constraint from the primary variant),
so checking chips_per_vm only against the primary missed the case where
a request with primary v6e-4 and `device-variant EQ v6e-8` would still
schedule onto a v6e-8 VM while reserving only 4 of its 8 chips.

Loop the chip-count check over every effective candidate, and add a
regression test for the primary-v6e-4 / constraint-v6e-8 override case.
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 15, 2026

🤖 Re Codex P1 on constraints.py:755 — good catch, fixed in f987eca. The chip-count check now loops over every effective candidate from the device-variant constraint (falling back to the primary when unconstrained), so a request with primary v6e-4 + device-variant EQ v6e-8 is rejected. Regression test added: test_launch_job_rejects_variant_override_with_smaller_primary.

@rjpower rjpower requested a review from XenonMolecule April 15, 2026 21:11
@rjpower rjpower enabled auto-merge (squash) April 15, 2026 21:19
Copy link
Copy Markdown

@XenonMolecule XenonMolecule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks Russell!

@rjpower rjpower merged commit 31e8ee3 into main Apr 15, 2026
39 of 40 checks passed
@rjpower rjpower deleted the claude/friendly-varahamihira branch April 15, 2026 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants