Commit 31e8ee3
iris, fray: reject TPU requests whose chip count doesn't match VM shape (#4791)
## Summary
A TPU VM is the atomic scheduling unit, but neither the Fray client nor
the Iris controller rejected requests where the per-replica chip count
differed from the variant's `chips_per_vm`. The concrete failure mode
from the user report:
> I keep getting scheduled to nodes that already have something running
> on them... This particular job only needs 4 chips and says so in the
> request, but I included v6e-8 as a possible TPU type to claim.

A job submitted with `device_alternatives=["v6e-4", "v6e-8"]` passed
the old `vm_count`-only check (both are vm_count=1), reserved 4 chips
per replica against the primary, and then landed on a v6e-8 worker that
advertises 8 chips. The scheduler saw 4 free chips and co-scheduled a
second 4-chip job onto the same indivisible VM: two tenants colliding
on one JAX host.
The diagram:
```
with_tpu(["v6e-4", "v6e-8"])   ← old check passes (vm_count both = 1)
  primary = v6e-4 → reserve chips_per_vm = 4
        ↓
  lands on v6e-8 worker (advertises 8 chips, 1 VM)
  reserved=4, free=4   ✘ scheduler thinks VM is half-free
  → second 4-chip job co-scheduled onto the same VM → collision
```
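
The accounting failure in the diagram can be reproduced with a toy model. This is only an illustrative sketch: `TpuShape`, `Worker`, `old_check_passes`, and `place` are hypothetical names, not the Fray or Iris code, and the shape table contains only the two variants named in the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TpuShape:
    vm_count: int
    chips_per_vm: int


# Shapes as stated in the commit message: both variants are a single VM.
SHAPES = {"v6e-4": TpuShape(1, 4), "v6e-8": TpuShape(1, 8)}


@dataclass
class Worker:
    variant: str
    reserved: int = 0

    @property
    def free_chips(self) -> int:
        return SHAPES[self.variant].chips_per_vm - self.reserved


def old_check_passes(variants: list[str]) -> bool:
    # Assumed form of the old check: only vm_count has to agree.
    return len({SHAPES[v].vm_count for v in variants}) == 1


def place(worker: Worker, chips: int) -> bool:
    # Naive chip-level placement: succeeds whenever enough chips look free.
    if worker.free_chips < chips:
        return False
    worker.reserved += chips
    return True


worker = Worker("v6e-8")
accepted = old_check_passes(["v6e-4", "v6e-8"])  # mixed shapes slip through
first = place(worker, 4)   # reserved against the v6e-4 primary
second = place(worker, 4)  # a second tenant lands on the same VM
```

With chip-level accounting alone, both placements succeed, which is the collision the report describes.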
Tighten validation at both ends:
- **fray** (`ResourceConfig.with_tpu`): candidates must share both
`vm_count` _and_ `chips_per_vm`. `[v4-8, v5p-8]` (both 1×4) still works;
`[v6e-4, v6e-8]` (1×4 vs 1×8) now fails fast with a clear message.
- **iris** (`launch_job` ingestion): new `validate_tpu_request()` helper
runs right after constraint injection and rejects chip-count / VM-shape
mismatches with `INVALID_ARGUMENT`, so older or hand-rolled clients
can't bypass the fray-side check.

1 parent 913e579 commit 31e8ee3
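
The two checks from the summary can be sketched roughly as follows. The diff body is not reproduced here, so everything in this block is an assumption: `check_alternatives`, `validate_request`, and `InvalidArgument` are hypothetical stand-ins (the real helper is `validate_tpu_request()`, whose signature is not shown), and the shape table lists only the variants the commit message mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TpuShape:
    vm_count: int
    chips_per_vm: int


# Shapes as given in the commit message (1x4 and 1x8).
SHAPES = {
    "v4-8": TpuShape(1, 4),
    "v5p-8": TpuShape(1, 4),
    "v6e-4": TpuShape(1, 4),
    "v6e-8": TpuShape(1, 8),
}


class InvalidArgument(Exception):
    """Hypothetical stand-in for an RPC error carrying INVALID_ARGUMENT."""

    code = "INVALID_ARGUMENT"


def check_alternatives(variants: list[str]) -> TpuShape:
    """Client-side sketch: all candidates must share vm_count and chips_per_vm."""
    if not variants:
        raise ValueError("at least one TPU variant is required")
    shapes = {SHAPES[v] for v in variants}
    if len(shapes) != 1:
        detail = ", ".join(
            f"{v}={SHAPES[v].vm_count}x{SHAPES[v].chips_per_vm}" for v in variants
        )
        raise ValueError(
            f"TPU alternatives must share vm_count and chips_per_vm: {detail}"
        )
    return shapes.pop()


def validate_request(chips_per_replica: int, variants: list[str]) -> None:
    """Controller-side sketch: re-check the shape, then the chip count."""
    try:
        shape = check_alternatives(variants)
    except ValueError as exc:
        raise InvalidArgument(str(exc)) from exc
    if chips_per_replica != shape.chips_per_vm:
        raise InvalidArgument(
            f"requested {chips_per_replica} chips per replica, but each VM of "
            f"{variants} has {shape.chips_per_vm}"
        )
```

Repeating the check on the controller side is what keeps a client that skips `with_tpu` from reintroducing the mismatch.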
File tree
5 files changed, +178 -6 lines changed
- lib
  - fray
    - src/fray/v2
    - tests
  - iris
    - src/iris/cluster
      - controller
    - tests/cluster/controller