TL;DR
When two coscheduled multi-host TPU jobs are dispatched on the same iris cluster within ~2–10 seconds of each other, requesting the same (TPU variant, region, priority band), the iris allocator can place both jobs on the exact same physical worker hosts. Both jobs then race on the JAX distributed coordinator port (8476), each preempting/restarting the other repeatedly. The job tasks accumulate thousands of preemption cycles, never produce stable training, and eventually iris terminates them with failures=1, preemptions=N (large), worker_failed=∼all.
This is not flakiness — it is a deterministic placement collision tied to simultaneous dispatch. Confirmed 4 times in 36 hours (2026-05-03 → 2026-05-04), on v5p-64, v5p-128, and v5p-256, with identical signatures. Estimated cost so far: ~1100 v5p host-hours of wasted compute plus ~1 day of operator misdiagnosis time.
The race window is at dispatch time, not just submission time. Two jobs submitted hours apart can still collide if their dispatch decisions land in the same iris scheduler tick when capacity opens (incident D below).
Failure mechanics — why identical placement is fatal
Each train_lm task on a TPU host expects to bind/reach port 8476 (endpoint_name=jax_coordinator) on the worker VM. The full bootstrap line:
When two jobs each bring up a task at the same host:port, exactly one binds; the other sees the port busy or sees an iris service-registry conflict on endpoint_name=jax_coordinator. iris recycles the loser (preemptions++). The recycled task starts up, sees the same conflict, dies again. Meanwhile peer tasks of both jobs are polling _poll_for_coordinator (lib/iris/src/iris/runtime/jax_init.py:172) with a 300s timeout. Some hit TimeoutError, marking their task as worker_failed. Eventually failures=1 propagates to the parent → terminal failure for both jobs.
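The bind conflict described above can be reproduced in isolation. This is a minimal sketch, not iris code: `try_bind` is a hypothetical helper, and an OS-assigned port on localhost stands in for port 8476 on a shared worker host.

```python
import errno
import socket
from typing import Optional


def try_bind(host: str, port: int) -> Optional[socket.socket]:
    """Return a listening socket, or None if the address is already in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


# Job A's coordinator binds first. Port 0 asks the OS for any free port,
# which stands in for the fixed coordinator port.
job_a = try_bind("127.0.0.1", 0)
port = job_a.getsockname()[1]

# Job B has been placed on the same host and asks for the same port.
job_b = try_bind("127.0.0.1", port)
collided = job_b is None

job_a.close()
```

Because both jobs keep the same placement, a restart of the losing task only repeats the second `try_bind` call against the same occupied address.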
Symptoms in logs:
ConnectError: Stale attempt: task .../train_lm/0 attempt 12 != current 14
TimeoutError: Timed out after 300.0s waiting for coordinator endpoint 'jax_coordinator'
RuntimeError: Build failed with exit_code=2 (cascade-induced, not root cause)
absl::Status: INVALID_ARGUMENT: Unexpected task registered with task_name=/job:/replica:0/task:7
absl::Status: ALREADY_EXISTS: Aborted connect attempt as there is a request from a newer incarnation
In incident B these errors appeared mid-run after 9.5 hours of uptime, not at startup. The collision is latent — both jobs may run nominally for hours, then a single host-level transient (e.g. one task's setup retry, a brief preemption) triggers the dual-coordinator race and the system never recovers because both jobs keep trying to bind the same port.
Reproducer signature for future detection
A failed run is a likely placement-collision victim if all of these hold:
Identical metrics on two failed jobs: preemptions and worker_failed match exactly between the two jobs.
train_lm dispatch times within ≤10 seconds of each other (NOT submission time — jobs queued for hours can still collide at dispatch).
Same TPU variant + region + priority band in their job_config rows.
Same iris allocator scheduler tick (verified by close started_at_ms on the train_lm child).
Bootstrap-log host comparison shows shared advertise_host IPs across task indices (gold-standard confirmation when logs are still available).
No higher-priority external user active on that pool (rules out preemption-eviction as the explanation; this happened on v5p-256 where no other tenant had jobs).
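The machine-checkable parts of this checklist (matching metrics, same pool tuple, close dispatch times) can be sketched as a small predicate. The `FailedJob` record and its field names are hypothetical, not the controller schema, and the region and priority band in the example are placeholders because the issue does not list them per job. The bootstrap-log and external-tenant checks still need a manual look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedJob:
    # Hypothetical record; field names are illustrative only.
    name: str
    tpu_variant: str
    region: str
    priority_band: int
    train_lm_started_at_ms: int
    preemptions: int
    worker_failed: int


def is_collision_suspect(a: FailedJob, b: FailedJob,
                         max_dispatch_gap_ms: int = 10_000) -> bool:
    """True when two failed jobs match the automatable signature checks."""
    same_pool = (a.tpu_variant, a.region, a.priority_band) == (
        b.tpu_variant, b.region, b.priority_band)
    same_metrics = (a.preemptions == b.preemptions
                    and a.worker_failed == b.worker_failed)
    gap_ms = abs(a.train_lm_started_at_ms - b.train_lm_started_at_ms)
    return same_pool and same_metrics and gap_ms <= max_dispatch_gap_ms


# Incident D numbers; region and priority_band are placeholder values.
job_1 = FailedJob("p67m33-lr0.67 resume2", "v5p-128", "us-east5", 1,
                  1777897156854, 1515, 15)
job_2 = FailedJob("p33m67-lr0.83", "v5p-128", "us-east5", 1,
                  1777897161741, 1515, 15)
suspect = is_collision_suspect(job_1, job_2)
```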
The preemptions count appears to scale with gang size, at roughly 88 to 98 cycles per task before iris terminates the job:

| Gang size | TPU | Preemption count |
| --- | --- | --- |
| 8 tasks | v5p-64 | 707 |
| 16 tasks | v5p-128 | 1515 |
| 32 tasks | v5p-256 | 3131 |
The 1515/3131 numbers are deterministic across multiple incidents (B, C have identical 3131; D has 1515 on both jobs). This may relate to max_retries_preemption=1000 in job_config, but the observed totals (707, 1515, 3131) do not line up with that cap, so the actual termination heuristic appears to be different.
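The arithmetic behind the per-task ratio can be checked directly. The sketch below uses only the three observed totals and the worker_failed counts reported for them (7, 15, and 31, one fewer than the gang size in each case). It shows that each total is exactly 101 times the number of failed workers. Whether this reflects a per-task limit inside iris is an assumption; the issue does not confirm it.

```python
# Observed totals from the incidents: gang size -> preemption count.
observed = {8: 707, 16: 1515, 32: 3131}

ratios = {}
for gang_size, preemptions in observed.items():
    failed_workers = gang_size - 1  # worker_failed was 7/8, 15/16, 31
    ratios[gang_size] = {
        "per_task": preemptions / gang_size,
        "per_failed_worker": preemptions / failed_workers,
    }

for gang_size, r in ratios.items():
    print(gang_size, round(r["per_task"], 1), r["per_failed_worker"])
```

The per-task figure drifts from about 88 to about 98 as the gang grows, while the per-failed-worker figure stays constant, which may be the more useful number when reading the controller source.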
Evidence — bootstrap log host comparison
The cleanest proof of the collision (incident B, v5p-256):
10/10 sampled task indices on the same physical hosts. Full set diff confirmed 31/31 match.
Why a third co-submitted job survived (incident B)
Best hypothesis without iris allocator source code:
iris commits placement decisions FIFO. The lr0.33 job in incident B was 2.3s ahead of lr0.5 on submission, so its placement was already committed before the lr0.5 / lr0.67 race.
When lr0.5 and lr0.67 entered the scheduler within 2.3s of each other, iris's allocator may not have yet flushed lr0.5's reservation when it processed lr0.67's request, allowing both to claim the same host set.
This hypothesis is consistent with the survivor pattern (3 simultaneous, oldest survives, last two collide) and with general scheduler race-condition shapes.
Mitigations / workarounds
Until the allocator is fixed, the operational workaround is to serialize coscheduled submissions that share the same (variant, region, priority) tuple:
Submit job A.
Wait for A's train_lm child to be in running state (iris job summary <job>/train_lm shows state: running and preemptions=0 or 1).
Submit job B.
Repeat per job.
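The steps above can be sketched as a small wrapper. Here `submit` and `is_running` are hypothetical callables supplied by the caller (for example, thin wrappers around whatever submission command and job-summary check the operator already uses); no iris API is assumed. Because the wrapper waits for the running state and not merely for submission, the next job is not in the queue while the previous one is still waiting to be dispatched.

```python
import time
from typing import Callable, Iterable, List


def submit_serialized(
    job_specs: Iterable[str],
    submit: Callable[[str], str],        # hypothetical: submits a spec, returns a job id
    is_running: Callable[[str], bool],   # hypothetical: True once train_lm is running
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Submit jobs one at a time, waiting for each to reach running state."""
    submitted: List[str] = []
    for spec in job_specs:
        job_id = submit(spec)
        deadline = time.monotonic() + timeout_seconds
        while not is_running(job_id):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{job_id} not running after {timeout_seconds}s")
            sleep(poll_seconds)
        submitted.append(job_id)
    return submitted
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays.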
Alternatives:
Submit at different priority bands so iris's allocator doesn't race them through the same scheduler tick.
Pin to different zones with --zone us-east5-a vs --zone us-east5-b (only works if the TPU pool has multiple zones with capacity).
Submit with delay: a wrapper that introduces a 60s sleep between consecutive submissions in the same TPU group. (Empirically, submissions ≥1 minute apart in incident B did not collide.)
Important: for pending-queue scenarios (jobs queued behind capacity), submission-time spacing is not sufficient — incident D shows two jobs submitted 3h47m apart still collided when iris finally dispatched both within 5s of each other.
Action items / open questions
Investigate iris allocator placement logic for races between concurrent coscheduled gang reservations. Suspected files:
lib/iris/src/iris/scheduler/ — coscheduling and placement logic
lib/iris/src/iris/cli/job.py — submission flow
Search for coscheduling_group_by, tpu-name constraint handling, host reservation commit semantics
Add a placement-uniqueness assertion at coschedule commit time: each host should have at most one active tpu-name coscheduling group at any given moment. If two reservations claim overlapping hosts, fail the second with Scheduler: host already reserved instead of double-booking.
Add a metric counting double-booked-host events per scheduling tick. Should be 0; the bug shows it can be >0.
Audit TASK_STATE_BUILDING transitions: are tasks being re-placed onto contested hosts during recycle? If so, the recycler should also check uniqueness.
Audit dispatch-queue race window: pending jobs that get dispatched in the same tick when capacity opens are exposed to this race even when their submissions were minutes/hours apart. The fix needs to apply at dispatch commit, not submission commit.
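The placement-uniqueness assertion and the double-booking metric proposed above could look roughly like this. It is a sketch under stated assumptions: `ReservationTable`, `HostAlreadyReserved`, and the host names are hypothetical, and a real scheduler would need the check and the write to happen inside one lock or transaction at dispatch commit.

```python
from typing import Dict, Iterable, Optional


class HostAlreadyReserved(Exception):
    """Raised when a reservation overlaps hosts held by another group."""


class ReservationTable:
    def __init__(self) -> None:
        self._owner_by_host: Dict[str, str] = {}
        self.double_book_attempts = 0  # metric: should stay at 0

    def commit(self, group: str, hosts: Iterable[str]) -> None:
        """Reserve all hosts for the group, or none of them."""
        hosts = list(hosts)
        conflicts = sorted(
            h for h in hosts if self._owner_by_host.get(h, group) != group
        )
        if conflicts:
            self.double_book_attempts += 1
            raise HostAlreadyReserved(
                f"Scheduler: host already reserved: {conflicts}"
            )
        for h in hosts:
            self._owner_by_host[h] = group

    def release(self, group: str) -> None:
        """Drop every host held by the group."""
        self._owner_by_host = {
            h: g for h, g in self._owner_by_host.items() if g != group
        }

    def owner_of(self, host: str) -> Optional[str]:
        return self._owner_by_host.get(host)
```

With this shape, the second of two overlapping gang reservations fails fast at commit time, and the counter records that an overlap was attempted.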
Database schema pointers (controller SQLite)
jobs — job lifecycle states (1=submitted/pending, 3=running, 4=succeeded, 5=failed, 6=cancelled)
job_config — submission spec including coscheduling_group_by, priority_band, res_device_json, constraints_json
tasks — task-level state per job
task_attempts — each recycle creates a new attempt; this is where the 1515/3131 preemption counts accumulate
task_resource_history — host bindings per attempt; might reveal the placement decision tree
worker_task_history — inverse: which tasks a worker has seen
dispatch_queue — pending dispatch decisions; race window candidate
Useful diagnostic queries
-- Find pairs of failed sibling jobs started within 10 s of each other (collision suspects).
-- Compare the reported preemptions and failures of each pair by hand.
-- Assumes preemptions and failures are columns on jobs; adjust if they live in another table.
SELECT a.name AS job_a, b.name AS job_b, a.preemptions, a.failures
FROM jobs a
JOIN jobs b ON a.parent_job_id = b.parent_job_id
WHERE a.state = 5 AND b.state = 5
  AND a.job_id < b.job_id
  AND ABS(a.started_at_ms - b.started_at_ms) < 10000
LIMIT 20;

-- Find dispatch ticks: starts grouped into 5-second buckets
SELECT
  started_at_ms / 5000 AS tick_5s,
  COUNT(*) AS n_started,
  GROUP_CONCAT(name, ' | ') AS jobs_started
FROM jobs
WHERE state IN (3, 4, 5) AND started_at_ms IS NOT NULL
GROUP BY tick_5s
HAVING n_started >= 2
ORDER BY tick_5s DESC
LIMIT 20;
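One caveat with fixed 5-second buckets is that two starts a few seconds apart can fall on either side of a bucket boundary. The sketch below uses Python's sqlite3 module with an in-memory table to show this for the incident D dispatch timestamps, alongside a pairwise comparison that does catch them. The three-column jobs table is a hypothetical stand-in for the controller schema, not a copy of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal table; the real controller schema has more columns.
conn.execute("CREATE TABLE jobs (name TEXT, state INTEGER, started_at_ms INTEGER)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("p67m33-lr0.67/train_lm", 5, 1777897156854),
        ("p33m67-lr0.83/train_lm", 5, 1777897161741),
    ],
)

# Fixed buckets: the two starts land in adjacent 5-second buckets.
bucketed = conn.execute(
    """
    SELECT started_at_ms / 5000 AS tick_5s, COUNT(*) AS n_started
    FROM jobs
    WHERE started_at_ms IS NOT NULL
    GROUP BY tick_5s
    HAVING COUNT(*) >= 2
    """
).fetchall()

# Pairwise comparison: any two starts less than 10 s apart.
pairwise = conn.execute(
    """
    SELECT a.name, b.name, ABS(a.started_at_ms - b.started_at_ms) AS gap_ms
    FROM jobs a
    JOIN jobs b ON a.rowid < b.rowid
    WHERE ABS(a.started_at_ms - b.started_at_ms) < 10000
    """
).fetchall()

conn.close()
```

The pairwise form is quadratic in the number of rows, so on a large jobs table it may be worth restricting it to failed jobs or to a recent time range first.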
Cost estimate
| Incident | TPU | Wasted host-hours |
| --- | --- | --- |
| A (2026-05-03 ~01:21 UTC) | v5p-64 | ~1.3 |
| B (2026-05-03 ~01:38 UTC, ran 9.5h thrash) | v5p-256 | ~608 |
| C (2026-05-03 ~01:41 UTC, mid-run failure) | v5p-256 | tens |
| D (2026-05-04 12:19 UTC dispatch) | v5p-128 | ~11 |
Combined ~1100+ v5p host-hours plus ~1 day of operator time on misdiagnosis (initially blamed on external priority-band-2 contention from another tenant; the placement collision is the actual root cause).
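As a worked example of the host-hours unit, the incident B figure can be reproduced from numbers stated elsewhere in this issue: two failed jobs, a 32-task gang on v5p-256, and 9.5 hours of runtime. Treating one task as one host and charging both jobs for the full 9.5 hours are assumptions made for illustration; the other rows are not derived here.

```python
def host_hours(jobs: int, hosts_per_job: int, hours: float) -> float:
    """Host-hours consumed by a set of identical gang jobs."""
    return jobs * hosts_per_job * hours


# Incident B: lr0.5 and lr0.67, 32 hosts each, 9.5 hours.
incident_b = host_hours(jobs=2, hosts_per_job=32, hours=9.5)
```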
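Related
Internal postmortem: .agents/ops/iris_placement_bug.md in the midtraining worktree (this issue is lifted from that postmortem).
.agents/ops/2026-05-02-delphi-midtrain-resume-namespace.md.
Addendum: affected job IDs
All jobs are under /ahmedah/ on the marin iris cluster.
Incident A — v5p-64, 2026-05-03 ~01:21 UTC
Two simultaneous submissions, both failed with preemptions=707, worker_failed=7/8 each:
delphi-1e21-p67m33-lr0p5-batch64-resume2-20260503-012053
delphi-1e21-p67m33-lr0p67-batch64-resume2-20260503-012053
Submission times within seconds of each other; bootstrap logs showed 8/8 task indices on identical hosts. Compounded by /moojink/ priority-band-2 batch contention on the same TPU pool, which contributed external preemption pressure but is not the root cause.
Earlier failed v5p-64 runs from the same session that almost certainly hit the same bug class:
delphi-1e21-p67m33-lr0p5-batch64-resume-20260502-151756
delphi-1e21-p67m33-lr0p67-batch64-resume-20260502-161841
20260503-014225
Incident B — v5p-256, 2026-05-03 submitted ~01:38 UTC, ran 09:14–18:43 UTC
Three p50m50 jobs in a tight ~3.6s submission window. Oldest survived; younger two collided.
Submission timestamps (epoch ms):
train_lm dispatch:
End-state metrics on the two failed jobs (failed mid-run after 9.5h of clean uptime): preemptions=3131, worker_failed=31 on both.
Bootstrap log diff: 31/31 hosts identical between lr0.5 and lr0.67.
Failed:
delphi-1e22-p50m50-lr0p5-batch256-20260503-013817
delphi-1e22-p50m50-lr0p67-batch256-20260503-013817
Survived (control, ahead by 2.3s):
delphi-1e22-p50m50-lr0p33-batch256-20260503-013817 (preemptions=1, worker_failed=0)
Other v5p-256 1e22 jobs in the same general fleshing-out batch, all submitted with ≥1 minute gaps from each other and all running fine, which supports the bug being specific to the simultaneous-dispatch window:
delphi-1e22-p50m50-lr0p83-batch256-20260503-013949
delphi-1e22-p67m33-lr0p83-batch256-20260503-014108
delphi-1e22-p33m67-lr0p83-batch256-20260503-014108 (this one later failed; see incident C)
Incident C — v5p-256, 2026-05-03 ~01:41 UTC submission
Two jobs submitted within ~2.6 seconds. One survived; the other collided with the survivor (or with the lr0.5/lr0.67 jobs from incident B, which were still in their thrash phase; exact pairing unclear).
Submission timestamps (epoch ms):
End-state metrics:
Identical preemptions=3131, worker_failed=31 to incident B's lr0.5/lr0.67 case.
Failed:
delphi-1e22-p33m67-lr0p83-batch256-20260503-014108 (last permanent checkpoint at step-6112 in gs://marin-us-east5/checkpoints/delphi-1e22-p33m67-32p07b-lr0.83-78fd44/checkpoints/; salvageable via resume)
Survived (control):
delphi-1e22-p67m33-lr0p83-batch256-20260503-014108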
Incident D — v5p-128, dispatch-time collision (2026-05-04 12:19 UTC)
The dispatch-time-collision variant. Two jobs submitted 3h47m apart but iris dispatched both within ~5s of each other when v5p-128 capacity finally opened.
Submission timestamps (epoch ms):
p67m33-lr0.67 resume2 submitted: 1777824878720 (16:14:38 UTC May 3)
p33m67-lr0.83 submitted: 1777838540690 (20:02:20 UTC May 3)
→ 3h 47m gap at submission
train_lm dispatch:
p67m33-lr0.67 train_lm started: 1777897156854 (12:19:16 UTC May 4)
p33m67-lr0.83 train_lm started: 1777897161741 (12:19:21 UTC May 4)
→ 4.9s gap at dispatch
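The gaps above can be recomputed from the epoch-millisecond values with the standard library. This is a small sketch; `utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def utc(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


submitted = {"p67m33-lr0.67 resume2": 1777824878720,
             "p33m67-lr0.83": 1777838540690}
started = {"p67m33-lr0.67 resume2": 1777897156854,
           "p33m67-lr0.83": 1777897161741}

submission_gap_ms = submitted["p33m67-lr0.83"] - submitted["p67m33-lr0.67 resume2"]
dispatch_gap_ms = started["p33m67-lr0.83"] - started["p67m33-lr0.67 resume2"]

for name in submitted:
    print(name, utc(submitted[name]).isoformat(), utc(started[name]).isoformat())
print(timedelta(milliseconds=submission_gap_ms), dispatch_gap_ms / 1000)
```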
End-state metrics:
| Metric | p67m33-lr0.67 resume2 | p33m67-lr0.83 |
| --- | --- | --- |
| state | failed | failed |
| failures | 1 | 1 |
| preemptions | 1515 | 1515 |
| worker_failed | 15/16 | 15/16 |
| train_lm wall lifetime | 21 min | 37 min |
The 1515 number is consistent with a 16-task gang on v5p-128 going through ~95 thrash cycles per task (1515/16≈95), in line with the ratios on v5p-64 (707/8≈88) and v5p-256 (3131/32≈98).
Failed:
delphi-1e21-p67m33-lr0p67-batch128-resume2-20260503-161414 (was attempting to resume from gs://marin-us-east5/checkpoints/delphi-1e21-p67m33-9p25b-lr0.67-ecbd27/, latest perm step-2646)
delphi-1e21-p33m67-lr0p83-batch128-20260503-200203 (no checkpoint saved; would need fresh run from base — namespace gs://marin-us-east5/checkpoints/delphi-1e21-p33m67-9p25b-lr0.83-0cb048/ with empty checkpoints/)
🤖 Filed by Claude Code on behalf of @ahmeda14960 . Source: .agents/ops/iris_placement_bug.md in the midtrain_data worktree.