
[iris] Coscheduler places concurrently-dispatched gangs on identical host sets → JAX coordinator port death loop #5470

Description

@ahmeda14960

TL;DR

When two coscheduled multi-host TPU jobs are dispatched on the same iris cluster within ~2–10 seconds of each other and request the same (TPU variant, region, priority band), the iris allocator can place both jobs on the exact same physical worker hosts. Both jobs then race on the JAX distributed coordinator port (8476), each repeatedly preempting and restarting the other. The tasks accumulate thousands of preemption cycles, never reach stable training, and iris eventually terminates them with failures=1, preemptions=N (large), and worker_failed on nearly all tasks.

This is not flakiness — it is a deterministic placement collision tied to simultaneous dispatch. Confirmed 4 times in 36 hours (2026-05-03 → 2026-05-04), on v5p-64, v5p-128, and v5p-256, with identical signatures. Estimated cost so far: ~1100 v5p host-hours of wasted compute plus ~1 day of operator misdiagnosis time.

The race window is at dispatch time, not just submission time. Two jobs submitted hours apart can still collide if their dispatch decisions land in the same iris scheduler tick when capacity opens (incident D below).

Failure mechanics — why identical placement is fatal

Each train_lm task on a TPU host expects to bind/reach port 8476 (endpoint_name=jax_coordinator) on the worker VM. The full bootstrap line:

iris.runtime.jax_init initialize_jax bootstrap inputs:
  task_index=0 num_tasks=32 advertise_host=10.202.0.195
  ports={} endpoint_name=jax_coordinator requested_port=8476
  env={'IRIS_TASK_ID': '/ahmedah/.../train_lm/0:N', 'IRIS_NUM_TASKS': '32'}

When two jobs each bring up a task at the same host:port, exactly one binds; the other sees the port busy or sees an iris service-registry conflict on endpoint_name=jax_coordinator. iris recycles the loser (preemptions++). The recycled task starts up, sees the same conflict, dies again. Meanwhile peer tasks of both jobs are polling _poll_for_coordinator (lib/iris/src/iris/runtime/jax_init.py:172) with a 300s timeout. Some hit TimeoutError, marking their task as worker_failed. Eventually failures=1 propagates to the parent → terminal failure for both jobs.
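
The port half of this failure can be reproduced locally. The sketch below is only an illustration of "exactly one binds": it uses plain sockets on loopback with an OS-assigned port standing in for 8476, and the helper names (bind_coordinator, second_bind_result) are hypothetical. It does not model the iris service-registry conflict on endpoint_name=jax_coordinator.

```python
import errno
import socket
from typing import Optional, Tuple


def bind_coordinator(host: str, port: int) -> socket.socket:
    """Bind and listen on host:port, as a coordinator process would."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_result(host: str = "127.0.0.1") -> Tuple[bool, Optional[int]]:
    """Return (conflict, errno) for a second bind to an already-bound port."""
    # Port 0 lets the OS pick a free port; it stands in for 8476 so the demo
    # does not disturb a real coordinator.
    winner = bind_coordinator(host, 0)
    port = winner.getsockname()[1]
    try:
        loser = bind_coordinator(host, port)
    except OSError as exc:
        # The second "job" loses the race for the port.
        return True, exc.errno
    else:
        loser.close()
        return False, None
    finally:
        winner.close()


conflict, code = second_bind_result()
print(conflict, code == errno.EADDRINUSE)
```

If the loser is restarted onto the same host while the winner is still bound, it hits the same error again, which matches the recycle loop described above.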

Symptoms in logs:

ConnectError: Stale attempt: task .../train_lm/0 attempt 12 != current 14
TimeoutError: Timed out after 300.0s waiting for coordinator endpoint 'jax_coordinator'
RuntimeError: Build failed with exit_code=2  (cascade-induced, not root cause)
absl::Status: INVALID_ARGUMENT: Unexpected task registered with task_name=/job:/replica:0/task:7
absl::Status: ALREADY_EXISTS: Aborted connect attempt as there is a request from a newer incarnation

In incident B these errors appeared mid-run after 9.5 hours of uptime, not at startup. The collision is latent — both jobs may run nominally for hours, then a single host-level transient (e.g. one task's setup retry, a brief preemption) triggers the dual-coordinator race and the system never recovers because both jobs keep trying to bind the same port.

Reproducer signature for future detection

A failed run is a likely placement-collision victim if all of these hold:

  1. Identical metrics on two failed jobs: preemptions and worker_failed match exactly between the two jobs.
  2. train_lm dispatch times within ≤10 seconds of each other (NOT submission time — jobs queued for hours can still collide at dispatch).
  3. Same TPU variant + region + priority band in their job_config rows.
  4. Same iris allocator scheduler tick (verified by close started_at_ms on the train_lm child).
  5. Bootstrap-log host comparison shows shared advertise_host IPs across task indices (gold-standard confirmation when logs are still available).
  6. No higher-priority external user active on that pool (rules out preemption-eviction as the explanation; this happened on v5p-256 where no other tenant had jobs).
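
The checklist above can be turned into a small triage helper. This is a sketch under assumptions: FailedJob and its field names are hypothetical (not the iris schema), criterion 4 is approximated by the dispatch-gap check, and criterion 6 (no external tenant) is not modelled because it needs pool-level data.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class FailedJob:
    # Hypothetical record; field names are illustrative only.
    name: str
    tpu_variant: str
    region: str
    priority_band: int
    train_lm_started_at_ms: int
    preemptions: int
    worker_failed: int
    hosts: FrozenSet[str]  # advertise_host values from the bootstrap logs


def collision_signals(a: FailedJob, b: FailedJob, window_ms: int = 10_000) -> Dict[str, bool]:
    """Evaluate criteria 1, 2/4, 3 and 5 of the checklist for a pair of jobs."""
    return {
        "identical_metrics": (a.preemptions, a.worker_failed)
        == (b.preemptions, b.worker_failed),
        "dispatch_within_window": abs(a.train_lm_started_at_ms - b.train_lm_started_at_ms)
        <= window_ms,
        "same_pool": (a.tpu_variant, a.region, a.priority_band)
        == (b.tpu_variant, b.region, b.priority_band),
        "shared_hosts": bool(a.hosts & b.hosts),
    }


# Timestamps and metrics are incident D's; region, priority band and host IPs
# are made-up placeholders because the issue does not list them for D.
job_a = FailedJob("p67m33-lr0.67", "v5p-128", "placeholder-region", 3,
                  1777897156854, 1515, 15, frozenset({"10.0.0.1", "10.0.0.2"}))
job_b = FailedJob("p33m67-lr0.83", "v5p-128", "placeholder-region", 3,
                  1777897161741, 1515, 15, frozenset({"10.0.0.1", "10.0.0.2"}))

signals = collision_signals(job_a, job_b)
print(signals, all(signals.values()))
```

A pair for which every signal is true is a strong suspect; the bootstrap-log host comparison remains the decisive check.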

The preemptions count appears to scale with gang size, at roughly 88–98 cycles per task before iris terminates the job:

Gang size   TPU       Preemption count
8 tasks     v5p-64    707
16 tasks    v5p-128   1515
32 tasks    v5p-256   3131

The 1515 and 3131 counts repeat exactly across incidents (B and C both show 3131; D shows 1515 on both jobs). This may relate to max_retries_preemption=1000 in job_config, but the totals do not line up with that cap (707 is below it; 1515 and 3131 are above it), so the actual termination heuristic appears to be different.
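
The ratios can be recomputed from the reported metrics. Dividing by the number of failed tasks (worker_failed of 7, 15 and 31) instead of gang size gives exactly 101 in all three cases. That might point at a fixed per-task attempt limit, but this interpretation is an assumption and has not been checked against iris source.

```python
# (gang size, worker_failed, preemptions), taken from the incident metrics in this issue
incidents = {
    "v5p-64": (8, 7, 707),
    "v5p-128": (16, 15, 1515),
    "v5p-256": (32, 31, 3131),
}

# Preemptions averaged over every task in the gang.
per_task = {tpu: p / gang for tpu, (gang, _, p) in incidents.items()}
# Preemptions averaged over only the tasks counted in worker_failed.
per_failed_task = {tpu: p / failed for tpu, (_, failed, p) in incidents.items()}

for tpu in incidents:
    print(f"{tpu}: {per_task[tpu]:.1f} per task, "
          f"{per_failed_task[tpu]:.1f} per failed task")
```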

Evidence — bootstrap log host comparison

The cleanest proof of the collision (incident B, v5p-256):

diff -q <(grep advertise_host lr0.5_log | sort -u) \
        <(grep advertise_host lr0.67_log | sort -u)
→ "IDENTICAL host sets" (31/31 hosts shared)

Verbatim host assignments for the two failed jobs:

lr0.5  task 0  advertise_host=10.202.0.195    lr0.67 task 0  advertise_host=10.202.0.195
lr0.5  task 10 advertise_host=10.202.0.219    lr0.67 task 10 advertise_host=10.202.0.219
lr0.5  task 11 advertise_host=10.202.1.5      lr0.67 task 11 advertise_host=10.202.1.5
lr0.5  task 12 advertise_host=10.202.0.185    lr0.67 task 12 advertise_host=10.202.0.185
lr0.5  task 13 advertise_host=10.202.0.222    lr0.67 task 13 advertise_host=10.202.0.222
lr0.5  task 14 advertise_host=10.202.1.85     lr0.67 task 14 advertise_host=10.202.1.85
lr0.5  task 15 advertise_host=10.202.1.1      lr0.67 task 15 advertise_host=10.202.1.1
lr0.5  task 16 advertise_host=10.202.1.82     lr0.67 task 16 advertise_host=10.202.1.82
lr0.5  task 17 advertise_host=10.202.1.75     lr0.67 task 17 advertise_host=10.202.1.75
lr0.5  task 18 advertise_host=10.202.0.247    lr0.67 task 18 advertise_host=10.202.0.247

10/10 sampled task indices on the same physical hosts. Full set diff confirmed 31/31 match.
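
The same comparison can be done per task index rather than as a set diff. A minimal sketch, assuming each bootstrap line carries task_index and advertise_host on the same line as in the excerpt above; host_map and shared_placements are hypothetical helper names:

```python
import re
from typing import Dict

# Matches e.g. "task_index=0 num_tasks=32 advertise_host=10.202.0.195"
BOOTSTRAP_RE = re.compile(
    r"task_index=(\d+)\b.*?\badvertise_host=(\d+(?:\.\d+){3})"
)


def host_map(log_text: str) -> Dict[int, str]:
    """Map task index to advertise_host; later attempts overwrite earlier ones."""
    return {int(m.group(1)): m.group(2) for m in BOOTSTRAP_RE.finditer(log_text)}


def shared_placements(log_a: str, log_b: str) -> Dict[int, str]:
    """Task indices that both jobs placed on the same host."""
    a, b = host_map(log_a), host_map(log_b)
    return {i: a[i] for i in sorted(a.keys() & b.keys()) if a[i] == b[i]}


# Two lines per job, using host values from the listing above.
lr05_log = (
    "  task_index=0 num_tasks=32 advertise_host=10.202.0.195\n"
    "  task_index=10 num_tasks=32 advertise_host=10.202.0.219\n"
)
lr067_log = (
    "  task_index=0 num_tasks=32 advertise_host=10.202.0.195\n"
    "  task_index=10 num_tasks=32 advertise_host=10.202.0.219\n"
)

print(shared_placements(lr05_log, lr067_log))
```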

Why a third co-submitted job survived (incident B)

Best hypothesis without iris allocator source code:

  • iris commits placement decisions FIFO. The lr0.33 job in incident B was ~2.4s ahead of lr0.5 on submission, so its placement was already committed before the lr0.5 / lr0.67 race.
  • When lr0.5 and lr0.67 entered the scheduler within 2.3s of each other, iris's allocator may not yet have flushed lr0.5's reservation when it processed lr0.67's request, allowing both to claim the same host set.

This is consistent with the survivor pattern (3 simultaneous, oldest survives, last two collide) and with general scheduler race-condition shapes.
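
The hypothesised race shape can be illustrated with a toy allocator. This is not iris code: both functions and the host names are hypothetical, and the sketch only contrasts "every request in a tick reads the same free-host snapshot" with "each placement is committed before the next request is considered".

```python
from typing import Dict, List, Sequence, Tuple

Request = Tuple[str, int]  # (job name, number of hosts needed)


def place_from_snapshot(free_hosts: Sequence[str],
                        requests: Sequence[Request]) -> Dict[str, List[str]]:
    """Racy shape: all requests pick from one snapshot, commits happen later."""
    snapshot = sorted(free_hosts)
    return {job: snapshot[:n] for job, n in requests if n <= len(snapshot)}


def place_with_commit(free_hosts: Sequence[str],
                      requests: Sequence[Request]) -> Dict[str, List[str]]:
    """Safe shape: each placement removes its hosts before the next request."""
    remaining = sorted(free_hosts)
    placed: Dict[str, List[str]] = {}
    for job, n in requests:
        if n > len(remaining):
            continue  # not enough hosts left; the job stays pending
        placed[job], remaining = remaining[:n], remaining[n:]
    return placed


free = [f"host-{i:02d}" for i in range(8)]
tick = [("lr0.5", 4), ("lr0.67", 4)]

racy = place_from_snapshot(free, tick)
safe = place_with_commit(free, tick)
print(racy["lr0.5"] == racy["lr0.67"])           # both jobs got the same hosts
print(set(safe["lr0.5"]) & set(safe["lr0.67"]))  # no overlap
```

In this picture the lr0.33 job survives because its hosts were already removed from the free set before the tick that handled the other two.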

Mitigations / workarounds

Until the allocator is fixed, the operational workaround is to serialize coscheduled submissions across the same (variant, region, priority) tuple:

  1. Submit job A.
  2. Wait for A's train_lm child to be in running state (iris job summary <job>/train_lm shows state: running and preemptions=0 or 1).
  3. Submit job B.
  4. Repeat per job.

Alternatives:

  • Submit at different priority bands so iris's allocator doesn't race them through the same scheduler tick.
  • Pin to different zones with --zone us-east5-a vs --zone us-east5-b (only works if the TPU pool has multiple zones with capacity).
  • Submit with delay: a wrapper that introduces a 60s sleep between consecutive submissions in the same TPU group. (Empirically, submissions ≥1 minute apart in incident B did not collide.)

Important: for pending-queue scenarios (jobs queued behind capacity), submission-time spacing is not sufficient — incident D shows two jobs submitted 3h47m apart still collided when iris finally dispatched both within 5s of each other.
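
A wrapper for the serialize-and-wait workaround might look like the sketch below. It gates on the running state rather than on elapsed time, so it also covers the pending-queue case. The submit and get_state callables are hypothetical hooks that the operator would implement (for example around iris job summary); no iris API is assumed here, and the preemptions=0 or 1 check from step 2 is left out.

```python
import time
from typing import Callable, List, Sequence

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def submit_serialized(
    jobs: Sequence[str],
    submit: Callable[[str], str],     # hypothetical: submits a job, returns its id
    get_state: Callable[[str], str],  # hypothetical: returns the train_lm child state
    poll_s: float = 30.0,
    timeout_s: float = 6 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Submit jobs one at a time; submit the next only once the previous runs."""
    started: List[str] = []
    for job in jobs:
        job_id = submit(job)
        deadline = clock() + timeout_s
        while True:
            state = get_state(job_id)
            if state == "running":
                break
            if state in TERMINAL_STATES:
                raise RuntimeError(f"{job_id} reached {state!r} before running")
            if clock() >= deadline:
                raise TimeoutError(f"{job_id} not running after {timeout_s}s")
            sleep(poll_s)
        started.append(job_id)
    return started
```

Because the next submission waits for the previous train_lm child to be running, two jobs from the same wrapper should not reach dispatch in the same tick, at the cost of holding later jobs out of the queue while capacity is unavailable.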

Action items / open questions

  1. Investigate iris allocator placement logic for races between concurrent coscheduled gang reservations. Suspected files:

    • lib/iris/src/iris/scheduler/ — coscheduling and placement logic
    • lib/iris/src/iris/cli/job.py — submission flow
    • Search for coscheduling_group_by, tpu-name constraint handling, host reservation commit semantics
  2. Add a placement-uniqueness assertion at coschedule commit time: each host should have at most one active tpu-name coscheduling group at any given moment. If two reservations claim overlapping hosts, fail the second with Scheduler: host already reserved instead of double-booking.

  3. Add a metric counting double-booked-host events per scheduling tick. Should be 0; the bug shows it can be >0.

  4. Audit TASK_STATE_BUILDING transitions: are tasks being re-placed onto contested hosts during recycle? If so, the recycler should also check uniqueness.

  5. Audit dispatch-queue race window: pending jobs that get dispatched in the same tick when capacity opens are exposed to this race even when their submissions were minutes/hours apart. The fix needs to apply at dispatch commit, not submission commit.
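
Action items 2 and 3 could take roughly the following shape. This is a sketch only: ReservationTable, HostAlreadyReserved and count_double_booked are hypothetical names, and a real fix would also need the check and the write to be atomic with respect to concurrent scheduler work.

```python
from collections import Counter
from typing import Dict, Iterable, List


class HostAlreadyReserved(RuntimeError):
    """Raised when a commit would double-book a host."""


class ReservationTable:
    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}  # host -> coscheduling group

    def commit(self, group: str, hosts: Iterable[str]) -> None:
        """All-or-nothing commit: fail if any host belongs to another group."""
        hosts = list(hosts)
        conflicts = sorted(h for h in hosts if self._owner.get(h, group) != group)
        if conflicts:
            raise HostAlreadyReserved(
                f"Scheduler: host already reserved: {conflicts}"
            )
        for h in hosts:
            self._owner[h] = group

    def release(self, group: str) -> None:
        self._owner = {h: g for h, g in self._owner.items() if g != group}


def count_double_booked(placements: Dict[str, List[str]]) -> int:
    """Metric for item 3: number of hosts claimed by more than one group."""
    claims = Counter(h for hosts in placements.values() for h in set(hosts))
    return sum(1 for n in claims.values() if n > 1)
```

Running the same check when a task is re-placed during recycle would cover item 4 as well.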

Database schema pointers (controller SQLite)

  • jobs — job lifecycle states (1=submitted/pending, 3=running, 4=succeeded, 5=failed, 6=cancelled)
  • job_config — submission spec including coscheduling_group_by, priority_band, res_device_json, constraints_json
  • tasks — task-level state per job
  • task_attempts — each recycle creates a new attempt; this is where the 1515/3131 preemption counts accumulate
  • task_resource_history — host bindings per attempt; might reveal the placement decision tree
  • worker_task_history — inverse: which tasks a worker has seen
  • dispatch_queue — pending dispatch decisions; race window candidate

Useful diagnostic queries

# Check for identical preemption + worker_failed across job pairs
uv run iris --controller-url=http://localhost:10000 --cluster=marin job summary <job-A>/train_lm
uv run iris --controller-url=http://localhost:10000 --cluster=marin job summary <job-B>/train_lm

# Confirm host-collision: extract advertise_host from bootstrap logs
uv run iris --controller-url=http://localhost:10000 --cluster=marin job logs <job-A>/train_lm \
  --max-lines 5000 2>&1 | grep -oE "advertise_host=10\.[0-9.]+" | sort -u > /tmp/A.hosts
uv run iris --controller-url=http://localhost:10000 --cluster=marin job logs <job-B>/train_lm \
  --max-lines 5000 2>&1 | grep -oE "advertise_host=10\.[0-9.]+" | sort -u > /tmp/B.hosts
diff /tmp/A.hosts /tmp/B.hosts
# Empty diff output (identical host sets) → placement collision confirmed

# Bug-report dump (more diagnostic detail)
uv run iris --controller-url=http://localhost:10000 --cluster=marin job bug-report <job>

-- Find pairs of failed jobs with matching preemptions that started within 10s of each other (collision suspects).
-- Assumption: preemptions and failures are readable as columns on jobs; adjust the
-- column source if the controller stores these counters elsewhere.
SELECT a.name AS job_a, b.name AS job_b, a.preemptions, a.failures
FROM jobs a JOIN jobs b ON a.parent_job_id = b.parent_job_id
WHERE a.state=5 AND b.state=5 AND a.job_id < b.job_id
  AND a.preemptions = b.preemptions
  AND ABS(a.started_at_ms - b.started_at_ms) < 10000
LIMIT 20;

-- Find dispatch ticks: starts grouped by 5-second buckets
SELECT
  started_at_ms / 5000 as tick_5s,
  count(*) as n_started,
  group_concat(name, ' | ') as jobs_started
FROM jobs
WHERE state IN (3,4,5) AND started_at_ms IS NOT NULL
GROUP BY tick_5s
HAVING n_started >= 2
ORDER BY tick_5s DESC LIMIT 20;
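
One caveat with fixed 5-second buckets is that a pair can straddle a bucket boundary. Incident D's two train_lm start times (4.9s apart) fall into adjacent buckets, so the bucketed query would not report them. The sketch below shows this against an in-memory SQLite table; the three-column jobs table is a hypothetical stand-in, not the controller schema, and the pairwise query is offered as one possible alternative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal stand-in for the controller's jobs table.
conn.execute("CREATE TABLE jobs (name TEXT, state INTEGER, started_at_ms INTEGER)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("p67m33-lr0.67/train_lm", 5, 1777897156854),  # incident D
        ("p33m67-lr0.83/train_lm", 5, 1777897161741),  # incident D, +4.9s
    ],
)

# Fixed buckets, as in the query above.
bucketed = conn.execute(
    """
    SELECT started_at_ms / 5000 AS tick_5s, count(*) AS n_started
    FROM jobs
    WHERE state IN (3,4,5) AND started_at_ms IS NOT NULL
    GROUP BY tick_5s
    HAVING n_started >= 2
    """
).fetchall()

# Pairwise window: any two jobs that started within 10s of each other.
pairs = conn.execute(
    """
    SELECT a.name, b.name, ABS(a.started_at_ms - b.started_at_ms) AS gap_ms
    FROM jobs a JOIN jobs b ON a.rowid < b.rowid
    WHERE a.state IN (3,4,5) AND b.state IN (3,4,5)
      AND ABS(a.started_at_ms - b.started_at_ms) < 10000
    ORDER BY gap_ms
    """
).fetchall()

print(bucketed)
print(pairs)
```

The pairwise self-join is quadratic in the number of jobs, so on a large table it would need a time-range filter first.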

Cost estimate

Incident                                      TPU       Wasted host-hours
A (2026-05-03 ~01:21 UTC)                     v5p-64    ~1.3
B (2026-05-03 ~01:38 UTC, ran 9.5h)           v5p-256   ~608
C (2026-05-03 ~01:41 UTC, mid-run failure)    v5p-256   tens
D (2026-05-04 12:19 UTC dispatch)             v5p-128   ~11

Combined ~1100+ v5p host-hours plus ~1 day of operator time on misdiagnosis (initially blamed on external priority-band-2 contention from another tenant; the placement collision is the actual root cause).

Related

  • Internal postmortem: .agents/ops/iris_placement_bug.md in the midtraining worktree (this issue is lifted from that postmortem).
  • [levanter] Avoid all-rank mirror checkpoint staging #5374 — MirrorFS / cross-region transfer issue (adjacent, same training jobs, unrelated root cause).
  • Resume-namespace bug compounded incident A: see .agents/ops/2026-05-02-delphi-midtrain-resume-namespace.md.

Addendum: affected job IDs

All jobs are under /ahmedah/ on the marin iris cluster.

Incident A — v5p-64, 2026-05-03 ~01:21 UTC

Two simultaneous submissions, both failed with preemptions=707, worker_failed=7/8 each:

  • delphi-1e21-p67m33-lr0p5-batch64-resume2-20260503-012053
  • delphi-1e21-p67m33-lr0p67-batch64-resume2-20260503-012053

Submission times within seconds of each other; bootstrap logs showed 8/8 task indices on identical hosts. Compounded by /moojink/ priority-band-2 batch contention on the same TPU pool, which contributed external preemption pressure but is not the root cause.

Earlier failed v5p-64 runs from the same session that almost certainly hit the same bug class:

  • delphi-1e21-p67m33-lr0p5-batch64-resume-20260502-151756
  • delphi-1e21-p67m33-lr0p67-batch64-resume-20260502-161841
  • 6 v5p-64 batch-band-3 jobs at 20260503-014225

Incident B — v5p-256, 2026-05-03 submitted ~01:38 UTC, ran 09:14–18:43 UTC

Three p50m50 jobs in a tight ~4.7s submission window. Oldest survived; younger two collided.

Submission timestamps (epoch ms):

lr0.33 submitted: ~1777772314000  (oldest, +0s)
lr0.5  submitted: 1777772316424   (+2.4s)
lr0.67 submitted: 1777772318732   (+4.7s)

train_lm dispatch:

lr0.5  train_lm started: 1777799671408  (09:14:31 UTC)
lr0.67 train_lm started: 1777799674124  (09:14:34 UTC)  → +2.7s gap

End-state metrics on the two failed jobs (failed mid-run after 9.5h of clean uptime):

Metric                   lr0.5     lr0.67
state                    failed    failed
failures                 1         1
preemptions              3131      3131
worker_failed            31/32     31/32
train_lm wall lifetime   569 min   569 min

Bootstrap log diff: 31/31 hosts identical between lr0.5 and lr0.67.

Failed:

  • delphi-1e22-p50m50-lr0p5-batch256-20260503-013817
  • delphi-1e22-p50m50-lr0p67-batch256-20260503-013817

Survived (control, ahead by ~2.4s):

  • delphi-1e22-p50m50-lr0p33-batch256-20260503-013817 (preemptions=1, worker_failed=0)

Other v5p-256 1e22 jobs from the same batch were each submitted ≥1 minute after the colliding pair and started cleanly, which is consistent with the bug being specific to the simultaneous-dispatch window (the last two below were submitted ~2.6s apart from each other; see incident C):

  • delphi-1e22-p50m50-lr0p83-batch256-20260503-013949
  • delphi-1e22-p67m33-lr0p83-batch256-20260503-014108
  • delphi-1e22-p33m67-lr0p83-batch256-20260503-014108 (this one later failed — see incident C)

Incident C — v5p-256, 2026-05-03 ~01:41 UTC submission

Two jobs submitted within ~2.6 seconds. One survived; the other collided with the survivor (or with the lr0.5/lr0.67 jobs from incident B which were still in their thrash phase — exact pairing unclear).

Submission timestamps (epoch ms):

p67m33-lr0.83 submitted: 1777772482016  (control, +0s)
p33m67-lr0.83 submitted: 1777772484621  (+2.6s)

End-state metrics:

Metric          p33m67-lr0.83 (failed)   p67m33-lr0.83 (control)
state           failed                   succeeded
failures        1                        0
preemptions     3131                     (low)
worker_failed   31/32                    0/32

The preemptions=3131 and worker_failed=31 values are identical to those of incident B's lr0.5/lr0.67 pair.

Failed:

  • delphi-1e22-p33m67-lr0p83-batch256-20260503-014108 (last permanent checkpoint at step-6112 in gs://marin-us-east5/checkpoints/delphi-1e22-p33m67-32p07b-lr0.83-78fd44/checkpoints/; salvageable via resume)

Survived (control):

  • delphi-1e22-p67m33-lr0p83-batch256-20260503-014108

Incident D — v5p-128, dispatch-time collision (2026-05-04 12:19 UTC)

The dispatch-time-collision variant. Two jobs submitted 3h47m apart but iris dispatched both within ~5s of each other when v5p-128 capacity finally opened.

Submission timestamps (epoch ms):

p67m33-lr0.67 resume2 submitted: 1777824878720  (16:14:38 UTC May 3)
p33m67-lr0.83          submitted: 1777838540690  (20:02:20 UTC May 3)
                                                  → 3h 47m gap at submission

train_lm dispatch:

p67m33-lr0.67 train_lm started: 1777897156854  (12:19:16 UTC May 4)
p33m67-lr0.83 train_lm started: 1777897161741  (12:19:21 UTC May 4)
                                                → 4.9s gap at dispatch
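
The timestamps above can be cross-checked with a few lines of Python. The helper names are illustrative; the epoch values are the ones listed for incident D.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def gap(ms_a: int, ms_b: int) -> timedelta:
    """Absolute gap between two epoch-millisecond timestamps."""
    return timedelta(milliseconds=abs(ms_b - ms_a))


submitted = {"p67m33-lr0.67 resume2": 1777824878720, "p33m67-lr0.83": 1777838540690}
started = {"p67m33-lr0.67 resume2": 1777897156854, "p33m67-lr0.83": 1777897161741}

for name in submitted:
    print(name, ms_to_utc(submitted[name]).isoformat(), ms_to_utc(started[name]).isoformat())

print("submission gap:", gap(*submitted.values()))
print("dispatch gap:  ", gap(*started.values()))
```

The submission gap comes out at 3h 47m and the dispatch gap at about 4.9s, matching the figures quoted for this incident.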

End-state metrics:

Metric                   p67m33-lr0.67 resume2   p33m67-lr0.83
state                    failed                  failed
failures                 1                       1
preemptions              1515                    1515
worker_failed            15/16                   15/16
train_lm wall lifetime   21 min                  37 min

The 1515 number is consistent with a 16-task gang on v5p-128 going through ~95 thrash cycles per task (1515/16≈95), a ratio similar to v5p-64 (707/8≈88) and v5p-256 (3131/32≈98).

Failed:

  • delphi-1e21-p67m33-lr0p67-batch128-resume2-20260503-161414 (was attempting to resume from gs://marin-us-east5/checkpoints/delphi-1e21-p67m33-9p25b-lr0.67-ecbd27/, latest perm step-2646)
  • delphi-1e21-p33m67-lr0p83-batch128-20260503-200203 (no checkpoint saved; would need fresh run from base — namespace gs://marin-us-east5/checkpoints/delphi-1e21-p33m67-9p25b-lr0.83-0cb048/ with empty checkpoints/)

🤖 Filed by Claude Code on behalf of @ahmeda14960 . Source: .agents/ops/iris_placement_bug.md in the midtrain_data worktree.
