[iris] Coscheduler places concurrently-dispatched gangs on identical host sets → JAX coordinator port death loop #5470

## TL;DR

When two coscheduled multi-host TPU jobs are dispatched on the same iris cluster within ~2–10 seconds of each other, requesting the same `(TPU variant, region, priority band)`, **the iris allocator can place both jobs on the exact same physical worker hosts.** Both jobs then race on the JAX distributed coordinator port (`8476`), each preempting/restarting the other repeatedly. The tasks accumulate hundreds to thousands of preemption cycles, never produce stable training, and eventually iris terminates them with `failures=1, preemptions=N (large), worker_failed=~all`.

This is **not flakiness** — it is a deterministic placement collision tied to simultaneous dispatch. Confirmed **4 times in 36 hours** (2026-05-03 → 2026-05-04), on v5p-64, v5p-128, and v5p-256, with identical signatures. Estimated cost so far: ~**1100 v5p host-hours of wasted compute** plus ~1 day of operator misdiagnosis time.

The race window is at **dispatch time**, not just submission time. Two jobs submitted hours apart can still collide if their dispatch decisions land in the same iris scheduler tick when capacity opens (incident D below).

## Failure mechanics — why identical placement is fatal

Each `train_lm` task on a TPU host expects to bind/reach `port 8476` (`endpoint_name=jax_coordinator`) on the worker VM. The full bootstrap line:

```
iris.runtime.jax_init initialize_jax bootstrap inputs:
  task_index=0 num_tasks=32 advertise_host=10.202.0.195
  ports={} endpoint_name=jax_coordinator requested_port=8476
  env={'IRIS_TASK_ID': '/ahmedah/.../train_lm/0:N', 'IRIS_NUM_TASKS': '32'}
```

When two jobs each bring up a task at the same host:port, exactly one binds; the other sees the port busy or sees an iris service-registry conflict on `endpoint_name=jax_coordinator`. iris recycles the loser (`preemptions++`). The recycled task starts up, sees the same conflict, dies again. Meanwhile peer tasks of both jobs are polling `_poll_for_coordinator` (`lib/iris/src/iris/runtime/jax_init.py:172`) with a 300s timeout. Some hit `TimeoutError`, marking their task as `worker_failed`. Eventually `failures=1` propagates to the parent → terminal failure for both jobs.
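
The bind conflict at the core of this loop can be reproduced in isolation. The following is a minimal sketch, assuming plain TCP sockets on localhost stand in for the two coordinator tasks. The helper `try_bind` is hypothetical (it is not iris or JAX code), and an OS-assigned port replaces `8476` so the sketch runs anywhere:

```python
import errno
import socket
from typing import Optional


def try_bind(host: str, port: int) -> Optional[socket.socket]:
    """Return a listening socket, or None if the address is already in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


# "Job A" wins the port; port 0 asks the OS for any free port.
job_a = try_bind("127.0.0.1", 0)
port = job_a.getsockname()[1]

# "Job B" lands on the same host and asks for the same port.
job_b = try_bind("127.0.0.1", port)
second_bind_failed = job_b is None

job_a.close()
```

As long as job A holds the port, every retry by job B fails the same way, which matches the repeated recycle pattern described above.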

Symptoms in logs:

```
ConnectError: Stale attempt: task .../train_lm/0 attempt 12 != current 14
TimeoutError: Timed out after 300.0s waiting for coordinator endpoint 'jax_coordinator'
RuntimeError: Build failed with exit_code=2  (cascade-induced, not root cause)
absl::Status: INVALID_ARGUMENT: Unexpected task registered with task_name=/job:/replica:0/task:7
absl::Status: ALREADY_EXISTS: Aborted connect attempt as there is a request from a newer incarnation
```

In incident B these errors appeared **mid-run after 9.5 hours of uptime**, not at startup. The collision is latent — both jobs may run nominally for hours, then a single host-level transient (e.g. one task's setup retry, a brief preemption) triggers the dual-coordinator race and the system never recovers because both jobs keep trying to bind the same port.

## Reproducer signature for future detection

A failed run is a likely placement-collision victim if **all** of these hold:

1. **Identical metrics on two failed jobs**: `preemptions` and `worker_failed` match exactly between the two jobs.
2. **`train_lm` dispatch times within ≤10 seconds** of each other (NOT submission time — jobs queued for hours can still collide at dispatch).
3. **Same TPU variant + region + priority band** in their `job_config` rows.
4. **Same iris allocator scheduler tick** (verified by close `started_at_ms` on the `train_lm` child).
5. **Bootstrap-log host comparison** shows shared `advertise_host` IPs across task indices (gold-standard confirmation when logs are still available).
6. **No higher-priority external user** active on that pool (rules out preemption-eviction as the explanation; this happened on v5p-256 where no other tenant had jobs).
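
The checklist above can be turned into a small triage helper. This is only a sketch: `TrainLmSummary` and `is_collision_suspect` are hypothetical names, the fields are illustrative rather than an iris schema, and check 6 (no higher-priority tenant on the pool) is left out because it needs pool-wide data. The example uses the incident D metrics and dispatch timestamps; the region and priority-band strings are placeholders.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class TrainLmSummary:
    """Hypothetical record; field names are illustrative, not an iris schema."""
    preemptions: int
    worker_failed: int
    started_at_ms: int
    tpu_variant: str
    region: str
    priority_band: str
    hosts: Optional[FrozenSet[str]] = None  # advertise_host IPs, if logs survive


def is_collision_suspect(a: TrainLmSummary, b: TrainLmSummary,
                         max_gap_ms: int = 10_000) -> bool:
    """Apply checks 1-5 from the list above; check 6 needs pool-wide data."""
    same_metrics = (a.preemptions, a.worker_failed) == (b.preemptions, b.worker_failed)
    close_dispatch = abs(a.started_at_ms - b.started_at_ms) <= max_gap_ms
    same_pool = ((a.tpu_variant, a.region, a.priority_band)
                 == (b.tpu_variant, b.region, b.priority_band))
    # Shared hosts are only checked when both host sets are known.
    hosts_known = a.hosts is not None and b.hosts is not None
    hosts_overlap = bool(a.hosts & b.hosts) if hosts_known else True
    return same_metrics and close_dispatch and same_pool and hosts_overlap


# Incident D metrics and dispatch times; region and band are placeholders.
job_1 = TrainLmSummary(1515, 15, 1777897156854, "v5p-128", "region-x", "band-x")
job_2 = TrainLmSummary(1515, 15, 1777897161741, "v5p-128", "region-x", "band-x")
spaced = replace(job_2, started_at_ms=job_1.started_at_ms + 60_000)

print(is_collision_suspect(job_1, job_2))   # dispatch gap is 4887 ms
print(is_collision_suspect(job_1, spaced))  # dispatch gap is 60000 ms
```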

`preemptions` count appears to scale with gang size, at roughly 88–98 cycles per task before iris terminates the job:

| Gang size | TPU | Preemption count |
|---|---|---|
| 8 tasks | v5p-64 | 707 |
| 16 tasks | v5p-128 | 1515 |
| 32 tasks | v5p-256 | 3131 |

The 1515/3131 numbers are deterministic across multiple incidents (B, C have identical 3131; D has 1515 on both jobs). This may relate to `max_retries_preemption=1000` in `job_config`, but the actual termination heuristic appears to differ: the observed totals do not match that cap (707 falls below it; 1515 and 3131 exceed it by a few hundred to a few thousand cycles).
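
The per-task ratios follow directly from the table above. The short calculation below also checks one arithmetic regularity in those counts: each equals `(tasks - 1) * 101`, and `tasks - 1` is the `worker_failed` numerator reported for each gang size (7/8, 15/16, 31/32). What the factor 101 corresponds to inside iris is not established here; treat it as an observation, not an explanation.

```python
# (gang size in tasks, terminal preemption count) from the table above
observed = {"v5p-64": (8, 707), "v5p-128": (16, 1515), "v5p-256": (32, 3131)}

per_task = {tpu: count / tasks for tpu, (tasks, count) in observed.items()}
for tpu, ratio in per_task.items():
    print(f"{tpu}: {ratio:.1f} preemptions per task")

# Every observed count is exactly (tasks - 1) * 101.
fits_pattern = all(count == (tasks - 1) * 101 for tasks, count in observed.values())
print(fits_pattern)
```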

## Evidence — bootstrap log host comparison

The cleanest proof of the collision (incident B, v5p-256):

```
diff -q <(grep advertise_host lr0.5_log | sort -u) \
        <(grep advertise_host lr0.67_log | sort -u)
→ "IDENTICAL host sets" (31/31 hosts shared)
```

Verbatim host assignments for the two failed jobs:

```
lr0.5  task 0  advertise_host=10.202.0.195    lr0.67 task 0  advertise_host=10.202.0.195
lr0.5  task 10 advertise_host=10.202.0.219    lr0.67 task 10 advertise_host=10.202.0.219
lr0.5  task 11 advertise_host=10.202.1.5      lr0.67 task 11 advertise_host=10.202.1.5
lr0.5  task 12 advertise_host=10.202.0.185    lr0.67 task 12 advertise_host=10.202.0.185
lr0.5  task 13 advertise_host=10.202.0.222    lr0.67 task 13 advertise_host=10.202.0.222
lr0.5  task 14 advertise_host=10.202.1.85     lr0.67 task 14 advertise_host=10.202.1.85
lr0.5  task 15 advertise_host=10.202.1.1      lr0.67 task 15 advertise_host=10.202.1.1
lr0.5  task 16 advertise_host=10.202.1.82     lr0.67 task 16 advertise_host=10.202.1.82
lr0.5  task 17 advertise_host=10.202.1.75     lr0.67 task 17 advertise_host=10.202.1.75
lr0.5  task 18 advertise_host=10.202.0.247    lr0.67 task 18 advertise_host=10.202.0.247
```

10/10 sampled task indices on the same physical hosts. Full set diff confirmed 31/31 match.
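
The same comparison can be done per task index rather than as a set diff, which also shows that matching indices landed on matching hosts. A minimal sketch, assuming each bootstrap line carries `task_index=` and `advertise_host=` on one line as in the sample shown earlier; `hosts_by_task` and `shared_task_indices` are hypothetical helper names, and the two log strings are rebuilt from the host assignments listed above:

```python
import re
from typing import Dict, List

BOOTSTRAP_RE = re.compile(r"task_index=(\d+)\b.*?\badvertise_host=(\S+)")


def hosts_by_task(log_text: str) -> Dict[int, str]:
    """Map task index -> advertise_host for every bootstrap line in a log."""
    mapping: Dict[int, str] = {}
    for line in log_text.splitlines():
        match = BOOTSTRAP_RE.search(line)
        if match:
            mapping[int(match.group(1))] = match.group(2)
    return mapping


def shared_task_indices(log_a: str, log_b: str) -> List[int]:
    """Task indices that both jobs placed on the same host."""
    a, b = hosts_by_task(log_a), hosts_by_task(log_b)
    return sorted(idx for idx in a.keys() & b.keys() if a[idx] == b[idx])


lr05_log = """\
task_index=0 num_tasks=32 advertise_host=10.202.0.195
task_index=10 num_tasks=32 advertise_host=10.202.0.219
"""
lr067_log = """\
task_index=0 num_tasks=32 advertise_host=10.202.0.195
task_index=10 num_tasks=32 advertise_host=10.202.0.219
"""

print(shared_task_indices(lr05_log, lr067_log))
```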

## Why a third co-submitted job survived (incident B)

Best hypothesis without iris allocator source code:

- iris commits placement decisions FIFO. The `lr0.33` job in incident B was ~2.4s ahead of `lr0.5` on submission, so its placement was already committed before the `lr0.5` / `lr0.67` race.
- When `lr0.5` and `lr0.67` entered the scheduler within 2.3s of each other, iris's allocator may not have yet flushed `lr0.5`'s reservation when it processed `lr0.67`'s request, allowing both to claim the same host set.

This is consistent with the survivor pattern (three near-simultaneous submissions, the oldest survives, the last two collide) and with the general shape of scheduler race conditions.
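
The hypothesized race can be illustrated with a toy model. This is not iris allocator code: `place_from_stale_snapshot` and `place_with_commit` are hypothetical functions, and the six-host pool and gang size of 2 are made up to keep the example small. The first function plans every request against the same pre-commit view of free hosts; the second removes hosts from the free list before planning the next request.

```python
from typing import Dict, List, Optional, Sequence


def place_from_stale_snapshot(free_hosts: Sequence[str], requests: Sequence[str],
                              gang_size: int) -> Dict[str, List[str]]:
    """Racy: every request is planned against the same view of free hosts."""
    snapshot = sorted(free_hosts)
    return {job: snapshot[:gang_size] for job in requests}


def place_with_commit(free_hosts: Sequence[str], requests: Sequence[str],
                      gang_size: int) -> Dict[str, Optional[List[str]]]:
    """Safe: each placement is committed before the next request is planned."""
    free = sorted(free_hosts)
    placements: Dict[str, Optional[List[str]]] = {}
    for job in requests:
        if len(free) < gang_size:
            placements[job] = None  # not enough capacity; job stays pending
            continue
        placements[job], free = free[:gang_size], free[gang_size:]
    return placements


pool = [f"host-{i}" for i in range(6)]

# lr0.33 arrives first and is fully committed.
first = place_with_commit(pool, ["lr0.33"], gang_size=2)
remaining = [h for h in pool if h not in first["lr0.33"]]

# lr0.5 and lr0.67 arrive together.
racy = place_from_stale_snapshot(remaining, ["lr0.5", "lr0.67"], gang_size=2)
safe = place_with_commit(remaining, ["lr0.5", "lr0.67"], gang_size=2)

print(racy)
print(safe)
```

In the racy variant both later jobs receive the same two hosts while the first job is untouched, which reproduces the "oldest survives, last two collide" shape under these assumptions.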

## Mitigations / workarounds

Until the allocator is fixed, the operational workaround is to **serialize coscheduled submissions** that share the same `(variant, region, priority)` tuple:

1. Submit job A.
2. Wait for A's `train_lm` child to be in `running` state (`iris job summary <job>/train_lm` shows `state: running` and `preemptions=0` or `1`).
3. Submit job B.
4. Repeat per job.
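
The steps above can be sketched as a small wrapper. This is an assumption-heavy outline, not an iris client: `submit` and `get_state` are caller-supplied callables (for example thin wrappers around the CLI), `submit_serialized` is a hypothetical name, and the sketch only checks for the `running` state, not the `preemptions` count. The demo uses fakes so it runs without a cluster.

```python
import time
from typing import Callable, Dict, Iterable, List


def submit_serialized(jobs: Iterable[str],
                      submit: Callable[[str], None],
                      get_state: Callable[[str], str],
                      poll_s: float = 30.0,
                      timeout_s: float = 3600.0,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Submit jobs one at a time, waiting for each to reach 'running'."""
    started: List[str] = []
    for job in jobs:
        submit(job)
        deadline = time.monotonic() + timeout_s
        while get_state(job) != "running":
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{job} did not reach 'running' in {timeout_s}s")
            sleep(poll_s)
        started.append(job)
    return started


# Fakes for demonstration: each job reports 'pending' twice, then 'running'.
events: List[str] = []
polls: Dict[str, int] = {}


def fake_submit(job: str) -> None:
    events.append(f"submit {job}")


def fake_get_state(job: str) -> str:
    polls[job] = polls.get(job, 0) + 1
    state = "running" if polls[job] >= 3 else "pending"
    events.append(f"{job} {state}")
    return state


order = submit_serialized(["job-A", "job-B"], fake_submit, fake_get_state,
                          sleep=lambda seconds: None)
print(events)
```

Because the wrapper waits on dispatch state and not on elapsed time, it also covers jobs that sit in the pending queue before they start.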

Alternatives:

- **Submit at different priority bands** so iris's allocator doesn't race them through the same scheduler tick.
- **Pin to different zones** with `--zone us-east5-a` vs `--zone us-east5-b` (only works if the TPU pool has multiple zones with capacity).
- **Submit with delay**: a wrapper that introduces a 60s sleep between consecutive submissions in the same TPU group. (Empirically, submissions ≥1 minute apart in incident B did not collide.)

Important: for **pending-queue scenarios** (jobs queued behind capacity), submission-time spacing is **not sufficient** — incident D shows two jobs submitted 3h47m apart still collided when iris finally dispatched both within 5s of each other.

## Action items / open questions

1. **Investigate iris allocator placement logic** for races between concurrent coscheduled gang reservations. Suspected files:
   - `lib/iris/src/iris/scheduler/` — coscheduling and placement logic
   - `lib/iris/src/iris/cli/job.py` — submission flow
   - Search for `coscheduling_group_by`, `tpu-name` constraint handling, host reservation commit semantics

2. **Add a placement-uniqueness assertion at coschedule commit time**: each host should have at most one active `tpu-name` coscheduling group at any given moment. If two reservations claim overlapping hosts, fail the second with `Scheduler: host already reserved` instead of double-booking.

3. **Add a metric** counting `double-booked-host` events per scheduling tick. Should be 0; the bug shows it can be >0.

4. **Audit `TASK_STATE_BUILDING` transitions**: are tasks being re-placed onto contested hosts during recycle? If so, the recycler should also check uniqueness.

5. **Audit dispatch-queue race window**: pending jobs that get dispatched in the same tick when capacity opens are exposed to this race even when their submissions were minutes/hours apart. The fix needs to apply at dispatch commit, not submission commit.
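
Action items 2 and 3 could take roughly the following shape. This is a sketch under stated assumptions, not a patch against iris: `HostReservations` and `HostAlreadyReserved` are hypothetical names, a single in-process lock stands in for whatever commit mechanism the scheduler really uses, and the counter tracks rejected double-booking attempts.

```python
import threading
from typing import Dict, Iterable


class HostAlreadyReserved(RuntimeError):
    """Raised when a reservation would double-book a host."""


class HostReservations:
    """Check-and-commit under one lock so overlapping claims cannot both win."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owner_by_host: Dict[str, str] = {}
        self.rejected_double_bookings = 0  # should stay 0 in a healthy scheduler

    def commit(self, group: str, hosts: Iterable[str]) -> None:
        wanted = list(hosts)
        with self._lock:
            conflicts = sorted(h for h in wanted
                               if self._owner_by_host.get(h, group) != group)
            if conflicts:
                self.rejected_double_bookings += 1
                raise HostAlreadyReserved(
                    f"Scheduler: host already reserved: {conflicts}")
            for host in wanted:
                self._owner_by_host[host] = group

    def release(self, group: str) -> None:
        with self._lock:
            self._owner_by_host = {h: g for h, g in self._owner_by_host.items()
                                   if g != group}


reservations = HostReservations()
hosts = ["10.202.0.195", "10.202.0.219"]
reservations.commit("lr0.5", hosts)

try:
    reservations.commit("lr0.67", hosts)
    second_commit_error = None
except HostAlreadyReserved as exc:
    second_commit_error = str(exc)

print(second_commit_error)
print(reservations.rejected_double_bookings)
```

Failing the second reservation leaves that job pending instead of letting both gangs start on the same hosts.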

## Database schema pointers (controller SQLite)

- `jobs` — job lifecycle states (1=submitted/pending, 3=running, 4=succeeded, 5=failed, 6=cancelled)
- `job_config` — submission spec including `coscheduling_group_by`, `priority_band`, `res_device_json`, `constraints_json`
- `tasks` — task-level state per job
- `task_attempts` — each recycle creates a new attempt; this is where the 1515/3131 preemption counts accumulate
- `task_resource_history` — host bindings per attempt; might reveal the placement decision tree
- `worker_task_history` — inverse: which tasks a worker has seen
- `dispatch_queue` — pending dispatch decisions; race window candidate

## Useful diagnostic queries

```bash
# Check for identical preemption + worker_failed across job pairs
uv run iris --controller-url=http://localhost:10000 --cluster=marin job summary <job-A>/train_lm
uv run iris --controller-url=http://localhost:10000 --cluster=marin job summary <job-B>/train_lm

# Confirm host-collision: extract advertise_host from bootstrap logs
uv run iris --controller-url=http://localhost:10000 --cluster=marin job logs <job-A>/train_lm \
  --max-lines 5000 2>&1 | grep -oE "advertise_host=10\.[0-9.]+" | sort -u > /tmp/A.hosts
uv run iris --controller-url=http://localhost:10000 --cluster=marin job logs <job-B>/train_lm \
  --max-lines 5000 2>&1 | grep -oE "advertise_host=10\.[0-9.]+" | sort -u > /tmp/B.hosts
diff /tmp/A.hosts /tmp/B.hosts
# Empty diff output (exit status 0) → identical host sets → placement collision confirmed

# Bug-report dump (more diagnostic detail)
uv run iris --controller-url=http://localhost:10000 --cluster=marin job bug-report <job>
```

```sql
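-- Find pairs of failed jobs that started within 10 s of each other (collision suspects).
-- The join only pairs jobs that share a parent; compare preemptions and
-- worker_failed for each pair with `iris job summary`.
SELECT a.name AS job_a, b.name AS job_b,
       a.started_at_ms AS a_started_at_ms, b.started_at_ms AS b_started_at_ms
FROM jobs a JOIN jobs b ON a.parent_job_id = b.parent_job_id
WHERE a.state=5 AND b.state=5 AND a.job_id < b.job_id
  AND ABS(a.started_at_ms - b.started_at_ms) < 10000
LIMIT 20;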

-- Find dispatch ticks: starts grouped by fixed 5-second buckets
-- (a pair that straddles a bucket boundary lands in two buckets and is not reported)
SELECT
  started_at_ms / 5000 as tick_5s,
  count(*) as n_started,
  group_concat(name, ' | ') as jobs_started
FROM jobs
WHERE state IN (3,4,5) AND started_at_ms IS NOT NULL
GROUP BY tick_5s
HAVING n_started >= 2
ORDER BY tick_5s DESC LIMIT 20;
```
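
Fixed 5-second buckets can miss a colliding pair when the two start times fall on either side of a bucket boundary, and the incident D dispatch timestamps are one such case. The sketch below shows this with Python's built-in `sqlite3` module and an in-memory table. The three-column `jobs` table is a reduced stand-in that uses only column names from the queries above; it is not the controller's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (name TEXT, state INTEGER, started_at_ms INTEGER)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("p67m33-lr0.67", 5, 1777897156854),  # incident D dispatch times
        ("p33m67-lr0.83", 5, 1777897161741),
    ],
)

# Fixed 5-second buckets: the two starts land in adjacent buckets.
bucketed = conn.execute(
    """
    SELECT started_at_ms / 5000 AS tick_5s, count(*) AS n_started
    FROM jobs
    GROUP BY tick_5s
    HAVING n_started >= 2
    """
).fetchall()

# Pairwise comparison: any two starts closer than 10 s, regardless of bucket.
pairwise = conn.execute(
    """
    SELECT a.name, b.name, ABS(a.started_at_ms - b.started_at_ms) AS gap_ms
    FROM jobs a JOIN jobs b ON a.rowid < b.rowid
    WHERE ABS(a.started_at_ms - b.started_at_ms) < 10000
    """
).fetchall()
conn.close()

print(bucketed)
print(pairwise)
```

The pairwise form is quadratic in the number of rows, so on a large table it would need a time-range filter first.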

## Cost estimate

| Incident | TPU | Wasted host-hours |
|---|---|---|
| A (2026-05-03 ~01:21 UTC) | v5p-64 | ~1.3 |
| B (2026-05-03 ~01:38 UTC, ran 9.5h thrash) | v5p-256 | ~608 |
| C (2026-05-03 ~01:41 UTC, mid-run failure) | v5p-256 | tens |
| D (2026-05-04 12:19 UTC dispatch) | v5p-128 | ~11 |

**Combined ~1100+ v5p host-hours** plus ~1 day of operator time on misdiagnosis (initially blamed on external priority-band-2 contention from another tenant; the placement collision is the actual root cause).
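
The issue does not state how the per-incident figures were derived. One reading that reproduces the incident B number is jobs × hosts per job × hours of thrash, with 32 hosts per v5p-256 job (one per task) and 9.5 hours; this is an assumption, and `wasted_host_hours` is a hypothetical helper:

```python
def wasted_host_hours(jobs: int, hosts_per_job: int, hours: float) -> float:
    """Host-hours consumed by jobs that produced no usable training."""
    return jobs * hosts_per_job * hours


# Incident B: two colliding jobs, 32 hosts each, 9.5 hours.
incident_b = wasted_host_hours(jobs=2, hosts_per_job=32, hours=9.5)
print(incident_b)
```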

## Related

- Internal postmortem: `.agents/ops/iris_placement_bug.md` in the midtraining worktree (this issue is lifted from that postmortem).
- #5374 — MirrorFS / cross-region transfer issue (adjacent, same training jobs, unrelated root cause).
- Resume-namespace bug compounded incident A: see `.agents/ops/2026-05-02-delphi-midtrain-resume-namespace.md`.

---

# Addendum: affected job IDs

All jobs are under `/ahmedah/` on the `marin` iris cluster.

### Incident A — v5p-64, 2026-05-03 ~01:21 UTC

Two simultaneous submissions, both failed with `preemptions=707, worker_failed=7/8` each:

- `delphi-1e21-p67m33-lr0p5-batch64-resume2-20260503-012053`
- `delphi-1e21-p67m33-lr0p67-batch64-resume2-20260503-012053`

Submission times within seconds of each other; bootstrap logs showed 8/8 task indices on identical hosts. Compounded by `/moojink/` priority-band-2 batch contention on the same TPU pool, which contributed external preemption pressure but is **not** the root cause.

Earlier failed v5p-64 runs from the same session that almost certainly hit the same bug class:
- `delphi-1e21-p67m33-lr0p5-batch64-resume-20260502-151756`
- `delphi-1e21-p67m33-lr0p67-batch64-resume-20260502-161841`
- 6 v5p-64 batch-band-3 jobs at `20260503-014225`

### Incident B — v5p-256, 2026-05-03 submitted ~01:38 UTC, ran 09:14–18:43 UTC

Three p50m50 jobs in a tight ~4.7s submission window. Oldest survived; younger two collided.

Submission timestamps (epoch ms):
```
lr0.33 submitted: ~1777772314000  (oldest, +0s)
lr0.5  submitted: 1777772316424   (+2.4s)
lr0.67 submitted: 1777772318732   (+4.7s)
```

`train_lm` dispatch:
```
lr0.5  train_lm started: 1777799671408  (09:14:31 UTC)
lr0.67 train_lm started: 1777799674124  (09:14:34 UTC)  → +2.7s gap
```

End-state metrics on the two failed jobs (failed mid-run after 9.5h of clean uptime):

| Metric | lr0.5 | lr0.67 |
|---|---|---|
| state | failed | failed |
| failures | 1 | 1 |
| preemptions | 3131 | 3131 |
| worker_failed | 31/32 | 31/32 |
| `train_lm` wall lifetime | 569 min | 569 min |

Bootstrap log diff: 31/31 hosts identical between lr0.5 and lr0.67.

Failed:
- `delphi-1e22-p50m50-lr0p5-batch256-20260503-013817`
- `delphi-1e22-p50m50-lr0p67-batch256-20260503-013817`

Survived (control, ahead by ~2.4s):
- `delphi-1e22-p50m50-lr0p33-batch256-20260503-013817` — `preemptions=1, worker_failed=0`

Other v5p-256 1e22 jobs from the same batch were submitted ≥1 minute after the incident B trio and initially ran fine, which is consistent with the bug being specific to the simultaneous-dispatch window:
- `delphi-1e22-p50m50-lr0p83-batch256-20260503-013949`
- `delphi-1e22-p67m33-lr0p83-batch256-20260503-014108`
- `delphi-1e22-p33m67-lr0p83-batch256-20260503-014108` (this one later failed — see incident C)

### Incident C — v5p-256, 2026-05-03 ~01:41 UTC submission

Two jobs submitted within ~2.6 seconds. One survived; the other collided with the survivor (or with the lr0.5/lr0.67 jobs from incident B which were still in their thrash phase — exact pairing unclear).

Submission timestamps (epoch ms):
```
p67m33-lr0.83 submitted: 1777772482016  (control, +0s)
p33m67-lr0.83 submitted: 1777772484621  (+2.6s)
```

End-state metrics:

| Metric | p33m67-lr0.83 (failed) | p67m33-lr0.83 (control) |
|---|---|---|
| state | failed | succeeded |
| failures | 1 | 0 |
| preemptions | 3131 | (low) |
| worker_failed | 31/32 | 0/32 |

Identical `preemptions=3131, worker_failed=31` to incident B's lr0.5/lr0.67 case.

Failed:
- `delphi-1e22-p33m67-lr0p83-batch256-20260503-014108` (last permanent checkpoint at `step-6112` in `gs://marin-us-east5/checkpoints/delphi-1e22-p33m67-32p07b-lr0.83-78fd44/checkpoints/`; salvageable via resume)

Survived (control):
- `delphi-1e22-p67m33-lr0p83-batch256-20260503-014108`

### Incident D — v5p-128, dispatch-time collision (2026-05-04 12:19 UTC)

**The dispatch-time-collision variant.** Two jobs submitted **3h47m apart** but iris dispatched both within ~5s of each other when v5p-128 capacity finally opened.

Submission timestamps (epoch ms):
```
p67m33-lr0.67 resume2 submitted: 1777824878720  (16:14:38 UTC May 3)
p33m67-lr0.83          submitted: 1777838540690  (20:02:20 UTC May 3)
                                                  → 3h 47m gap at submission
```

`train_lm` dispatch:
```
p67m33-lr0.67 train_lm started: 1777897156854  (12:19:16 UTC May 4)
p33m67-lr0.83 train_lm started: 1777897161741  (12:19:21 UTC May 4)
                                                → 4.9s gap at dispatch
```
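
The two gaps can be recomputed from the epoch-millisecond values above. A minimal sketch using only the standard library; `utc` is a hypothetical helper that converts epoch milliseconds to a timezone-aware datetime:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def utc(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime without float rounding."""
    return EPOCH + timedelta(milliseconds=epoch_ms)


submitted = {"p67m33-lr0.67": 1777824878720, "p33m67-lr0.83": 1777838540690}
started = {"p67m33-lr0.67": 1777897156854, "p33m67-lr0.83": 1777897161741}

submit_gap_ms = submitted["p33m67-lr0.83"] - submitted["p67m33-lr0.67"]
dispatch_gap_ms = started["p33m67-lr0.83"] - started["p67m33-lr0.67"]

print(timedelta(milliseconds=submit_gap_ms))    # gap at submission
print(timedelta(milliseconds=dispatch_gap_ms))  # gap at dispatch
print(utc(started["p67m33-lr0.67"]).strftime("%Y-%m-%d %H:%M:%S UTC"))
```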

End-state metrics:

| Metric | p67m33-lr0.67 resume2 | p33m67-lr0.83 |
|---|---|---|
| state | failed | failed |
| failures | 1 | 1 |
| preemptions | 1515 | 1515 |
| worker_failed | 15/16 | 15/16 |
| `train_lm` wall lifetime | 21 min | 37 min |

The 1515 number is consistent with a 16-task gang on v5p-128 going through ~95 thrash cycles per task (1515/16≈95), in line with v5p-64 (707/8≈88) and v5p-256 (3131/32≈98).

Failed:
- `delphi-1e21-p67m33-lr0p67-batch128-resume2-20260503-161414` (was attempting to resume from `gs://marin-us-east5/checkpoints/delphi-1e21-p67m33-9p25b-lr0.67-ecbd27/`, latest perm step-2646)
- `delphi-1e21-p33m67-lr0p83-batch128-20260503-200203` (no checkpoint saved; would need fresh run from base — namespace `gs://marin-us-east5/checkpoints/delphi-1e21-p33m67-9p25b-lr0.83-0cb048/` with empty `checkpoints/`)

🤖 Filed by Claude Code on behalf of @ahmeda14960. Source: `.agents/ops/iris_placement_bug.md` in the `midtrain_data` worktree.
