Environment
- 4x Mac Studio (Apple M3 Ultra, 256GB unified memory each)
- macOS 26.3 (Build 25D125)
- EXO.app version 1.0.70 (CFBundleVersion 1000070999)
- Network: 192.168.5.0/24 (Studio-AI-Yellow/Purple/Green/Red)
- Thunderbolt RDMA mesh (4 of 6 edges connected; Purple↔Red cable missing — this has been stable for other models)
Problem
Cluster entered a runner-churn deadlock state for a single `MlxJacclInstance` (Qwen3-Coder-480B-A35B-Instruct-4bit). For a 4-node cluster serving 1 instance we expect ~4 runners; instead we have 16 runners layered up and not converging.
Inference requests to `/v1/chat/completions` submitted against this instance hang indefinitely with no response body.
State snapshot (sampled every 30s for 2 minutes, identical each sample)
inst=1 runners=16 tasks=98 | RunnerConnected=1 RunnerReady=6 RunnerShuttingDown=9
- Instance: `MlxJacclInstance` id 7519b4af-9c05-4c35-8800-b98579350341, model `mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit`
- Runners: 16 total on this single instance
  - 9 in `RunnerShuttingDown` (frozen — none completed teardown)
  - 6 in `RunnerReady` (cannot serve alone — shards incomplete)
  - 1 in `RunnerConnected`
- Tasks: 98 pending, all of type `CreateRunner` (new replacements queued behind the shutdowns)
- Zero state transitions across 5 samples spanning 2+ minutes — this is a hard deadlock, not an in-progress rebalance
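The "identical across samples" criterion we applied by hand can be expressed as a small check. A minimal sketch (not EXO code; the summary-line format is just the one shown above):

```python
def is_hard_deadlock(samples, min_samples=5):
    """Treat the cluster as hard-deadlocked when the runner-state summary
    is byte-identical across the last `min_samples` chronological samples,
    as opposed to changing counts during an in-progress rebalance.

    samples: list of state-summary strings, oldest first.
    """
    if len(samples) < min_samples:
        return False  # not enough evidence yet
    return len(set(samples[-min_samples:])) == 1

snap = ("inst=1 runners=16 tasks=98 | "
        "RunnerConnected=1 RunnerReady=6 RunnerShuttingDown=9")
print(is_hard_deadlock([snap] * 5))  # → True: five identical samples
```

A rebalancing cluster would show the state counts shifting between samples and fail this check.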
What's NOT happening
- No crash reports in `~/Library/Logs/DiagnosticReports/` on any node
- All 4 peers reachable via ping (link-local + LAN)
- EXO.app main process + fork worker alive on every node with normal memory footprint (24–26% of 256GB, consistent with 4-way shard of the 275GB 4bit model)
- Fork workers are NOT CPU-spinning (the earlier JACCL all_reduce spin-deadlock pattern from 2026-02-10 showed 100% CPU on fork workers in `tbt_poll_cq`; this is different — workers are idle)
Suspected cause
Something queued a mass `CreateRunner` wave while a prior runner generation was mid-teardown. The replacements are blocked waiting for shutdown completion, and the shutdowns appear to be blocked waiting for a coordinator that is part of the new generation — a circular wait, so the deadlock is pinned.
The 6 `RunnerReady` runners cannot serve requests because a 480B model sharded across 4 nodes needs the full coordinated runner set; it can't degrade gracefully to a partial set.
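The suspected circular wait can be reduced to a toy model. This is a sketch of the hypothesized dependency cycle, not EXO's actual scheduler code; the two events stand in for "old generation finished teardown" and "new-generation coordinator is up":

```python
import threading

# Hypothesized cycle: replacement runners block on old-generation
# teardown, while the old generation's teardown blocks on a
# coordinator that only exists once the new generation is up.
old_teardown_done = threading.Event()
new_coordinator_up = threading.Event()

def old_generation_teardown():
    # Shutdown path: waits on a coordinator from the NEW generation.
    if new_coordinator_up.wait(timeout=0.2):
        old_teardown_done.set()

def new_generation_create():
    # CreateRunner replacements: wait for the old generation to finish.
    if old_teardown_done.wait(timeout=0.2):
        new_coordinator_up.set()

t1 = threading.Thread(target=old_generation_teardown)
t2 = threading.Thread(target=new_generation_create)
t1.start(); t2.start()
t1.join(); t2.join()

# Neither side can make progress; both timeouts expire with no event set.
deadlocked = not (old_teardown_done.is_set() or new_coordinator_up.is_set())
print("deadlocked:", deadlocked)  # → deadlocked: True
```

If this model is right, the 98 queued `CreateRunner` tasks can never drain, which matches the frozen snapshot above.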
EXO.app log visibility caveat
EXO.app fork workers have stdout/stderr redirected to `/dev/null` — Python tracebacks, if any, are invisible to `log show` and to `ps`. We have no stack trace from the shutting-down runners. It would be helpful if EXO.app redirected worker stdio to a per-pid log file under `~/Library/Logs/EXO/` so these states were diagnosable.
Reproduction
Load Qwen3-Coder-480B-A35B-Instruct-4bit on a 4-node M3 Ultra cluster. Submit inference requests via a chat client (OpenCode TUI in our case). Intermittent — this particular instance has been running for a while and previous requests succeeded.
Workaround
Rebooting all 4 nodes clears the state.