Environment
- 4x Mac Studio (Apple M3 Ultra, 256GB unified memory each)
- macOS 26.3 (Build 25D125)
- EXO.app version 1.0.70 (CFBundleVersion 1000070999)
- Network: 192.168.5.0/24 (Studio-AI-Yellow/Purple/Green/Red)
- Thunderbolt RDMA mesh (4 of 6 edges connected; Purple↔Red cable missing — this has been stable for other models)
Problem
Cluster entered a runner-churn deadlock state for a single `MlxJacclInstance` (Qwen3-Coder-480B-A35B-Instruct-4bit). For a 4-node cluster serving 1 instance we expect ~4 runners; instead we have 16 runners layered up and not converging.
Inference requests to `/v1/chat/completions` submitted against this instance hang indefinitely with no response body.
State snapshot (sampled every 30s for 2 minutes, identical each sample)
inst=1 runners=16 tasks=98 | RunnerConnected=1 RunnerReady=6 RunnerShuttingDown=9
- Instance: `MlxJacclInstance` id 7519b4af-9c05-4c35-8800-b98579350341, model `mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit`
- Runners: 16 total on this single instance
  - 9 in `RunnerShuttingDown` (frozen — none completed teardown)
  - 6 in `RunnerReady` (cannot serve alone — shards incomplete)
  - 1 in `RunnerConnected`
- Tasks: 98 pending, all of type `CreateRunner` (new replacements queued behind the shutdowns)
- Zero state transitions across 5 samples spanning 2+ minutes — this is a hard deadlock, not an in-progress rebalance
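The "identical across samples" criterion we applied by hand can be expressed as a small check. A minimal sketch (not EXO code; the summary-line format is just the one shown above):

```python
def is_hard_deadlock(samples, min_samples=5):
    """Treat the cluster as hard-deadlocked when the runner-state summary
    is byte-identical across the last `min_samples` chronological samples,
    as opposed to changing counts during an in-progress rebalance.

    samples: list of state-summary strings, oldest first.
    """
    if len(samples) < min_samples:
        return False  # not enough evidence yet
    return len(set(samples[-min_samples:])) == 1

snap = ("inst=1 runners=16 tasks=98 | "
        "RunnerConnected=1 RunnerReady=6 RunnerShuttingDown=9")
print(is_hard_deadlock([snap] * 5))  # → True: five identical samples
```

A rebalancing cluster would show the state counts shifting between samples and fail this check.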
What's NOT happening
- No crash reports in `~/Library/Logs/DiagnosticReports/` on any node
- All 4 peers reachable via ping (link-local + LAN)
- EXO.app main process + fork worker alive on every node with normal memory footprint (24–26% of 256GB, consistent with 4-way shard of the 275GB 4bit model)
- Fork workers are NOT CPU-spinning (the earlier JACCL all_reduce spin-deadlock pattern from 2026-02-10 showed 100% CPU on fork workers in `tbt_poll_cq`; this is different — workers are idle)
Suspected cause
Something queued a mass `CreateRunner` wave while a prior runner generation was mid-teardown. The replacements are blocked waiting for shutdown completion, and the shutdowns appear to be blocked waiting for a coordinator that is part of the new generation — a circular wait, so the deadlock is pinned.
The 6 `RunnerReady` runners cannot serve requests because a 480B model sharded across 4 nodes needs the full coordinated runner set; it can't degrade gracefully to a partial set.
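The suspected circular wait can be reduced to a toy model. This is a sketch of the hypothesized dependency cycle, not EXO's actual scheduler code; the two events stand in for "old generation finished teardown" and "new-generation coordinator is up":

```python
import threading

# Hypothesized cycle: replacement runners block on old-generation
# teardown, while the old generation's teardown blocks on a
# coordinator that only exists once the new generation is up.
old_teardown_done = threading.Event()
new_coordinator_up = threading.Event()

def old_generation_teardown():
    # Shutdown path: waits on a coordinator from the NEW generation.
    if new_coordinator_up.wait(timeout=0.2):
        old_teardown_done.set()

def new_generation_create():
    # CreateRunner replacements: wait for the old generation to finish.
    if old_teardown_done.wait(timeout=0.2):
        new_coordinator_up.set()

t1 = threading.Thread(target=old_generation_teardown)
t2 = threading.Thread(target=new_generation_create)
t1.start(); t2.start()
t1.join(); t2.join()

# Neither side can make progress; both timeouts expire with no event set.
deadlocked = not (old_teardown_done.is_set() or new_coordinator_up.is_set())
print("deadlocked:", deadlocked)  # → deadlocked: True
```

If this model is right, the 98 queued `CreateRunner` tasks can never drain, which matches the frozen snapshot above.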
EXO.app log visibility caveat
EXO.app fork workers have stdout/stderr redirected to `/dev/null` — Python tracebacks, if any, are invisible to `log show` and to `ps`. We have no stack trace from the shutting-down runners. It would be helpful if EXO.app redirected worker stdio to a per-pid log file under `~/Library/Logs/EXO/` so these states were diagnosable.
Reproduction
Load Qwen3-Coder-480B-A35B-Instruct-4bit on a 4-node M3 Ultra cluster. Submit inference requests via a chat client (OpenCode TUI in our case). Intermittent — this particular instance has been running for a while and previous requests succeeded.
Workaround
Rebooting all 4 nodes clears the state.