Runner churn deadlock: 16 runners stuck (9 ShuttingDown / 6 Ready / 1 Connected), 98 CreateRunner tasks pending, zero progress over 2+ minutes #1934

@lucasajackson

Description

Environment

  • 4x Mac Studio (Apple M3 Ultra, 256GB unified memory each)
  • macOS 26.3 (Build 25D125)
  • EXO.app version 1.0.70 (CFBundleVersion 1000070999)
  • Network: 192.168.5.0/24 (Studio-AI-Yellow/Purple/Green/Red)
  • Thunderbolt RDMA mesh (4 of 6 edges connected; Purple↔Red cable missing — this has been stable for other models)

Problem

Cluster entered a runner-churn deadlock for a single MlxJacclInstance (Qwen3-Coder-480B-A35B-Instruct-4bit). For a 4-node cluster serving 1 instance we expect ~4 runners; instead, 16 runners have accumulated and the cluster is not converging.

Inference requests to /v1/chat/completions submitted against this instance hang indefinitely with no response body.

State snapshot (sampled every 30s for 2 minutes, identical each sample)

inst=1 runners=16 tasks=98 | RunnerConnected=1 RunnerReady=6 RunnerShuttingDown=9
  • Instance: MlxJacclInstance id 7519b4af-9c05-4c35-8800-b98579350341, model mlx-community/Qwen3-Coder-480B-A35B-Instruct-4bit
  • Runners: 16 total on this single instance
    • 9 in RunnerShuttingDown (frozen — none completed teardown)
    • 6 in RunnerReady (cannot serve alone — shards incomplete)
    • 1 in RunnerConnected
  • Tasks: 98 pending, all of type CreateRunner (new replacements queued behind the shutdowns)
  • Zero state transitions across 5 samples spanning 2+ minutes — this is a hard deadlock, not an in-progress rebalance
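The zero-progress check above is mechanical enough to script. A minimal sketch (the snapshot dict format here is our own shorthand for the sampled counts, not an EXO API) that flags a hard deadlock when N consecutive samples are identical:

```python
def is_deadlocked(samples: list[dict], min_samples: int = 5) -> bool:
    """Flag a hard deadlock: state counts identical across all samples.

    Each sample maps runner state / queue names to counts, matching the
    snapshot above. Fewer than min_samples observations is treated as
    inconclusive (could still be an in-progress rebalance).
    """
    if len(samples) < min_samples:
        return False
    return all(s == samples[0] for s in samples[1:])

# Five identical 30s samples, as observed:
snap = {"RunnerConnected": 1, "RunnerReady": 6,
        "RunnerShuttingDown": 9, "pending_tasks": 98}
samples = [dict(snap) for _ in range(5)]
print(is_deadlocked(samples))  # → True: zero transitions over 2+ minutes
```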

What's NOT happening

  • No crash reports in ~/Library/Logs/DiagnosticReports/ on any node
  • All 4 peers reachable via ping (link-local + LAN)
  • EXO.app main process + fork worker alive on every node with normal memory footprint (24–26% of 256GB, consistent with 4-way shard of the 275GB 4bit model)
  • Fork workers are NOT CPU-spinning (earlier JACCL all_reduce spin-deadlock pattern from 2026-02-10 showed 100% CPU on fork workers in tbt_poll_cq; this is different — workers are idle)
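For reference, the CPU-spin check amounts to sampling per-process CPU% via `ps`. A rough sketch of how we checked (the `EXO` process-name pattern is a guess; adjust it to match the actual fork-worker command line):

```python
import subprocess

def parse_cpu(ps_output: str, pattern: str) -> list[float]:
    """Pull CPU% values for processes whose command matches `pattern`."""
    cpus = []
    for line in ps_output.splitlines():
        cpu, _, comm = line.strip().partition(" ")
        if cpu and pattern in comm:
            cpus.append(float(cpu))
    return cpus

def fork_worker_cpu(pattern: str = "EXO") -> list[float]:
    """Sample per-process CPU% with `ps`. In the 2026-02-10 tbt_poll_cq
    spin-deadlock these sat near 100%; here they are near 0%."""
    out = subprocess.run(["ps", "-axo", "%cpu=,comm="],
                         capture_output=True, text=True, check=True).stdout
    return parse_cpu(out, pattern)
```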

Suspected cause

Something queued a mass CreateRunner wave while a prior runner generation was mid-teardown. The replacements are blocked waiting for shutdown completion, while the shutdowns appear to be blocked waiting for a coordinator that belongs to the new generation: a circular wait, pinned indefinitely.

The 6 Ready runners cannot serve requests because a 480B model sharded across 4 nodes needs the full coordinated runner set; it can't degrade gracefully to a partial set.
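The suspected cycle can be modeled as a toy circular wait (illustrative only, not EXO code): the old generation's teardown waits on a coordinator that only comes up with the new generation, while the new generation's CreateRunner path waits on teardown completion. With short timeouts in place of indefinite waits:

```python
import threading

old_torn_down = threading.Event()
new_coordinator_up = threading.Event()

def old_generation_teardown():
    # Blocked: teardown needs the new generation's coordinator to ack.
    if new_coordinator_up.wait(timeout=0.2):
        old_torn_down.set()

def new_generation_create_runner():
    # Blocked: scheduler won't start replacements until teardown finishes.
    if old_torn_down.wait(timeout=0.2):
        new_coordinator_up.set()

t1 = threading.Thread(target=old_generation_teardown)
t2 = threading.Thread(target=new_generation_create_runner)
t1.start(); t2.start(); t1.join(); t2.join()

# Neither event ever fires: a classic circular wait.
print(old_torn_down.is_set(), new_coordinator_up.is_set())  # → False False
```

With real (timeout-free) waits, both sides block forever, which matches the frozen RunnerShuttingDown set and the 98 CreateRunner tasks queued behind it.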

EXO.app log visibility caveat

EXO.app fork workers run with stdout/stderr redirected to /dev/null, so Python tracebacks, if any, are invisible to `log show` (and ps shows only liveness). We have no stack trace from the shutting-down runners. It would help if EXO.app redirected worker stdio to a per-pid log file under ~/Library/Logs/EXO/ so these states were diagnosable.

Reproduction

Load Qwen3-Coder-480B-A35B-Instruct-4bit on a 4-node M3 Ultra cluster. Submit inference requests via a chat client (OpenCode TUI in our case). Intermittent — this particular instance has been running for a while and previous requests succeeded.

Workaround

Rebooting all 4 nodes clears the state.
