[BUG] : place_instance silently fails for qwen3_moe architecture (Qwen3-30B-A3B-4bit) #1937

@nhwaani

Description

On a single-node macOS deployment, calling POST /place_instance for mlx-community/Qwen3-30B-A3B-4bit (architecture Qwen3MoeForCausalLM, model_type: qwen3_moe) never results in a running instance. The command is accepted with a command_id, but one of two failure modes occurs:

  • Scenario A (a prior instance had been placed): four tasks are created. CreateRunner, LoadModel, and StartWarmup all report taskStatus: Complete; then a Shutdown task appears with taskStatus: Running and stays stuck there indefinitely. The runner state becomes RunnerShuttingDown. No instance ever registers, and /v1/chat/completions returns 404 No instance found for model....
  • Scenario B (fresh exo state after a clean restart with the event_log cleared): zero tasks are generated. The planner appears to silently drop the command; the runner count stays 0 and the instance count stays 0.

Identical workflow works correctly for mlx-community/Qwen3.5-9B-8bit (dense Qwen3_5ForConditionalGeneration) on the same hardware.

Environment

Item                   Value
EXO.app version        1.0.70 (build 1000070999)
exo API version        exo v1.0 (from /ollama/api/version)
Install path           /Applications/EXO.app (standard macOS app bundle)
Mac                    MacBook Pro Mac16,8 (M4 Pro, 12-core CPU)
RAM                    24 GB unified memory
macOS                  26.4.1 (build 25E253), Darwin 25.4.0 arm64
iogpu.wired_limit_mb   23552 (bumped from default ~16 GB)
Nodes                  single-node (no cluster, no Thunderbolt peers)

The model

Field            Value
Model id         mlx-community/Qwen3-30B-A3B-4bit
Quantization     4-bit
Architecture     Qwen3MoeForCausalLM
model_type       qwen3_moe
Layers           48
Hidden size      2048
Experts          128 total / 8 active per token
Context length   32768
On-disk size     16,797 MB (17.6 GB reported by model card)
Files on disk    4 safetensors shards (5.0 GB + 5.0 GB + 4.9 GB + 1.1 GB), model.safetensors.index.json, chat_template.jinja, tokenizer.json, etc. No .partial or .incomplete files.
Tensors          1,351 confirmed via the model.safetensors.index.json weight_map

Downloaded via huggingface_hub.snapshot_download to ~/.exo/models/mlx-community--Qwen3-30B-A3B-4bit/. File integrity verified.
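The tensor-count check above can be reproduced with a few lines of Python. This is a minimal sketch, not exo tooling; the only assumption is the download path used in the repro steps:

```python
import json
from pathlib import Path

def count_tensors(index: dict) -> int:
    """Number of tensor entries in a safetensors index's weight_map."""
    return len(index["weight_map"])

def shard_files(index: dict) -> set:
    """Distinct shard files referenced by the weight_map."""
    return set(index["weight_map"].values())

# Against the real download (same path as the repro steps):
# idx = json.loads(
#     Path("~/.exo/models/mlx-community--Qwen3-30B-A3B-4bit/"
#          "model.safetensors.index.json").expanduser().read_text()
# )
# count_tensors(idx)      # reported 1,351 on this machine
# len(shard_files(idx))   # 4 shards
```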

exo's architecture registration

Qwen3MoeForCausalLM IS registered in the bundled model_cards.py:

# /Applications/EXO.app/Contents/Resources/exo/_internal/exo/shared/models/model_cards.py:260
@property
def supports_tensor(self) -> bool:
    return self.architectures in [
        ["Glm4MoeLiteForCausalLM"],
        ["GlmMoeDsaForCausalLM"],
        ["DeepseekV32ForCausalLM"],
        ["DeepseekV3ForCausalLM"],
        ["Qwen3NextForCausalLM"],
        ["Qwen3MoeForCausalLM"],          # ← present
        ["Qwen3_5MoeForConditionalGeneration"],
        ...
    ]

The model card lists capabilities: [text, thinking, thinking_toggle] and supports_tensor: true, so exo considers the model valid and launch-ready at the catalog level.

Memory math (relevant for KV cache fit at 32K context)

Context        Weights    KV cache   Total      Fits 20 GB cap?   Fits 23 GB cap?
2K             17.61 GB   0.20 GB    17.81 GB   Yes               Yes
8K             17.61 GB   0.81 GB    18.42 GB   Yes               Yes
16K            17.61 GB   1.61 GB    19.22 GB   Yes               Yes
32K (default)  17.61 GB   3.22 GB    20.83 GB   No                Yes

KV calc: 2 * 48 layers * 4 kv_heads * 128 head_dim * 2 bytes * context_tokens.

We bumped iogpu.wired_limit_mb from 20480 to 23552 for this reason. At 23 GB cap there's ~2 GB headroom; memory pressure should not be a factor.
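The table can be regenerated from the model-card values with a short script (decimal GB, fp16 K/V entries, constants taken from the model table above):

```python
# KV-cache estimate for Qwen3-30B-A3B-4bit, using the model-card values.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
KV_BYTES = 2        # fp16 entry size
WEIGHTS_GB = 17.61  # 4-bit weights

def kv_gb(context_tokens: int) -> float:
    # Leading 2 accounts for K and V; result in decimal GB.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES * context_tokens / 1e9

for ctx in (2048, 8192, 16384, 32768):
    total = WEIGHTS_GB + kv_gb(ctx)
    print(f"{ctx:>6} tokens: KV {kv_gb(ctx):.2f} GB, total {total:.2f} GB")
```

At 32K this reproduces the 3.22 GB KV / 20.83 GB total figures above, i.e. over a 20 GB cap but under 23 GB.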

Reproduction

  1. Download the model (exo doesn't bundle it):
    python3 -c "from huggingface_hub import snapshot_download; snapshot_download('mlx-community/Qwen3-30B-A3B-4bit', local_dir='/Users/you/.exo/models/mlx-community--Qwen3-30B-A3B-4bit')"
  2. Bump Metal wired memory cap so 32K context fits:
    sudo sysctl iogpu.wired_limit_mb=23552
  3. Clean-restart exo to ensure no stale state:
    osascript -e 'quit app "EXO"'
    pkill -9 -f "EXO.app" ; pkill -9 -f "exo/_internal"
    mv ~/.exo/event_log ~/.exo/event_log.bak-$(date +%s)
    open -a EXO
  4. Wait for API up (curl http://localhost:52415/v1/models).
  5. Place the instance:
    curl -X POST http://localhost:52415/place_instance \
      -H 'Content-Type: application/json' \
      -d '{"model_id":"mlx-community/Qwen3-30B-A3B-4bit","sharding":"Pipeline","instance_meta":"MlxRing","min_nodes":1}'
    Response: {"message":"Command received.","command_id":"...","model_card":{...}} (200 OK)
  6. Watch /state for 2 minutes. Observed: instances: 0, tasks: 0, runners: 0 — no progress. /v1/chat/completions for the model returns 404 No instance found.

Repeat with sharding: "Tensor" — same result (0 tasks, silent no-op).
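For anyone reproducing this, the two failure modes can be distinguished mechanically by polling /state. A sketch follows; the key names (tasks, instances, taskType, taskStatus) are assumptions inferred from the /state excerpts in this report, not a documented exo schema:

```python
import json
import time
import urllib.request

def classify(state: dict) -> str:
    """Label a /state snapshot with one of the failure modes above.
    Field names are guesses based on the excerpts in this report."""
    tasks = state.get("tasks") or []
    if state.get("instances"):
        return "ok"
    if not tasks:
        return "scenario-B (silent no-op)"
    if any(t.get("taskType") == "Shutdown" and t.get("taskStatus") == "Running"
           for t in tasks):
        return "scenario-A (stuck Shutdown)"
    return "in-progress"

# Poll a live node once per second during a reproduction:
# while True:
#     with urllib.request.urlopen("http://localhost:52415/state") as r:
#         print(classify(json.load(r)))
#     time.sleep(1)
```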

Scenario A (the "almost works" case)

When Scenario B's clean-restart procedure is NOT followed, and instead a previous Qwen3.5-9B-8bit instance was placed and then deleted via DELETE /instance/{id}, a subsequent place_instance for the 30B sometimes produces:

tasks:
  CreateRunner   instanceId=bd631bea... runnerId=-        taskStatus=Complete
  LoadModel      instanceId=bd631bea... runnerId=-        taskStatus=Complete
  StartWarmup    instanceId=bd631bea... runnerId=-        taskStatus=Complete
  Shutdown       instanceId=bd631bea... runnerId=79efe89b taskStatus=Running   ← stuck
runners:
  79efe89b: {"RunnerShuttingDown": {}}
instances: 0

The Shutdown task never completes. The instance never registers. /v1/chat/completions continues to return 404. No error message appears in log show --process EXO, in /events SSE stream, or in ~/.exo/event_log.

The same dense-model workflow on the same hardware:

tasks:
  CreateRunner   instanceId=9582933a... runnerId=56d13393 taskStatus=Complete
  LoadModel      instanceId=9582933a... taskStatus=Complete
  StartWarmup    instanceId=9582933a... taskStatus=Complete
instances: 1   ← registered, serves /v1/chat/completions in <10 s

Expected behavior

place_instance for Qwen3-30B-A3B-4bit should either:

  1. Produce CreateRunner → LoadModel → StartWarmup → [register instance] with the instance appearing in /state.instances and serving requests, or
  2. Produce an explicit failure event (with a descriptive error_message) in /events or log show so operators know what's wrong.

Actual behavior

Either zero tasks are generated (silent no-op) or the runner tears itself down after warmup (Shutdown task stuck Running, runner stuck RunnerShuttingDown) — with no surfaced error anywhere.

Ruled out

  • Memory (bumped iogpu.wired_limit_mb to 23552; 32K full-context still fits with headroom)
  • Missing architecture (Qwen3MoeForCausalLM present in model_cards.py)
  • Corrupt weights (16 GB verified, 1,351 tensors in safetensors.index.json, no .incomplete files)
  • Wrong sharding mode (tried both Pipeline and Tensor)
  • Stale event log (cleared, restart, same behavior)
  • Stuck previous instance (clean kill + event_log move-aside, same behavior)
  • Network issues (localhost single-node, no peers)
  • HF rate limit / missing files (all shards present, download complete)

Not yet verified

  • Whether the dashboard UI Launch button uses a different code path that succeeds
  • Whether the same bug affects other qwen3_moe models (e.g., Qwen3-Coder-30B-A3B-Instruct-4bit, Qwen3-30B-A3B-Thinking-2507-4bit)
  • Whether starting a runner process manually with the right arguments bypasses the master
  • Where exo writes Python stderr — none of log show --process EXO, ~/.exo/event_log, /events, or stdout in the launcher surfaced any error message from the python runner during any of the attempts

Diagnostic data available on request

  • Full content of /state JSON at each second during a Scenario A reproduction
  • Full /events SSE stream capture for both scenarios
  • /openapi.json from this exo build
  • log show --process EXO --last 5m (largely empty of relevant content)

Impact for users

Qwen3 MoE models are a significant size/quality sweet spot for single-M4 Pro setups (30B capability at 3B-active compute cost). If the MoE placement path is broken, these models are effectively unusable through the exo API despite showing up as capabilities: [text, thinking, thinking_toggle] in the catalog. Users have to guess they won't work based on the silent failure.

Suggested fixes regardless of root cause

  1. Surface task errors. When a runner unexpectedly transitions to Shutdown after warmup completes, emit an InstanceFailed event with a reason code.
  2. Fail place_instance loudly when the planner declines to generate tasks. Currently place_instance always returns "Command received" even for commands that will never execute.
  3. Mark known-broken architectures in the model card instead of showing them as fully capable. Or add a dry-run preflight endpoint (POST /place_instance?dry_run=true) that surfaces what would be attempted.
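To make fixes 2 and 3 concrete, here is a hypothetical sketch of a planner that returns explicit rejection reasons instead of silently generating zero tasks. All names here (PlacementPlan, plan_placement) are invented for illustration; exo's actual planner API is not visible from this report:

```python
from dataclasses import dataclass, field

@dataclass
class PlacementPlan:
    tasks: list = field(default_factory=list)
    rejections: list = field(default_factory=list)  # human-readable reasons

def plan_placement(card: dict, sharding: str,
                   free_gb: float, need_gb: float) -> PlacementPlan:
    """Return either a task list or the reasons placement was declined."""
    plan = PlacementPlan()
    if sharding == "Tensor" and not card.get("supports_tensor"):
        plan.rejections.append("architecture does not support tensor sharding")
    if need_gb > free_gb:
        plan.rejections.append(
            f"needs {need_gb:.1f} GB but only {free_gb:.1f} GB under the wired limit")
    if not plan.rejections:
        plan.tasks = ["CreateRunner", "LoadModel", "StartWarmup"]
    return plan
```

A dry_run=true variant of place_instance could return plan.rejections verbatim, turning Scenario B's silent no-op into an actionable 4xx.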

Labels: bug