On a single-node macOS deployment, calling `POST /place_instance` for `mlx-community/Qwen3-30B-A3B-4bit` (architecture `Qwen3MoeForCausalLM`, `model_type: qwen3_moe`) never results in a running instance. The command is accepted with a `command_id`, but one of two failure modes occurs:

- **Scenario A** (when a prior instance was placed): 4 tasks are created — `CreateRunner`, `LoadModel`, and `StartWarmup` all report `taskStatus: Complete`, then a `Shutdown` task appears with `taskStatus: Running` and stays stuck there indefinitely. The runner state becomes `RunnerShuttingDown`. No instance ever registers. `/v1/chat/completions` returns 404 `No instance found for model...`.
- **Scenario B** (fresh exo state after clean restart + `event_log` cleared): zero tasks are generated. The planner appears to silently drop the command. Runner count stays 0, instance count stays 0.

The identical workflow works correctly for `mlx-community/Qwen3.5-9B-8bit` (dense `Qwen3_5ForConditionalGeneration`) on the same hardware.
## Environment

| Item | Value |
| --- | --- |
| EXO.app version | 1.0.70 (build 1000070999) |
| exo API version | exo v1.0 (from `/ollama/api/version`) |
| Install path | `/Applications/EXO.app` (standard macOS app bundle) |
| Mac | MacBook Pro Mac16,8 (M4 Pro, 12-core CPU) |
| RAM | 24 GB unified memory |
| macOS | 26.4.1 (build 25E253), Darwin 25.4.0 arm64 |
| iogpu.wired_limit_mb | 23552 (bumped from default ~16 GB) |
| Nodes | single-node (no cluster, no Thunderbolt peers) |
## The model

| Field | Value |
| --- | --- |
| Model id | `mlx-community/Qwen3-30B-A3B-4bit` |
| Quantization | 4-bit |
| Architecture | `Qwen3MoeForCausalLM` |
| model_type | `qwen3_moe` |
| Layers | 48 |
| Hidden size | 2048 |
| Experts | 128 total / 8 active per token |
| Context length | 32768 |
| On-disk size | 16,797 MB (17.6 GB reported by model card) |
| Files on disk | 4 safetensors shards (5.0 GB + 5.0 GB + 4.9 GB + 1.1 GB), `model.safetensors.index.json`, `chat_template.jinja`, `tokenizer.json`, etc. No `.partial` or `.incomplete` files. |
| Tensors | 1,351 (confirmed via `model.safetensors.index.json` `weight_map`) |

Downloaded via `huggingface_hub.snapshot_download` to `~/.exo/models/mlx-community--Qwen3-30B-A3B-4bit/`. File integrity verified.
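The tensor-count and shard checks above can be scripted. A minimal sketch — the `summarize_index` helper and the toy index are ours for illustration, not part of exo; the dict shape matches a real `model.safetensors.index.json`:

```python
import json

def summarize_index(index: dict) -> tuple[int, list[str]]:
    """Count tensors and list the shard files referenced by a
    model.safetensors.index.json-style dict."""
    weight_map = index["weight_map"]
    shards = sorted(set(weight_map.values()))
    return len(weight_map), shards

# Toy example with the same structure as a real index file.
# For the real check: index = json.load(open(".../model.safetensors.index.json"))
toy = {"weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors",
}}
count, shards = summarize_index(toy)
print(count, shards)  # 2 ['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors']
```

Against the real index this should report 1,351 tensors across the 4 shards listed above.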
## exo's architecture registration

`Qwen3MoeForCausalLM` IS registered in the bundled `model_cards.py`:

```python
# /Applications/EXO.app/Contents/Resources/exo/_internal/exo/shared/models/model_cards.py:260
@property
def supports_tensor(self) -> bool:
    return self.architectures in [
        ["Glm4MoeLiteForCausalLM"],
        ["GlmMoeDsaForCausalLM"],
        ["DeepseekV32ForCausalLM"],
        ["DeepseekV3ForCausalLM"],
        ["Qwen3NextForCausalLM"],
        ["Qwen3MoeForCausalLM"],  # ← present
        ["Qwen3_5MoeForConditionalGeneration"],
        ...
    ]
```

The model card lists `capabilities: [text, thinking, thinking_toggle]` and `supports_tensor: true`, so exo considers the model valid and launch-ready at the catalog level.
## Memory math (relevant for KV cache fit at 32K context)

| Context | Weights | KV cache | Total | Fits 20 GB cap? | Fits 23 GB cap? |
| --- | --- | --- | --- | --- | --- |
| 2K | 17.61 GB | 0.20 GB | 17.81 GB | ✓ | ✓ |
| 8K | 17.61 GB | 0.81 GB | 18.42 GB | ✓ | ✓ |
| 16K | 17.61 GB | 1.61 GB | 19.22 GB | ✓ | ✓ |
| 32K (default) | 17.61 GB | 3.22 GB | 20.83 GB | ✗ | ✓ |
KV calc: `2 * 48 layers * 4 kv_heads * 128 head_dim * 2 bytes * context_tokens`.

We bumped `iogpu.wired_limit_mb` from 20480 to 23552 for this reason. At the 23 GB cap there's ~2 GB of headroom, so memory pressure should not be a factor.
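The table can be reproduced from the KV formula above. One assumption in this sketch: the 16,797 MB on-disk figure is treated as MiB, which reproduces the table's 17.61 (decimal) GB weights figure:

```python
# Reproduces the memory-math table. KV formula per the report:
# 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (fp16).
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 48, 4, 128, 2
WEIGHTS_GB = 16_797 * 2**20 / 1e9  # 16,797 MiB on disk -> ~17.61 decimal GB

def kv_cache_gb(context_tokens: int) -> float:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # 98,304 bytes/token
    return per_token * context_tokens / 1e9

for ctx in (2048, 8192, 16384, 32768):
    total = WEIGHTS_GB + kv_cache_gb(ctx)
    print(f"{ctx // 1024}K: KV {kv_cache_gb(ctx):.2f} GB, total {total:.2f} GB")
```

This prints the same KV and total columns as the table (e.g. 3.22 GB KV and 20.83 GB total at 32K).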
## Reproduction

- Download the model (exo doesn't bundle it):

  ```shell
  python3 -c "from huggingface_hub import snapshot_download; snapshot_download('mlx-community/Qwen3-30B-A3B-4bit', local_dir='/Users/you/.exo/models/mlx-community--Qwen3-30B-A3B-4bit')"
  ```

- Bump the Metal wired memory cap so 32K context fits:

  ```shell
  sudo sysctl iogpu.wired_limit_mb=23552
  ```

- Clean-restart exo to ensure no stale state:

  ```shell
  osascript -e 'quit app "EXO"'
  pkill -9 -f "EXO.app" ; pkill -9 -f "exo/_internal"
  mv ~/.exo/event_log ~/.exo/event_log.bak-$(date +%s)
  open -a EXO
  ```

- Wait for the API to come up (`curl http://localhost:52415/v1/models`).
- Place the instance:

  ```shell
  curl -X POST http://localhost:52415/place_instance \
    -H 'Content-Type: application/json' \
    -d '{"model_id":"mlx-community/Qwen3-30B-A3B-4bit","sharding":"Pipeline","instance_meta":"MlxRing","min_nodes":1}'
  ```

  Response: `{"message":"Command received.","command_id":"...","model_card":{...}}` (200 OK)

- Watch `/state` for 2 minutes. Observed: `instances: 0`, `tasks: 0`, `runners: 0` — no progress. `/v1/chat/completions` for the model returns 404 `No instance found`.

Repeat with `sharding: "Tensor"` — same result (0 tasks, silent no-op).
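The 2-minute `/state` watch can be scripted. A minimal sketch — the port and top-level field names follow this report's `/state` output; `summarize_state` and `watch` are our helpers, not an exo API:

```python
import json
import time
import urllib.request

STATE_URL = "http://localhost:52415/state"  # port as used throughout this report

def summarize_state(state: dict) -> str:
    """One-line counts summary of a /state snapshot."""
    return (f"instances={len(state.get('instances', []))} "
            f"tasks={len(state.get('tasks', []))} "
            f"runners={len(state.get('runners', []))}")

def watch(seconds: int = 120, interval: int = 5) -> None:
    """Poll /state during a reproduction. Call this while exo is running."""
    for _ in range(seconds // interval):
        with urllib.request.urlopen(STATE_URL) as resp:
            print(summarize_state(json.load(resp)))
        time.sleep(interval)

# Offline demo on a snapshot shaped like Scenario B's observation:
print(summarize_state({"instances": [], "tasks": [], "runners": {}}))
# instances=0 tasks=0 runners=0
```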
## Scenario A (the "almost works" case)

When Scenario B's clean-restart procedure is NOT followed — instead a previous `Qwen3.5-9B-8bit` instance was placed and then deleted via `DELETE /instance/{id}` — a subsequent `place_instance` for the 30B sometimes produces:

```
tasks:
  CreateRunner  instanceId=bd631bea...  runnerId=-         taskStatus=Complete
  LoadModel     instanceId=bd631bea...  runnerId=-         taskStatus=Complete
  StartWarmup   instanceId=bd631bea...  runnerId=-         taskStatus=Complete
  Shutdown      instanceId=bd631bea...  runnerId=79efe89b  taskStatus=Running   ← stuck
runners:
  79efe89b: {"RunnerShuttingDown": {}}
instances: 0
```

The `Shutdown` task never completes. The instance never registers. `/v1/chat/completions` continues to return 404. No error message appears in `log show --process EXO`, in the `/events` SSE stream, or in `~/.exo/event_log`.
For comparison, the same workflow with the dense model on the same hardware:

```
tasks:
  CreateRunner  instanceId=9582933a...  runnerId=56d13393  taskStatus=Complete
  LoadModel     instanceId=9582933a...  taskStatus=Complete
  StartWarmup   instanceId=9582933a...  taskStatus=Complete
instances: 1   ← registered, serves /v1/chat/completions in <10 s
```
## Expected behavior

`place_instance` for Qwen3-30B-A3B-4bit should either:

- Produce `CreateRunner → LoadModel → StartWarmup → [register instance]`, with the instance appearing in `/state.instances` and serving requests, or
- Produce an explicit failure event (with a descriptive `error_message`) in `/events` or `log show` so operators know what's wrong.
## Actual behavior

Either zero tasks are generated (silent no-op), or the runner tears itself down after warmup (`Shutdown` task stuck `Running`, runner stuck `RunnerShuttingDown`) — with no surfaced error anywhere.
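The two failure modes are mechanically distinguishable from a single `/state` snapshot. A triage sketch — the field names (`tasks`, `runners`, `instances`, `taskStatus`) follow the `/state` output quoted above; the function and labels are ours:

```python
def classify(state: dict) -> str:
    """Label a /state snapshot as Scenario A, Scenario B, or healthy."""
    tasks = state.get("tasks", [])
    if not tasks and not state.get("runners") and not state.get("instances"):
        return "scenario B: silent no-op (planner generated no tasks)"
    stuck = [t for t in tasks
             if t.get("type") == "Shutdown" and t.get("taskStatus") == "Running"]
    if stuck and not state.get("instances"):
        return "scenario A: post-warmup teardown (Shutdown stuck Running)"
    if state.get("instances"):
        return "ok: instance registered"
    return "in progress / unknown"

print(classify({"tasks": [], "runners": {}, "instances": []}))
# scenario B: silent no-op (planner generated no tasks)
```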
## Ruled out

- Memory pressure (bumped `iogpu.wired_limit_mb` to 23552; 32K full-context still fits with headroom)
- Unregistered architecture (`Qwen3MoeForCausalLM` present in `model_cards.py`)
- Incomplete download (no `.incomplete` files)
- Sharding mode (same result with both `Pipeline` and `Tensor`)

## Not yet verified

- Whether other `qwen3_moe` models (e.g., `Qwen3-Coder-30B-A3B-Instruct-4bit`, `Qwen3-30B-A3B-Thinking-2507-4bit`) fail the same way
- Why none of `log show --process EXO`, `~/.exo/event_log`, `/events`, or stdout in the launcher surfaced any error message from the python runner during any of the attempts
## Diagnostic data available on request

- Full `/state` JSON captured each second during a Scenario A reproduction
- Full `/events` SSE stream capture for both scenarios
- `/openapi.json` from this exo build
- `log show --process EXO --last 5m` (largely empty of relevant content)
## Impact for users

Qwen3 MoE models are a significant size/quality sweet spot for single-M4 Pro setups (30B capability at 3B-active compute cost). If the MoE placement path is broken, these models are effectively unusable through the exo API despite showing up as `capabilities: [text, thinking, thinking_toggle]` in the catalog. Users have to guess from the silent failure that they won't work.
## Suggested fixes regardless of root cause

- **Surface task errors.** When a task transitions to `Shutdown` unexpectedly post-warmup, emit an `InstanceFailed` event with a reason code.
- **Fail `place_instance` loudly** when the planner declines to generate tasks. Currently `place_instance` always returns "Command received." even for commands that will never execute.
- **Mark known-broken architectures** in the model card instead of showing them as fully capable. Or add a dry-run preflight endpoint (`POST /place_instance?dry_run=true`) that surfaces what would be attempted.
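For the first suggestion, a sketch of what such a failure event might look like on `/events`. The shape, field names, and reason code here are hypothetical, not an existing exo schema:

```json
{
  "type": "InstanceFailed",
  "instance_id": "bd631bea-...",
  "runner_id": "79efe89b",
  "reason_code": "RUNNER_SHUTDOWN_AFTER_WARMUP",
  "error_message": "runner exited during post-warmup registration",
  "task_trace": ["CreateRunner", "LoadModel", "StartWarmup", "Shutdown"]
}
```

Even a minimal event of this sort would have turned both scenarios in this report from silent failures into diagnosable ones.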