[BUG] : place_instance silently fails for qwen3_moe architecture (Qwen3-30B-A3B-4bit) #1937

@nhwaani

Description

On a single-node macOS deployment, calling POST /place_instance for mlx-community/Qwen3-30B-A3B-4bit (architecture Qwen3MoeForCausalLM, model_type: qwen3_moe) never results in a running instance. The command is accepted with a command_id, but one of two failure modes occurs:

  • Scenario A (a prior instance had been placed): four tasks are created. CreateRunner, LoadModel, and StartWarmup all report taskStatus: Complete; then a Shutdown task appears with taskStatus: Running and stays stuck there indefinitely. The runner state becomes RunnerShuttingDown. No instance ever registers, and /v1/chat/completions returns 404 No instance found for model....
  • Scenario B (fresh exo state after a clean restart with the event_log cleared): zero tasks are generated. The planner appears to silently drop the command; the runner count stays 0 and the instance count stays 0.

Identical workflow works correctly for mlx-community/Qwen3.5-9B-8bit (dense Qwen3_5ForConditionalGeneration) on the same hardware.

Environment

Item                   Value
EXO.app version        1.0.70 (build 1000070999)
exo API version        exo v1.0 (from /ollama/api/version)
Install path           /Applications/EXO.app (standard macOS app bundle)
Mac                    MacBook Pro Mac16,8 (M4 Pro, 12-core CPU)
RAM                    24 GB unified memory
macOS                  26.4.1 (build 25E253), Darwin 25.4.0 arm64
iogpu.wired_limit_mb   23552 (bumped from default ~16 GB)
Nodes                  single-node (no cluster, no Thunderbolt peers)

The model

Field            Value
Model id         mlx-community/Qwen3-30B-A3B-4bit
Quantization     4-bit
Architecture     Qwen3MoeForCausalLM
model_type       qwen3_moe
Layers           48
Hidden size      2048
Experts          128 total / 8 active per token
Context length   32768
On-disk size     16,797 MB (17.6 GB reported by model card)
Files on disk    4 safetensors shards (5.0 GB + 5.0 GB + 4.9 GB + 1.1 GB), model.safetensors.index.json, chat_template.jinja, tokenizer.json, etc. No .partial or .incomplete files.
Tensors          1,351 confirmed via the model.safetensors.index.json weight_map

Downloaded via huggingface_hub.snapshot_download to ~/.exo/models/mlx-community--Qwen3-30B-A3B-4bit/. File integrity verified.
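The tensor-count check above can be reproduced with a few lines of Python. This is a minimal sketch, not exo tooling; the only assumption is the download path used in the repro steps:

```python
import json
from pathlib import Path

def count_tensors(index: dict) -> int:
    """Number of tensor entries in a safetensors index's weight_map."""
    return len(index["weight_map"])

def shard_files(index: dict) -> set:
    """Distinct shard files referenced by the weight_map."""
    return set(index["weight_map"].values())

# Against the real download (same path as the repro steps):
# idx = json.loads(
#     Path("~/.exo/models/mlx-community--Qwen3-30B-A3B-4bit/"
#          "model.safetensors.index.json").expanduser().read_text()
# )
# count_tensors(idx)      # reported 1,351 on this machine
# len(shard_files(idx))   # 4 shards
```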

exo's architecture registration

Qwen3MoeForCausalLM IS registered in the bundled model_cards.py:

# /Applications/EXO.app/Contents/Resources/exo/_internal/exo/shared/models/model_cards.py:260
@property
def supports_tensor(self) -> bool:
    return self.architectures in [
        ["Glm4MoeLiteForCausalLM"],
        ["GlmMoeDsaForCausalLM"],
        ["DeepseekV32ForCausalLM"],
        ["DeepseekV3ForCausalLM"],
        ["Qwen3NextForCausalLM"],
        ["Qwen3MoeForCausalLM"],          # ← present
        ["Qwen3_5MoeForConditionalGeneration"],
        ...
    ]

The model card lists capabilities: [text, thinking, thinking_toggle] and supports_tensor: true, so exo considers the model valid and launch-ready at the catalog level.

Memory math (relevant for KV cache fit at 32K context)

Context        Weights    KV cache   Total      Fits 20 GB cap?   Fits 23 GB cap?
2K             17.61 GB   0.20 GB    17.81 GB   Yes               Yes
8K             17.61 GB   0.81 GB    18.42 GB   Yes               Yes
16K            17.61 GB   1.61 GB    19.22 GB   Yes               Yes
32K (default)  17.61 GB   3.22 GB    20.83 GB   No                Yes

KV calc: 2 * 48 layers * 4 kv_heads * 128 head_dim * 2 bytes * context_tokens.

We bumped iogpu.wired_limit_mb from 20480 to 23552 for this reason. At 23 GB cap there's ~2 GB headroom; memory pressure should not be a factor.
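The table can be regenerated from the model-card values with a short script (decimal GB, fp16 K/V entries, constants taken from the model table above):

```python
# KV-cache estimate for Qwen3-30B-A3B-4bit, using the model-card values.
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
KV_BYTES = 2        # fp16 entry size
WEIGHTS_GB = 17.61  # 4-bit weights

def kv_gb(context_tokens: int) -> float:
    # Leading 2 accounts for K and V; result in decimal GB.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES * context_tokens / 1e9

for ctx in (2048, 8192, 16384, 32768):
    total = WEIGHTS_GB + kv_gb(ctx)
    print(f"{ctx:>6} tokens: KV {kv_gb(ctx):.2f} GB, total {total:.2f} GB")
```

At 32K this reproduces the 3.22 GB KV / 20.83 GB total figures above, i.e. over a 20 GB cap but under 23 GB.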

Reproduction

  1. Download the model (exo doesn't bundle it):
    python3 -c "from huggingface_hub import snapshot_download; snapshot_download('mlx-community/Qwen3-30B-A3B-4bit', local_dir='/Users/you/.exo/models/mlx-community--Qwen3-30B-A3B-4bit')"
  2. Bump Metal wired memory cap so 32K context fits:
    sudo sysctl iogpu.wired_limit_mb=23552
  3. Clean-restart exo to ensure no stale state:
    osascript -e 'quit app "EXO"'
    pkill -9 -f "EXO.app" ; pkill -9 -f "exo/_internal"
    mv ~/.exo/event_log ~/.exo/event_log.bak-$(date +%s)
    open -a EXO
  4. Wait for API up (curl http://localhost:52415/v1/models).
  5. Place the instance:
    curl -X POST http://localhost:52415/place_instance \
      -H 'Content-Type: application/json' \
      -d '{"model_id":"mlx-community/Qwen3-30B-A3B-4bit","sharding":"Pipeline","instance_meta":"MlxRing","min_nodes":1}'
    Response: {"message":"Command received.","command_id":"...","model_card":{...}} (200 OK)
  6. Watch /state for 2 minutes. Observed: instances: 0, tasks: 0, runners: 0 — no progress. /v1/chat/completions for the model returns 404 No instance found.

Repeat with sharding: "Tensor" — same result (0 tasks, silent no-op).
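For anyone reproducing this, the two failure modes can be distinguished mechanically by polling /state. A sketch follows; the key names (tasks, instances, taskType, taskStatus) are assumptions inferred from the /state excerpts in this report, not a documented exo schema:

```python
import json
import time
import urllib.request

def classify(state: dict) -> str:
    """Label a /state snapshot with one of the failure modes above.
    Field names are guesses based on the excerpts in this report."""
    tasks = state.get("tasks") or []
    if state.get("instances"):
        return "ok"
    if not tasks:
        return "scenario-B (silent no-op)"
    if any(t.get("taskType") == "Shutdown" and t.get("taskStatus") == "Running"
           for t in tasks):
        return "scenario-A (stuck Shutdown)"
    return "in-progress"

# Poll a live node once per second during a reproduction:
# while True:
#     with urllib.request.urlopen("http://localhost:52415/state") as r:
#         print(classify(json.load(r)))
#     time.sleep(1)
```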

Scenario A (the "almost works" case)

When Scenario B's clean-restart procedure is NOT followed, and instead a previous Qwen3.5-9B-8bit instance was placed and then deleted via DELETE /instance/{id}, a subsequent place_instance for the 30B sometimes produces:

tasks:
  CreateRunner   instanceId=bd631bea... runnerId=-        taskStatus=Complete
  LoadModel      instanceId=bd631bea... runnerId=-        taskStatus=Complete
  StartWarmup    instanceId=bd631bea... runnerId=-        taskStatus=Complete
  Shutdown       instanceId=bd631bea... runnerId=79efe89b taskStatus=Running   ← stuck
runners:
  79efe89b: {"RunnerShuttingDown": {}}
instances: 0

The Shutdown task never completes. The instance never registers. /v1/chat/completions continues to return 404. No error message appears in log show --process EXO, in /events SSE stream, or in ~/.exo/event_log.

The same dense-model workflow on the same hardware:

tasks:
  CreateRunner   instanceId=9582933a... runnerId=56d13393 taskStatus=Complete
  LoadModel      instanceId=9582933a... taskStatus=Complete
  StartWarmup    instanceId=9582933a... taskStatus=Complete
instances: 1   ← registered, serves /v1/chat/completions in <10 s

Expected behavior

place_instance for Qwen3-30B-A3B-4bit should either:

  1. Produce CreateRunner → LoadModel → StartWarmup → [register instance] with the instance appearing in /state.instances and serving requests, or
  2. Produce an explicit failure event (with a descriptive error_message) in /events or log show so operators know what's wrong.

Actual behavior

Either zero tasks are generated (silent no-op) or the runner tears itself down after warmup (Shutdown task stuck Running, runner stuck RunnerShuttingDown) — with no surfaced error anywhere.

Ruled out

  • Memory (bumped iogpu.wired_limit_mb to 23552; 32K full-context still fits with headroom)
  • Missing architecture (Qwen3MoeForCausalLM present in model_cards.py)
  • Corrupt weights (16 GB verified, 1,351 tensors in safetensors.index.json, no .incomplete files)
  • Wrong sharding mode (tried both Pipeline and Tensor)
  • Stale event log (cleared, restart, same behavior)
  • Stuck previous instance (clean kill + event_log move-aside, same behavior)
  • Network issues (localhost single-node, no peers)
  • HF rate limit / missing files (all shards present, download complete)

Not yet verified

  • Whether the dashboard UI Launch button uses a different code path that succeeds
  • Whether the same bug affects other qwen3_moe models (e.g., Qwen3-Coder-30B-A3B-Instruct-4bit, Qwen3-30B-A3B-Thinking-2507-4bit)
  • Whether starting a runner process manually with the right arguments bypasses the master
  • Where exo writes Python stderr — none of log show --process EXO, ~/.exo/event_log, /events, or stdout in the launcher surfaced any error message from the python runner during any of the attempts

Diagnostic data available on request

  • Full content of /state JSON at each second during a Scenario A reproduction
  • Full /events SSE stream capture for both scenarios
  • /openapi.json from this exo build
  • log show --process EXO --last 5m (largely empty of relevant content)

Impact for users

Qwen3 MoE models are a significant size/quality sweet spot for single-M4 Pro setups (30B capability at 3B-active compute cost). If the MoE placement path is broken, these models are effectively unusable through the exo API despite showing up as capabilities: [text, thinking, thinking_toggle] in the catalog. Users have to guess they won't work based on the silent failure.

Suggested fixes regardless of root cause

  1. Surface task errors. When a runner unexpectedly transitions to Shutdown after warmup completes, emit an InstanceFailed event with a reason code.
  2. Fail place_instance loudly when the planner declines to generate tasks. Currently place_instance always returns "Command received" even for commands that will never execute.
  3. Mark known-broken architectures in the model card instead of showing them as fully capable. Or add a dry-run preflight endpoint (POST /place_instance?dry_run=true) that surfaces what would be attempted.
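To make fixes 2 and 3 concrete, here is a hypothetical sketch of a planner that returns explicit rejection reasons instead of silently generating zero tasks. All names here (PlacementPlan, plan_placement) are invented for illustration; exo's actual planner API is not visible from this report:

```python
from dataclasses import dataclass, field

@dataclass
class PlacementPlan:
    tasks: list = field(default_factory=list)
    rejections: list = field(default_factory=list)  # human-readable reasons

def plan_placement(card: dict, sharding: str,
                   free_gb: float, need_gb: float) -> PlacementPlan:
    """Return either a task list or the reasons placement was declined."""
    plan = PlacementPlan()
    if sharding == "Tensor" and not card.get("supports_tensor"):
        plan.rejections.append("architecture does not support tensor sharding")
    if need_gb > free_gb:
        plan.rejections.append(
            f"needs {need_gb:.1f} GB but only {free_gb:.1f} GB under the wired limit")
    if not plan.rejections:
        plan.tasks = ["CreateRunner", "LoadModel", "StartWarmup"]
    return plan
```

A dry_run=true variant of place_instance could return plan.rejections verbatim, turning Scenario B's silent no-op into an actionable 4xx.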

Labels: bug