Commit 4c7301d
feat(moe): add stacked-buffer fast path for SSD streaming (#35)
Introduces `MLX_MOE_STACKED` and `MLX_MOE_FUSE_GATEUP` fast paths to `SwitchGLU` for SSD-streamed MoE inference. Replaces multiple per-expert kernel dispatches with a single batched `gatherQuantizedMM` per projection, drastically reducing CPU→GPU enqueue overhead on Apple Silicon.
- Defaults to legacy behavior unless env flags are set
- Automatically and safely falls back if the layer is ineligible (e.g. non-quantized weights, or batch size > 32)
- Added unit tests to ensure fallback safety

1 parent 40d6b67 commit 4c7301d
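
The opt-in and fallback rules listed in the commit message can be sketched in a few lines of Swift. This is a minimal illustration, not the code from this commit. The names `FastPathChoice`, `flagEnabled`, and `selectFastPath` are hypothetical. It also assumes that a flag counts as set when its value is exactly `"1"`, and that the two flags are evaluated independently. Only the flag names, the quantized-weights requirement, and the batch-size limit of 32 come from the message above.

```swift
import Foundation

/// Hypothetical summary of which fast paths a layer would use.
struct FastPathChoice: Equatable {
    var useStacked: Bool
    var fuseGateUp: Bool
}

/// Assumption: a flag is "set" only when its value is exactly "1".
func flagEnabled(_ name: String, in env: [String: String]) -> Bool {
    env[name] == "1"
}

/// Chooses between legacy behavior and the fast paths.
/// - env: environment variables (defaults to the process environment).
/// - isQuantized: whether the layer's expert weights are quantized.
/// - batchSize: batch size of the current call.
/// - maxBatchSize: eligibility limit; 32 is taken from the commit message.
func selectFastPath(
    env: [String: String] = ProcessInfo.processInfo.environment,
    isQuantized: Bool,
    batchSize: Int,
    maxBatchSize: Int = 32
) -> FastPathChoice {
    let legacy = FastPathChoice(useStacked: false, fuseGateUp: false)

    // Ineligible layers always fall back to the legacy path,
    // regardless of which flags are set.
    guard isQuantized, batchSize <= maxBatchSize else { return legacy }

    // With no flags set, both fields are false, which is the legacy default.
    return FastPathChoice(
        useStacked: flagEnabled("MLX_MOE_STACKED", in: env),
        fuseGateUp: flagEnabled("MLX_MOE_FUSE_GATEUP", in: env)
    )
}

// Usage: an eligible layer with only the stacked flag set.
let choice = selectFastPath(
    env: ["MLX_MOE_STACKED": "1"],
    isQuantized: true,
    batchSize: 8
)
print(choice)
```

Passing the environment as a dictionary keeps the decision a pure function, so fallback behavior can be unit tested without mutating process state.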
4 files changed
Lines changed: 553 additions & 0 deletions
File tree
- .github/workflows
- Libraries/MLXLMCommon
- Tests/MLXLMTests