fix(perf): keep gpt-oss decode in bf16 #17

Open: inureyes wants to merge 1 commit into main from fix/gpt-oss-decode-bf16

@inureyes
Member

Summary

GptOss single-token decode was promoting activations to FP32 inside the expert MLP and router, breaking the BF16 fast path and causing a 5–6× throughput regression versus mlx-lm.

  • SwiGLU: mirrors the mlx-lm activation path with a compiled activation-only helper and casts the result back to the input dtype, preventing FP32 promotion through the expert down projection on single-token decode.
  • MoE router: now uses precise softmax and casts expert scores/results back to the expert/input dtype so residual state remains BF16 across all layers.
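The promotion problem in the bullets above can be illustrated with a toy model. This is a minimal sketch, not the mlxcel or MLX API: `DType`, `Tensor`, `swiglu_f32`, `swiglu_preserving`, and `residual_add` are hypothetical names, BF16 values are simulated by rounding `f32` storage, and the clamped SwiGLU form with `ALPHA` and `LIMIT` is an assumption about the mlx-lm activation path rather than something stated in this PR.

```rust
// Toy dtype model for illustration only; not the mlxcel or MLX API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DType {
    Bf16,
    F32,
}

#[derive(Clone, Debug)]
struct Tensor {
    dtype: DType,
    data: Vec<f32>, // bf16 values are simulated as f32 with the low 16 bits cleared
}

/// Round an f32 to the nearest bf16-representable value (round to nearest even).
fn round_to_bf16(x: f32) -> f32 {
    if x.is_nan() {
        return x;
    }
    let bits = x.to_bits();
    let bias = 0x7fff + ((bits >> 16) & 1);
    f32::from_bits(bits.wrapping_add(bias) & 0xffff_0000)
}

impl Tensor {
    fn new(dtype: DType, data: Vec<f32>) -> Self {
        Tensor { dtype, data }.astype(dtype)
    }

    /// Cast to `dtype`, rounding the stored values when the target is bf16.
    fn astype(&self, dtype: DType) -> Tensor {
        let data: Vec<f32> = match dtype {
            DType::Bf16 => self.data.iter().map(|&v| round_to_bf16(v)).collect(),
            DType::F32 => self.data.clone(),
        };
        Tensor { dtype, data }
    }
}

/// Binary ops take the wider dtype, so one f32 operand promotes the result.
fn promote(a: DType, b: DType) -> DType {
    if a == DType::F32 || b == DType::F32 {
        DType::F32
    } else {
        DType::Bf16
    }
}

fn residual_add(a: &Tensor, b: &Tensor) -> Tensor {
    let dtype = promote(a.dtype, b.dtype);
    let data: Vec<f32> = a.data.iter().zip(&b.data).map(|(&x, &y)| x + y).collect();
    Tensor { dtype, data }.astype(dtype)
}

const ALPHA: f32 = 1.702; // assumed constant, not taken from this PR
const LIMIT: f32 = 7.0; // assumed clamp limit, not taken from this PR

/// Activation computed in f32; the result stays tagged as f32 (the promoted case).
fn swiglu_f32(x_linear: &Tensor, x_glu: &Tensor) -> Tensor {
    let data: Vec<f32> = x_glu
        .data
        .iter()
        .zip(&x_linear.data)
        .map(|(&g, &l)| {
            let g = g.min(LIMIT);
            let l = l.clamp(-LIMIT, LIMIT);
            let sig = 1.0 / (1.0 + (-ALPHA * g).exp());
            g * sig * (l + 1.0)
        })
        .collect();
    Tensor { dtype: DType::F32, data }
}

/// Same math, but the result is cast back to the input dtype.
fn swiglu_preserving(x_linear: &Tensor, x_glu: &Tensor) -> Tensor {
    swiglu_f32(x_linear, x_glu).astype(x_glu.dtype)
}
```

In this model, adding the output of `swiglu_f32` to a BF16 residual yields an FP32 tensor, and every later layer inherits that dtype. Adding the output of `swiglu_preserving` keeps the residual in BF16, which is the behavior the fix aims for.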

Impact

On M5 Max, gpt-oss-120b-4bit decode throughput:

| Build | tok/s |
|---|---|
| Before | 19.49 |
| After | 112.83 |
| mlx-lm baseline | 110.35 |

mlxcel is now about 5.8× faster than before this change and about 2% faster than the mlx-lm Python baseline on this workload.

Files Touched

  • src/lib/mlxcel-core/cpp/mlx_cxx_bridge.{cpp,h} — new compiled activation helper
  • src/lib/mlxcel-core/src/lib.rs + ffi_tests.rs — Rust binding + FFI test
  • src/models/gpt_oss.rs — SwiGLU dtype preservation, router precise softmax + dtype casts
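The router change listed for `gpt_oss.rs` can be sketched in the same spirit. This is a hedged illustration, not the code in this PR: `top_k`, `precise_softmax_bf16`, and `route` are hypothetical names, "precise softmax" is assumed to mean computing the softmax in FP32 before casting the scores back down, the top-k-then-softmax ordering is an assumption about the router, and BF16 is again simulated by rounding `f32` values.

```rust
// Toy router for illustration only; not the mlxcel or MLX API.

/// Round an f32 to the nearest bf16-representable value (round to nearest even).
fn round_to_bf16(x: f32) -> f32 {
    if x.is_nan() {
        return x;
    }
    let bits = x.to_bits();
    let bias = 0x7fff + ((bits >> 16) & 1);
    f32::from_bits(bits.wrapping_add(bias) & 0xffff_0000)
}

/// Indices of the k largest logits, highest first (ties keep index order).
fn top_k(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);
    idx
}

/// Softmax computed in f32 with max subtraction, then cast back to bf16.
fn precise_softmax_bf16(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| round_to_bf16(e / sum)).collect()
}

/// Weighted sum of the selected experts' outputs, cast back to bf16 at the end
/// so the value added to the residual stream is not left in f32.
fn route(logits: &[f32], expert_out: &[Vec<f32>], k: usize) -> Vec<f32> {
    let picked = top_k(logits, k);
    let top_logits: Vec<f32> = picked.iter().map(|&i| logits[i]).collect();
    let weights = precise_softmax_bf16(&top_logits);
    let dim = expert_out[0].len();
    let mut out = vec![0.0f32; dim];
    for (&e, &w) in picked.iter().zip(&weights) {
        for (o, &v) in out.iter_mut().zip(&expert_out[e]) {
            *o += w * v;
        }
    }
    out.iter().map(|&v| round_to_bf16(v)).collect()
}
```

The design point is that the FP32 work stays inside the softmax, while both the expert scores and the combined result leave the router in the narrower dtype.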

Test plan

  • cargo test -p mlxcel-core (FFI activation helper)
  • gpt-oss-120b-4bit decode benchmark on M5 Max — 112.83 tok/s (was 19.49)
  • Spot check other models still load and decode (smoke: qwen3-0.6b, llama3.1-8b)

@inureyes added the labels type:bug, type:performance, area:models, area:core, status:review, and priority:high on May 18, 2026