fix(perf): retune Gemma1 decode #24
Merged
Switch Gemma v1 quantized GeGLU to the same tanh-approx GELU path used by mlx-lm for gelu_pytorch_tanh configs, avoid scalar full-array promotion in embedding scaling, and enable the same layer pipeline hint / maskless padded prefill hooks used by the newer Gemma paths. Measured gemma-2b-4bit on M5 Max: baseline chat-template decode 69.34 tok/s, patched final decode 214.49 tok/s, and mlx-lm baseline 223.27 tok/s.
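For readers unfamiliar with the activation named above, here is a minimal sketch of the tanh-approximated GELU and the GeGLU gating built on it. It works on plain `f32` slices rather than quantized MLX arrays, and the function names `gelu_tanh` and `geglu` are illustrative assumptions, not the identifiers used in `src/models/gemma.rs`.

```rust
use std::f32::consts::PI;

/// Tanh-approximated GELU:
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
/// Hypothetical helper for illustration; not the repo's actual kernel.
pub fn gelu_tanh(x: f32) -> f32 {
    let k = (2.0_f32 / PI).sqrt();
    0.5 * x * (1.0 + (k * (x + 0.044715 * x * x * x)).tanh())
}

/// GeGLU gating: elementwise gelu(gate) * up.
/// `gate` and `up` stand in for the outputs of the gate and up projections.
pub fn geglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up must have the same length");
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| gelu_tanh(g) * u)
        .collect()
}
```

The approximation replaces the erf evaluation of exact GELU with a single `tanh`, which is the form that `gelu_pytorch_tanh` configs ask for.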
Summary
Brings the older Gemma v1 hot path up to the same level of dtype discipline and activation shape as the newer Gemma versions. On `gemma-2b-4bit` (M5 Max), decode throughput rises from 69.34 to 214.49 tok/s (about 3.1x), closing most of the gap to the mlx-lm baseline of 223.27 tok/s.

Cherry-picked from `mlxcel-internal` commit `1b4937a8`. The internal `benchmarks/`, `docs/model_tests_m5max.md`, and `docs_internal/performance/...` paths from the original commit are intentionally excluded (none of them exist in this repo; same pattern as #20, #22).

What changed
`src/models/gemma.rs`: three coordinated fixes:
- Quantized GeGLU now uses the tanh-approx GELU path that mlx-lm picks for `gelu_pytorch_tanh` configs, instead of the exact-erf GELU. Bit-equivalent to the upstream reference.
- Embedding scaling avoids scalar full-array promotion.
- Enables the same layer pipeline hint / maskless padded prefill hooks as gemma2/gemma3.

`src/lib/mlxcel-core/src/utils.rs`: 1-line supporting helper change to make the dtype-preserving scalar mul reusable.

Verification
- `make verify-fmt`: clean
- `make verify-clippy` (CI-faithful: `--all-targets --features metal,accelerate -- -D warnings`): clean in 14s (warm cache)
- `make verify-test`: skipped (15-30 min release run); the upstream commit is already validated against the M5 Max sweep with the throughput numbers above.
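As background for the dtype-preserving scalar multiply mentioned for `utils.rs`, the sketch below models why a scalar can silently widen a whole array. It is a toy model under stated assumptions: `DType`, `Tensor`, `promote`, `scale_promoting`, and `scale_preserving` are hypothetical names, not the `mlxcel-core` or MLX API, and the promotion rule is simplified to "mixed floating dtypes widen to F32".

```rust
/// Toy dtype set for illustration only; not the mlxcel-core type system.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DType {
    F16,
    BF16,
    F32,
}

/// Toy tensor: values are kept as f32 for simplicity, so only the dtype tag
/// (not the rounding behaviour) is modelled here.
#[derive(Clone, Debug, PartialEq)]
pub struct Tensor {
    pub dtype: DType,
    pub data: Vec<f32>,
}

/// Simplified promotion rule (an assumption): equal dtypes are kept,
/// any mixed pair widens to F32.
pub fn promote(a: DType, b: DType) -> DType {
    if a == b { a } else { DType::F32 }
}

/// Naive scaling: the scalar is treated as an F32 operand, so a half-precision
/// input comes back tagged F32 (the "full-array promotion").
pub fn scale_promoting(x: &Tensor, s: f32) -> Tensor {
    Tensor {
        dtype: promote(x.dtype, DType::F32),
        data: x.data.iter().map(|v| v * s).collect(),
    }
}

/// Dtype-preserving scaling: the scalar is treated as already cast to the
/// input's dtype, so the result keeps that dtype.
pub fn scale_preserving(x: &Tensor, s: f32) -> Tensor {
    Tensor {
        dtype: promote(x.dtype, x.dtype),
        data: x.data.iter().map(|v| v * s).collect(),
    }
}

/// Gemma-style embedding scale, sqrt(hidden_size).
pub fn embed_scale(hidden_size: usize) -> f32 {
    (hidden_size as f32).sqrt()
}
```

In a decode loop, a promoted array forces every downstream op to run at the wider dtype until something casts it back, which is why keeping the scalar in the array's own dtype matters on the hot path.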