Fuse dispatch for GEMV #717
Merged
c82d2bc to 72f7a99
Collaborator
I think all the .py changes should be reverted and submitted as a separate PR. The kernel change looks good, @ncylich.
Replace the legacy IL GEMV's three-pass dispatch (pool-parallel Hadamard + cv wait, serial int8 quantize, static parallel_ranges + cv wait) with a single fused dispatch: group-stolen phase A behind a spin barrier, dynamic 16-block-chunk stealing in phase B, main thread as worker 0 with a spin-join, thread budget ceil(chunks / CACTUS_GEMV_SB_PER_THREAD). The inner micro-kernel is unchanged and factored into cactus_quant_interleaved4_gemv_blocks.

M4 kernel: kv_proj 44.7 -> 158.2 GF, o_proj 59.2 -> 188.4 GF.
E2E decode on production bundles: gemma-4-e2b-it +19%, qwen3-1.7b +42%, lfm2-350m +26% (M4); Samsung 8-Elite +6%/+18%/+53%; Pixel Tensor G4 parity (DRAM roofline) with lfm2 +12-16%.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Collaborator, Author
@kar-m you're right, moved them into the upcoming SME PR that I will make next.
A single kernel change (2 files, no format or SME2 code, no Python): the legacy interleaved CQ4 GEMV's dispatch is fused into one pool pass. Every production bundle that exists today benefits as-is. #709 stacks on this.
What changes
cactus_quant_4bit_gemv_interleaved previously paid three passes per call: a pool-parallel Hadamard transform with a condition-variable join, a serial per-group int8 quantize on the main thread, then a second static-partition dispatch with another cv join. It now runs one fused dispatch via a shared cactus_quant_two_phase_run driver: phase A is group-stolen behind a spin barrier, phase B steals dynamic 16-block chunks, the main thread runs as worker 0 with a spin-join, and the thread budget is ceil(chunks / CACTUS_GEMV_SB_PER_THREAD) (default 8).
The 7-op inner micro-kernel (factored into cactus_quant_interleaved4_gemv_blocks) and all numerics are unchanged. CQ4 INTERLEAVED_4ROW only, the production legacy format.
Tests
IL fixture (exact inverse of the shipped decoder) vs the FP32 oracle through real dispatch, at N=192 (serial path) and N=4164 (1041 blocks → 66 chunks incl. a 1-block tail: multi-thread fused worker, phase-A stealing, 4-chunk grabs).
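The fused dispatch described under "What changes", together with the chunk arithmetic of the N=4164 test case, can be illustrated with a minimal sketch. This is not the code in this PR: `plan_dispatch`, `two_phase_run`, `Plan`, the `max_threads` cap and the constant names are hypothetical. The sketch assumes N counts output rows with 4 rows per block (inferred from 4164 / 4 = 1041), steals one chunk at a time rather than in multi-chunk grabs, and uses `std::thread` with a plain join in place of the pool and spin-join.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical constants for illustration only.
constexpr std::size_t kRowsPerBlock    = 4;   // assumed: INTERLEAVED_4ROW packs 4 rows per block
constexpr std::size_t kBlocksPerChunk  = 16;  // phase-B stealing granularity
constexpr std::size_t kChunksPerThread = 8;   // stands in for CACTUS_GEMV_SB_PER_THREAD

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

struct Plan {
    std::size_t blocks, chunks, last_chunk_blocks, threads;
};

// Derive block count, chunk count and thread budget from the output row count.
Plan plan_dispatch(std::size_t n_rows, std::size_t max_threads) {
    assert(n_rows > 0 && max_threads > 0);
    Plan p{};
    p.blocks = ceil_div(n_rows, kRowsPerBlock);
    p.chunks = ceil_div(p.blocks, kBlocksPerChunk);
    p.last_chunk_blocks = p.blocks - (p.chunks - 1) * kBlocksPerChunk;
    p.threads = std::min(ceil_div(p.chunks, kChunksPerThread), max_threads);
    return p;
}

// One dispatch: every worker steals phase-A groups, waits at a spin barrier,
// then steals phase-B chunks. The calling thread acts as worker 0.
template <class PhaseA, class PhaseB>
void two_phase_run(const Plan& p, std::size_t n_groups, PhaseA phase_a, PhaseB phase_b) {
    std::atomic<std::size_t> next_group{0}, groups_done{0}, next_chunk{0};

    auto worker = [&] {
        // Phase A: claim groups by atomic increment until none remain.
        for (std::size_t g; (g = next_group.fetch_add(1)) < n_groups;) {
            phase_a(g);
            groups_done.fetch_add(1);
        }
        // Spin barrier: phase B may read anything phase A produced.
        while (groups_done.load() < n_groups) std::this_thread::yield();
        // Phase B: claim 16-block chunks dynamically; the last chunk may be short.
        for (std::size_t c; (c = next_chunk.fetch_add(1)) < p.chunks;) {
            const std::size_t first = c * kBlocksPerChunk;
            phase_b(first, std::min(kBlocksPerChunk, p.blocks - first));
        }
    };

    std::vector<std::thread> helpers;
    for (std::size_t t = 1; t < p.threads; ++t) helpers.emplace_back(worker);
    worker();  // main thread participates instead of only waiting
    for (auto& h : helpers) h.join();
}
```

Under these assumptions, N=4164 gives 1041 blocks, 66 chunks with a 1-block last chunk, and a budget of 9 threads (before the hypothetical `max_threads` cap), while N=192 gives 3 chunks and a single thread, consistent with the serial path named in the test description.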
Measured
M4 Pro kernel level (Gemma shapes, alternating best-of-5, GF):
E2E vs a main-baseline engine on the same production legacy bundles (M4: ~1k-token prompt + 32 decode, alternating cycles, cold cycle dropped; re-verified on the final tree):
On-device (#698 harness, 512+32 spec, 3 alternating rounds, devices powered, Thermal Status 0):
Dispatch overhead is a fixed per-call cost, so the win scales with call rate: largest for small models on fast cores, parity (never a regression) where slow cores already sit at the DRAM roofline. The prefill gains come from the prompt tail that chunked prefill processes per-token at decode rate.