Fuse dispatch for GEMV #717

Merged: jakmro merged 1 commit into main from gemv-dispatch on Jun 12, 2026

ncylich commented Jun 11, 2026

Collaborator

A single kernel change (2 files, no format or SME2 code, no Python): the legacy interleaved CQ4 GEMV's dispatch is fused into one pool pass. Every production bundle that exists today benefits as-is. #709 stacks on this.

What changes

cactus_quant_4bit_gemv_interleaved previously paid three passes per call: a pool-parallel Hadamard transform with a condition-variable join, a serial per-group int8 quantize on the main thread, then a second static-partition dispatch with another cv join. It now runs one fused dispatch via a shared cactus_quant_two_phase_run driver:

  • phase A (per-group Hadamard + int8 quantize) stolen by workers behind a spin barrier
  • phase B dynamically steals 16-block (64-channel) chunks
  • the main thread participates as worker 0 and spin-joins (a cv sleep costs ~5-10us per call, material at decode rates)
  • thread budget ceil(chunks / CACTUS_GEMV_SB_PER_THREAD) (default 8)
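
The bullets above can be sketched as a small driver. This is a minimal illustration, not the code in the PR: `two_phase_run`, `thread_budget`, and `kChunksPerThread` are hypothetical names, plain `std::thread` workers stand in for the engine's persistent pool, and clamping the budget to a `max_threads` cap is an assumption.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for CACTUS_GEMV_SB_PER_THREAD (default 8 per the PR text).
constexpr std::size_t kChunksPerThread = 8;

// Thread budget: ceil(chunks / kChunksPerThread), at least 1.
// Clamping to max_threads is an assumption made for this sketch.
inline std::size_t thread_budget(std::size_t chunks, std::size_t max_threads) {
    std::size_t want = (chunks + kChunksPerThread - 1) / kChunksPerThread;
    if (want < 1) want = 1;
    return want < max_threads ? want : max_threads;
}

// Runs phase_a(g) for every group, then phase_b(c) for every chunk,
// inside a single dispatch. No phase_b call starts on a worker before
// every phase_a call has finished.
inline void two_phase_run(std::size_t groups, std::size_t chunks, std::size_t threads,
                          const std::function<void(std::size_t)>& phase_a,
                          const std::function<void(std::size_t)>& phase_b) {
    std::atomic<std::size_t> next_group{0}, groups_done{0};
    std::atomic<std::size_t> next_chunk{0}, workers_done{0};

    auto worker = [&] {
        // Phase A: steal groups until the counter runs past the end.
        while (true) {
            const std::size_t g = next_group.fetch_add(1, std::memory_order_relaxed);
            if (g >= groups) break;
            phase_a(g);
            groups_done.fetch_add(1, std::memory_order_release);
        }
        // Spin barrier: phase B reads everything phase A wrote.
        while (groups_done.load(std::memory_order_acquire) < groups)
            std::this_thread::yield();
        // Phase B: dynamically steal chunks.
        while (true) {
            const std::size_t c = next_chunk.fetch_add(1, std::memory_order_relaxed);
            if (c >= chunks) break;
            phase_b(c);
        }
        workers_done.fetch_add(1, std::memory_order_release);
    };

    std::vector<std::thread> pool;
    for (std::size_t i = 1; i < threads; ++i) pool.emplace_back(worker);
    worker();  // the calling thread participates as worker 0
    // Spin-join instead of sleeping on a condition variable.
    while (workers_done.load(std::memory_order_acquire) < threads)
        std::this_thread::yield();
    for (auto& t : pool) t.join();  // only needed because this sketch spawns threads
}
```

With 66 chunks and the default of 8 chunks per thread, `thread_budget(66, 16)` gives 9 workers in this sketch. A real implementation would keep its workers alive between calls; spawning threads per call is done here only to keep the example self-contained.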

The 7-op inner micro-kernel (factored into cactus_quant_interleaved4_gemv_blocks) and all numerics are unchanged. CQ4 INTERLEAVED_4ROW only — the production legacy format.

Tests

IL fixture (exact inverse of the shipped decoder) vs the FP32 oracle through real dispatch, at N=192 (serial path) and N=4164 (1041 blocks → 66 chunks incl. a 1-block tail: multi-thread fused worker, phase-A stealing, 4-chunk grabs).
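
The block and chunk counts quoted for the two test sizes follow from simple arithmetic. The sketch below assumes 4 output rows per block and 16 blocks per chunk, as the description states; `ChunkPlan` and `plan_chunks` are hypothetical names used only for illustration.

```cpp
#include <cstddef>

struct ChunkPlan {
    std::size_t blocks;       // 4-row blocks covering N output channels
    std::size_t chunks;       // 16-block chunks, the last one possibly partial
    std::size_t tail_blocks;  // blocks in the partial last chunk (0 if it is full)
};

// Assumes 4 rows per block and 16 blocks (64 channels) per chunk.
constexpr ChunkPlan plan_chunks(std::size_t n) {
    constexpr std::size_t rows_per_block = 4;
    constexpr std::size_t blocks_per_chunk = 16;
    const std::size_t blocks = (n + rows_per_block - 1) / rows_per_block;
    const std::size_t chunks = (blocks + blocks_per_chunk - 1) / blocks_per_chunk;
    const std::size_t tail = blocks % blocks_per_chunk;
    return ChunkPlan{blocks, chunks, tail};
}
```

For N=4164 this yields 1041 blocks, 66 chunks, and a 1-block tail, matching the figures above. For N=192 it yields 48 blocks in 3 full chunks, which under the default of 8 chunks per thread would need only one thread, consistent with that case exercising the serial path.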

Measured

M4 Pro kernel level (Gemma shapes, alternating best-of-5, GF):

| shape | old dispatch | fused dispatch |
| --- | --- | --- |
| kv_proj 1×1536×512 | 44.7 | 158.2 (3.5×) |
| o_proj 1×2048×1536 | 59.2 | 188.4 (3.2×) |
| down 1×6144×1536 | 153.8 | 268.1 |
| gate_up 1×1536×12288 | 434.6 | 671.0 |
| lm_head 1×1536×262144 | 727.4 | 901.7 |
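
For readers who want to reproduce numbers of this kind, an alternating best-of-N measurement could look roughly like the sketch below. It is not the harness used for the table: `alternating_best_of` and `gflops` are hypothetical helpers, and reading "GF" as GFLOP/s with 2·M·K·N floating-point operations per call for an M×K×N shape is an assumption.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>
#include <utility>

// GFLOP/s for an M×K×N product, counting one multiply and one add per term.
inline double gflops(double m, double k, double n, double seconds) {
    return 2.0 * m * k * n / seconds / 1e9;
}

// Times a and b in alternation for `rounds` rounds and returns the best
// (smallest) wall-clock time in seconds for each. Alternating the two
// variants exposes both to similar thermal and frequency conditions.
inline std::pair<double, double> alternating_best_of(int rounds,
                                                     const std::function<void()>& a,
                                                     const std::function<void()>& b) {
    using clock = std::chrono::steady_clock;
    auto time_once = [](const std::function<void()>& f) {
        const auto t0 = clock::now();
        f();
        return std::chrono::duration<double>(clock::now() - t0).count();
    };
    double best_a = std::numeric_limits<double>::infinity();
    double best_b = std::numeric_limits<double>::infinity();
    for (int r = 0; r < rounds; ++r) {
        best_a = std::min(best_a, time_once(a));
        best_b = std::min(best_b, time_once(b));
    }
    return {best_a, best_b};
}
```

A caller would pass the old and fused kernels as `a` and `b` with `rounds = 5`, then convert each best time with `gflops`.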

E2E vs a main-baseline engine on the same production legacy bundles (M4: ~1k-token prompt + 32 decode, alternating cycles, cold cycle dropped; re-verified on the final tree):

| model | prefill tok/s | decode tok/s |
| --- | --- | --- |
| gemma-4-e2b-it | 210 → 233 (+11%) | 39.7 → 46.8 (+18%) |
| qwen3-1.7b | 152 → 177 (+17%) | 34.7 → 47.9 (+38%) |
| lfm2-350m | 131 → 170 (+30%) | 129 → 163 (+26%) |

On-device (#698 harness, 512+32 spec, 3 alternating rounds, devices powered, Thermal Status 0):

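| device | model | prefill | decode |
| --- | --- | --- | --- |
| Samsung SM-S942U1 | gemma-4-e2b-it | 312 → 304 (par) | 24.7 → 26.2 (+6%) |
| Samsung SM-S942U1 | qwen3-1.7b | 98 → 99 (par) | 19.5 → 23.1 (+18%) |
| Samsung SM-S942U1 | lfm2-350m | 70 → 105 (+51%) | 66.4 → 101.7 (+53%) |
| Pixel 10a (Tensor G4) | gemma-4-e2b-it | 86 → 84 (par) | 10.5 → 10.5 (par) |
| Pixel 10a (Tensor G4) | qwen3-1.7b | 34 → 34 (par) | 11.2 → 10.9 (par) |
| Pixel 10a (Tensor G4) | lfm2-350m | 38 → 44 (+16%) | 36.3 → 40.8 (+12%) |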

Dispatch overhead is a fixed per-call cost, so the win scales with call rate: largest for small models on fast cores, parity (never a regression) where slow cores already sit at the DRAM roofline. The prefill gains come from the prompt tail that chunked prefill processes per-token at decode rate.
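
One way to read the tables through this fixed-cost lens is to convert throughput into time per token, since a fixed per-call saving shows up as a time difference rather than a ratio. The helper below is a hypothetical illustration of that arithmetic, not something from the PR.

```cpp
// Milliseconds saved per decoded token when decode throughput rises
// from old_tps to new_tps (both in tokens per second).
constexpr double saved_ms_per_token(double old_tps, double new_tps) {
    return 1000.0 / old_tps - 1000.0 / new_tps;
}
```

For example, qwen3-1.7b decode on the M4 going from 34.7 to 47.9 tok/s corresponds to roughly 7.9 ms saved per token. The same absolute saving is a larger fraction of the total when the baseline time per token is short, which matches the pattern of bigger relative wins for small models on fast cores.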

kar-m commented Jun 11, 2026

Collaborator

I think all the .py changes should be reverted and submitted as a separate PR. The kernel change looks good @ncylich

Replace the legacy IL GEMV's three-pass dispatch (pool-parallel Hadamard +
cv wait, serial int8 quantize, static parallel_ranges + cv wait) with a
single fused dispatch: group-stolen phase A behind a spin barrier, dynamic
16-block-chunk stealing in phase B, main thread as worker 0 with a
spin-join, thread budget ceil(chunks / CACTUS_GEMV_SB_PER_THREAD). The
inner micro-kernel is unchanged and factored into
cactus_quant_interleaved4_gemv_blocks.

M4 kernel: kv_proj 44.7 -> 158.2 GF, o_proj 59.2 -> 188.4 GF. E2E decode
on production bundles: gemma-4-e2b-it +19%, qwen3-1.7b +42%, lfm2-350m
+26% (M4); Samsung 8-Elite +6%/+18%/+53%; Pixel Tensor G4 parity (DRAM
roofline) with lfm2 +12-16%.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich changed the title from "Fused GEMV dispatch for legacy CQ4 bundles" to "Fuse dispatch for the legacy interleaved CQ4 GEMV" Jun 11, 2026
ncylich commented Jun 11, 2026

Collaborator Author

@kar-m you're right, moved them into the upcoming SME PR that I will make next

ncylich changed the title from "Fuse dispatch for the legacy interleaved CQ4 GEMV" to "Fuse dispatch for GEMV" Jun 12, 2026
jakmro merged commit 4dbc03d into main Jun 12, 2026
6 checks passed