Fuse dispatch for GEMV by ncylich · Pull Request #717 · cactus-compute/cactus

ncylich · 2026-06-11T09:21:54Z

A single kernel change (2 files, no format or SME2 code, no Python): the legacy interleaved CQ4 GEMV's dispatch is fused into one pool pass. Every production bundle that exists today benefits as-is. #709 stacks on this.

What changes

cactus_quant_4bit_gemv_interleaved previously paid three passes per call: a pool-parallel Hadamard transform with a condition-variable join, a serial per-group int8 quantize on the main thread, then a second static-partition dispatch with another cv join. It now runs one fused dispatch via a shared cactus_quant_two_phase_run driver:

phase A (per-group Hadamard + int8 quantize): workers steal groups, then wait at a spin barrier
phase B: workers dynamically steal 16-block (64-channel) chunks
the main thread participates as worker 0 and spin-joins (a cv sleep costs ~5-10 µs per call, which is material at decode rates)
thread budget: ceil(chunks / CACTUS_GEMV_SB_PER_THREAD), with CACTUS_GEMV_SB_PER_THREAD defaulting to 8
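
The control flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `two_phase_run`, its signature, and the callback shapes are hypothetical stand-ins for `cactus_quant_two_phase_run`, and the sketch spawns `std::thread`s per call purely to stay self-contained, whereas a production driver would presumably reuse a persistent pool.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical two-phase driver (names and signature are assumptions).
// phase_a(g) processes one activation group; phase_b(begin, end) processes
// the half-open block range [begin, end).
inline void two_phase_run(std::size_t num_groups, std::size_t num_blocks,
                          std::size_t blocks_per_chunk, std::size_t num_threads,
                          const std::function<void(std::size_t)>& phase_a,
                          const std::function<void(std::size_t, std::size_t)>& phase_b) {
    num_threads = std::max<std::size_t>(1, num_threads);
    const std::size_t num_chunks =
        (num_blocks + blocks_per_chunk - 1) / blocks_per_chunk;
    std::atomic<std::size_t> next_group{0}, groups_done{0};
    std::atomic<std::size_t> next_chunk{0}, workers_done{0};

    auto worker = [&] {
        // Phase A: steal groups one at a time until none remain.
        while (true) {
            const std::size_t g = next_group.fetch_add(1, std::memory_order_relaxed);
            if (g >= num_groups) break;
            phase_a(g);
            groups_done.fetch_add(1, std::memory_order_release);
        }
        // Spin barrier: phase B must observe every phase-A result.
        while (groups_done.load(std::memory_order_acquire) < num_groups)
            std::this_thread::yield();
        // Phase B: dynamically steal fixed-size chunks of blocks.
        while (true) {
            const std::size_t c = next_chunk.fetch_add(1, std::memory_order_relaxed);
            if (c >= num_chunks) break;
            const std::size_t begin = c * blocks_per_chunk;
            const std::size_t end = std::min(begin + blocks_per_chunk, num_blocks);
            phase_b(begin, end);
        }
        workers_done.fetch_add(1, std::memory_order_release);
    };

    std::vector<std::thread> helpers;
    for (std::size_t i = 1; i < num_threads; ++i) helpers.emplace_back(worker);
    worker();  // the calling thread participates as worker 0
    // Spin-join instead of sleeping on a condition variable.
    while (workers_done.load(std::memory_order_acquire) < num_threads)
        std::this_thread::yield();
    for (auto& t : helpers) t.join();  // only needed because this sketch owns the threads
}
```

The point of the structure is that a single dispatch covers both phases: the only synchronization between them is the atomic counter that workers spin on, and the caller never sleeps on a condition variable.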

The 7-op inner micro-kernel (factored into cactus_quant_interleaved4_gemv_blocks) and all numerics are unchanged. CQ4 INTERLEAVED_4ROW only — the production legacy format.
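
For readers unfamiliar with what phase A does per group, here is a generic sketch of a Hadamard transform followed by int8 quantization. The PR does not show this code, so everything here is an assumption: the function names are hypothetical, the transform is a textbook in-place fast Walsh-Hadamard transform with orthonormal scaling, and the quantizer is a plain symmetric max-abs scheme. The shipped kernel's group size, scaling, and rounding may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: in-place fast Walsh-Hadamard transform.
// n must be a power of two; the 1/sqrt(n) factor makes it orthonormal.
inline void hadamard_inplace(float* x, std::size_t n) {
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j] = a + b;      // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float s = 1.0f / std::sqrt(static_cast<float>(n));
    for (std::size_t i = 0; i < n; ++i) x[i] *= s;
}

// Hypothetical helper: symmetric int8 quantization of one group.
// Returns the scale such that x[i] is approximately scale * q[i].
inline float quantize_group_int8(const float* x, std::size_t n, std::int8_t* q) {
    float amax = 0.0f;
    for (std::size_t i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const long r = std::lround(x[i] / scale);
        q[i] = static_cast<std::int8_t>(std::max(-127L, std::min(127L, r)));
    }
    return scale;
}
```

Because each group is transformed and quantized independently, groups are a natural unit for work stealing.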

Tests

An IL fixture (the exact inverse of the shipped decoder) is checked against the FP32 oracle through the real dispatch path at two sizes: N=192 (serial path) and N=4164 (1041 blocks → 66 chunks including a 1-block tail, which exercises the multi-thread fused worker, phase-A stealing, and 4-chunk grabs).
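
The block and chunk counts quoted for the two test sizes follow from simple ceiling divisions. The sketch below reproduces that arithmetic under stated assumptions: 4 output rows per block (from INTERLEAVED_4ROW), 16 blocks per chunk, and a budget of ceil(chunks / 8). `GemvPlan` and `plan_gemv` are hypothetical names, and any clamping of the budget to the pool size is not modeled.

```cpp
#include <cstddef>

// Hypothetical summary of how one GEMV call is partitioned.
struct GemvPlan {
    std::size_t blocks;       // 4-row blocks
    std::size_t chunks;       // 16-block chunks, the last one possibly partial
    std::size_t tail_blocks;  // blocks in a partial last chunk (0 if it is full)
    std::size_t threads;      // requested thread budget
};

inline std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Assumed constants: 4 rows per block, 16 blocks per chunk, and
// sb_per_thread standing in for CACTUS_GEMV_SB_PER_THREAD (default 8).
inline GemvPlan plan_gemv(std::size_t n, std::size_t sb_per_thread = 8) {
    GemvPlan p;
    p.blocks = ceil_div(n, 4);
    p.chunks = ceil_div(p.blocks, 16);
    p.tail_blocks = p.blocks % 16;
    p.threads = ceil_div(p.chunks, sb_per_thread);
    return p;
}
```

With these assumptions, N=192 gives 48 blocks, 3 chunks, and a budget of 1 thread (the serial path), while N=4164 gives 1041 blocks, 66 chunks with a 1-block tail, and a budget of 9 threads.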

Measured

M4 Pro, kernel level (Gemma shapes, alternating best-of-5, throughput in GFLOP/s):

shape	old dispatch	fused dispatch
kv_proj 1×1536×512	44.7	158.2 (3.5×)
o_proj 1×2048×1536	59.2	188.4 (3.2×)
down 1×6144×1536	153.8	268.1 (1.7×)
gate_up 1×1536×12288	434.6	671.0 (1.5×)
lm_head 1×1536×262144	727.4	901.7 (1.2×)

E2E vs a main-baseline engine on the same production legacy bundles (M4: ~1k-token prompt + 32 decode, alternating cycles, cold cycle dropped; re-verified on the final tree):

model	prefill tok/s	decode tok/s
gemma-4-e2b-it	210 → 233 (+11%)	39.7 → 46.8 (+18%)
qwen3-1.7b	152 → 177 (+17%)	34.7 → 47.9 (+38%)
lfm2-350m	131 → 170 (+30%)	129 → 163 (+26%)

On-device (#698 harness, 512+32 spec, 3 alternating rounds, devices powered, Thermal Status 0):

device	model	prefill	decode
Samsung SM-S942U1	gemma-4-e2b-it	312 → 304 (par)	24.7 → 26.2 (+6%)
	qwen3-1.7b	98 → 99 (par)	19.5 → 23.1 (+18%)
	lfm2-350m	70 → 105 (+51%)	66.4 → 101.7 (+53%)
Pixel 10a (Tensor G4)	gemma-4-e2b-it	86 → 84 (par)	10.5 → 10.5 (par)
	qwen3-1.7b	34 → 34 (par)	11.2 → 10.9 (par)
	lfm2-350m	38 → 44 (+16%)	36.3 → 40.8 (+12%)

Dispatch overhead is a fixed per-call cost, so the win scales with call rate: largest for small models on fast cores, parity (never a regression) where slow cores already sit at the DRAM roofline. The prefill gains come from the prompt tail that chunked prefill processes per-token at decode rate.

kar-m · 2026-06-11T20:24:05Z

I think all the .py changes should be reverted and submitted as a separate PR. The kernel change looks good @ncylich

Replace the legacy IL GEMV's three-pass dispatch (pool-parallel Hadamard + cv wait, serial int8 quantize, static parallel_ranges + cv wait) with a single fused dispatch: group-stolen phase A behind a spin barrier, dynamic 16-block-chunk stealing in phase B, main thread as worker 0 with a spin-join, thread budget ceil(chunks / CACTUS_GEMV_SB_PER_THREAD). The inner micro-kernel is unchanged and factored into cactus_quant_interleaved4_gemv_blocks.

M4 kernel: kv_proj 44.7 -> 158.2 GF, o_proj 59.2 -> 188.4 GF.

E2E decode on production bundles: gemma-4-e2b-it +19%, qwen3-1.7b +42%, lfm2-350m +26% (M4); Samsung 8-Elite +6%/+18%/+53%; Pixel Tensor G4 parity (DRAM roofline) with lfm2 +12-16%.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

ncylich · 2026-06-11T20:47:48Z

@kar-m you're right, moved them into the upcoming SME PR that I will make next

ncylich mentioned this pull request Jun 11, 2026

Packed-panel CQ4 weight format: NEON + SME2 kernels #709

Open

ncylich force-pushed the gemv-dispatch branch 2 times, most recently from c82d2bc to 72f7a99, on June 11, 2026 09:58

ncylich force-pushed the gemv-dispatch branch from 72f7a99 to 5c02052 on June 11, 2026 20:35

ncylich changed the title ~~Fused GEMV dispatch for legacy CQ4 bundles~~ Fuse dispatch for the legacy interleaved CQ4 GEMV Jun 11, 2026

ncylich mentioned this pull request Jun 11, 2026

Pad the chunked-prefill tail for sliding-window models #716

Merged

ncylich changed the title ~~Fuse dispatch for the legacy interleaved CQ4 GEMV~~ Fuse dispatch for GEMV Jun 12, 2026

jakmro merged commit 4dbc03d into main Jun 12, 2026
6 checks passed
