feat(Llama GQA): optional shared-KV matmul rewrite for attention (WIP, env-gated) #905
Draft
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
This PR adds an optional Llama GQA attention rewrite that avoids explicit KV expansion in the normal eager attention path. The optimization is gated by an environment variable and is OFF by default, so baseline behavior is unchanged unless enabled.
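For context, a minimal sketch of the general technique the PR describes: instead of materializing expanded K/V (e.g. via repeat_kv) to match the number of query heads, the query heads are folded into (kv_head, group) so the shared K/V broadcast across the group dimension inside the matmuls. This is not the PR's actual code; the function name, tensor shapes, and the absence of masking/RoPE/caching are illustrative assumptions.

```python
import math
import torch

def gqa_attention_shared_kv(q, k, v, num_kv_heads):
    # q: (batch, num_heads, q_len, head_dim)
    # k, v: (batch, num_kv_heads, kv_len, head_dim)
    bsz, num_heads, q_len, head_dim = q.shape
    group = num_heads // num_kv_heads

    # Fold query heads into (kv_head, group) so K/V broadcast over the group
    # dimension instead of being expanded group-many times in memory.
    q = q.view(bsz, num_kv_heads, group, q_len, head_dim)
    k = k.unsqueeze(2)  # (bsz, num_kv_heads, 1, kv_len, head_dim)
    v = v.unsqueeze(2)

    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)
    probs = torch.softmax(scores, dim=-1)
    out = torch.matmul(probs, v)  # (bsz, num_kv_heads, group, q_len, head_dim)
    return out.reshape(bsz, num_heads, q_len, head_dim)
```

The numerical result matches the expand-then-matmul formulation; only the layout of the intermediate tensors changes.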
How to enable for testing
export QEFF_LLAMA_GQA_SHARED_KV=1
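The flag name above is taken from the PR; how the code reads it is not shown here, so the helper below is only an assumed illustration of an env-gated path selection, not the PR's implementation.

```python
import os

def use_shared_kv_gqa() -> bool:
    # Hypothetical gate: the rewrite stays OFF unless the env var is set to "1".
    return os.environ.get("QEFF_LLAMA_GQA_SHARED_KV", "0") == "1"
```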