
Enable causal blocking #908

Open
kdulla wants to merge 32 commits into quic:main from kdulla:enable_causal_blocking


kdulla commented Apr 6, 2026

Summary

  • This PR introduces QEfficient attention blocking for causal language models. It adds Head, Batch, KV, and Q blocking strategies that preserve numerical outputs and scale to long sequence lengths.
  • This is a new PR, opened to avoid rebase issues.

Key Features

  • KV Blocking: Blocked compute over the KV cache along the sequence dimension.
  • Q Blocking: Blocked compute over the query sequence.
  • Head Blocking: Blocked compute over the number of heads, based on TS.
  • Batch Blocking: Blocked compute over the batch. It still needs compiler changes to work correctly, so it is not auto-computed by default and must be passed manually.
  • Currently configurable via qaic_config (explicit or auto). Blocking is disabled by default and can be enabled by passing "enable_blocking": True in qaic_config.
  • Adds a generic attention interface that can be called across different models, making it easier to extend attention methods in the future.
  • Supported Model Architectures for Blocking:
    • Llama3
    • Gpt-oss
    • Gemma
    • Gemma2
    • Granite
    • GraniteMOE
    • Mistral
    • Mixtral
    • MPT
    • Qwen2
    • Qwen3 (text only)
    • Qwen3MOE (text only)
    • Qwen3.5VL
    • Qwen3.5VL MOE
    • Starcoder
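
To illustrate how KV blocking can preserve numerical outputs, here is a minimal sketch in plain Python for a single query vector. It is not the code from this PR. The function names are hypothetical, and the running-max/running-sum (online softmax) accumulation is an assumption about how per-block results could be merged; the PR description does not say which scheme the kernels use.

```python
import math


def attention_full(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) . V for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim_v)]


def attention_kv_blocked(q, keys, values, num_kv_blocks):
    """Same result, but visiting the KV sequence one block at a time.

    Hypothetical helper: merges blocks with a running max and running sum
    so the full score vector is never materialized at once.
    """
    d = len(q)
    n = len(keys)
    block = math.ceil(n / num_kv_blocks)
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, n, block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Small check that blocked and unblocked results agree.
q = [0.1, -0.2, 0.3, 0.4]
keys = [[0.1 * i, 0.2, -0.1 * i, 0.05 * i] for i in range(7)]
values = [[float(i), float(-i)] for i in range(7)]
ref = attention_full(q, keys, values)
out = attention_kv_blocked(q, keys, values, num_kv_blocks=3)
assert all(abs(a - b) < 1e-9 for a, b in zip(ref, out))
```

Q blocking would apply the same idea along the query dimension, with the causal mask deciding which KV positions each query block may see.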

WIP

  • Implement the Auto Blocking
  • Unit tests


vbaddi commented Apr 6, 2026

@kdulla please fix the DCO, thanks

vbaddi and others added 27 commits April 8, 2026 08:46
- Add strategy registry and AttentionBlockingConfig for extensible blocking
- Implement BlockedKVAttentionTransform for supported attention modules
- Add auto-blocking policy with device- and model-specific parameters
- Integrate KV blocking into Llama-like models using QEffDynamicCache

Key components:
* attention_blocking.py: Strategy registry and config
* attention_blocking_policy.py: Auto-derive policy
* blocked_attention_utils.py: KV blocked attention kernels
* pytorch_transforms.py: Module-level blocking application

Usage:
  qaic_config = {"num_kv_blocks": 2}  # explicit
  qaic_config = {"attn_blocking_auto": True}  # automatic

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
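
As a rough illustration of the explicit setting in the commit message above, the sketch below shows one way a "num_kv_blocks" value could map onto index ranges along the KV sequence. The helper kv_block_ranges is hypothetical and not part of QEfficient, and the ceil-division split is an assumption; the actual partitioning in blocked_attention_utils.py may differ.

```python
def kv_block_ranges(ctx_len, num_kv_blocks):
    """Split [0, ctx_len) into contiguous (start, end) ranges.

    Hypothetical helper: uses ceil division for the block size, so the last
    block may be shorter and short sequences may yield fewer blocks than
    requested.
    """
    if ctx_len < 1 or num_kv_blocks < 1:
        raise ValueError("ctx_len and num_kv_blocks must be positive")
    block = -(-ctx_len // num_kv_blocks)  # ceil(ctx_len / num_kv_blocks)
    return [(start, min(start + block, ctx_len))
            for start in range(0, ctx_len, block)]


qaic_config = {"num_kv_blocks": 2}  # explicit setting, as in the commit message
print(kv_block_ranges(8, qaic_config["num_kv_blocks"]))  # [(0, 4), (4, 8)]
```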
…ic_config

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <quic_kdulla@quicinc.com>
…ferent kinds of blocking

Signed-off-by: Kushal Dulla <quic_kdulla@quicinc.com>
Signed-off-by: Kushal Dulla <quic_kdulla@quicinc.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
…in compile

Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
…face and calls to blocking directory

Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
…blocked interface to further models.

Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
@kdulla kdulla force-pushed the enable_causal_blocking branch from d5dfdb8 to cdd9823 Compare April 8, 2026 10:53
Signed-off-by: Kushal Dulla <quic_kdulla@quicinc.com>
@kdulla kdulla force-pushed the enable_causal_blocking branch from 1de9644 to f8e61d5 Compare April 9, 2026 09:25
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
@kdulla kdulla force-pushed the enable_causal_blocking branch from f8e61d5 to aa328f9 Compare April 9, 2026 09:30
@vishwasdivakar

@vbaddi, PR #908 at the latest commit fails to export Qwen2.5-VL-32B (using the same export script as PR #774).

Commit: aa328f9

kdulla added 2 commits April 10, 2026 06:33
Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>
…y have onnx

Signed-off-by: Kushal Dulla <kdulla@qti.qualcomm.com>

3 participants