[Feature] Add SFA MLA prolog v3 path by ZYang6263 · Pull Request #10294 · vllm-project/vllm-ascend

ZYang6263 · 2026-06-10T12:24:42Z

Summary

Adds an SFA-specific mla_prolog_v3 path and extends it to the packed int8 KV-cache case consumed by torch_npu.npu_kv_quant_sparse_flash_attention.

This path is intentionally independent from enable_fa_quant and the existing mla_v1.py flow. It is enabled with:

VLLM_ASCEND_ENABLE_SFA_PROLOG_V3=1
VLLM_ASCEND_ENABLE_SFA_KV_QUANT_SPARSE_ATTENTION=1
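
As an illustration of how the two switches above could be read, here is a minimal sketch. The helper names `env_flag` and `sfa_paths_enabled` are hypothetical, and the accepted truthy spellings are an assumption; the actual parsing in vllm-ascend may differ.

```python
import os


def env_flag(name, default=False, environ=None):
    """Interpret an environment variable as a boolean switch.

    Hypothetical helper: the truthy spellings below are an assumption,
    not necessarily what vllm-ascend accepts.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def sfa_paths_enabled(environ=None):
    """Map the two environment variables from the PR description to config names."""
    return {
        "enable_sfa_prolog_v3": env_flag(
            "VLLM_ASCEND_ENABLE_SFA_PROLOG_V3", environ=environ
        ),
        "enable_sfa_kv_quant_sparse_attention": env_flag(
            "VLLM_ASCEND_ENABLE_SFA_KV_QUANT_SPARSE_ATTENTION", environ=environ
        ),
    }
```

Passing a plain dict as `environ` keeps the sketch testable without touching the real process environment.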

Changes

Rebased the PR branch onto current main (ab065ffb) and resolved the sfa_v1.py conflict.
Adds enable_sfa_prolog_v3 / VLLM_ASCEND_ENABLE_SFA_PROLOG_V3 config plumbing.
Adds enable_sfa_kv_quant_sparse_attention / VLLM_ASCEND_ENABLE_SFA_KV_QUANT_SPARSE_ATTENTION config plumbing.
Prepares mla_prolog_v3 weight handles from W8A8 dynamic loaded weights in sfa_v1.py.
Adds the non-quantized KV-cache prolog path with kv_cache_quant_mode=0.
Adds the packed int8 KV-cache path with kv_cache_quant_mode=3, ckvkr_repo_mode=1, quant_scale_repo_mode=1, and tile_size=128.
Extends MLA sparse KV cache spec and model runner allocation so kv_cache[0] is packed int8 cache, kv_cache[1] is an empty kr_cache, and DSA indexer cache remains in later tuple entries.
Routes SFA and SFA-CP attention to torch_npu.npu_kv_quant_sparse_flash_attention when the packed int8 cache path is enabled.
Keeps A5 on the existing Sparse C8 FP8 CKV path; the new switch targets non-A5 int8 KV cache.
Adds design documentation under docs/source/developer_guide/Design_Documents/sfa_mla_prolog_v3_kv_quant.md.
Adds unit coverage for config plumbing, packed KV cache allocation, prolog kwargs, and QSFA operator dispatch.
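
The mode selection described in the list above can be sketched as follows. The parameter values come from the PR description; the function names, the string labels for the attention paths, and the idea that these values are passed as keyword arguments are assumptions made for illustration only.

```python
def prolog_v3_quant_kwargs(packed_int8_kv_cache):
    """Return the quantization-related settings for the two prolog paths.

    Hypothetical helper: only the values are taken from the PR description.
    """
    if packed_int8_kv_cache:
        return {
            "kv_cache_quant_mode": 3,
            "ckvkr_repo_mode": 1,
            "quant_scale_repo_mode": 1,
            "tile_size": 128,
        }
    # Non-quantized KV-cache path.
    return {"kv_cache_quant_mode": 0}


def select_attention_path(is_a5, enable_kv_quant_sparse_attention):
    """Pick an attention path label (labels are illustrative, not real identifiers)."""
    if is_a5:
        # A5 stays on the existing Sparse C8 FP8 CKV path.
        return "sparse_c8_fp8_ckv"
    if enable_kv_quant_sparse_attention:
        # Non-A5 devices with the packed int8 cache enabled.
        return "npu_kv_quant_sparse_flash_attention"
    return "existing_sparse_attention"
```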

Validation

git diff --check
Not run: unit tests, because this local PowerShell environment does not have python, py, or uv on PATH.

Notes

The packed cache dimension is computed as:

kv_lora_rank + qk_rope_head_dim * 2 + (kv_lora_rank / 128) * 4

For the common MLA shape this is 512 + 64 * 2 + 4 * 4 = 656.
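
The arithmetic above can be checked with a small sketch. The function name is hypothetical, and treating the divisor 128 as the `tile_size` from the packed int8 path is an assumption; the formula itself is copied from the note.

```python
def packed_kv_cache_dim(kv_lora_rank=512, qk_rope_head_dim=64, tile_size=128):
    """Compute kv_lora_rank + qk_rope_head_dim * 2 + (kv_lora_rank / tile_size) * 4."""
    if kv_lora_rank % tile_size != 0:
        # Assumption: the rank is expected to divide evenly into tiles.
        raise ValueError("kv_lora_rank must be a multiple of tile_size")
    num_tiles = kv_lora_rank // tile_size
    return kv_lora_rank + qk_rope_head_dim * 2 + num_tiles * 4


# Common MLA shape from the note: 512 + 64 * 2 + 4 * 4
common_dim = packed_kv_cache_dim(512, 64)
```

For the common shape, `common_dim` evaluates to 656, matching the figure in the note.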

vLLM version: v0.23.0
vLLM main: vllm-project/vllm@967c5c3

github-actions · 2026-06-10T12:26:06Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Fill in the PR description so that the commit message helps reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

github-actions · 2026-06-10T12:26:18Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

gemini-code-assist · 2026-06-10T12:34:31Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new SFA MLA prolog v3 path, specifically designed to enhance performance for models utilizing W8A8 dynamic weights while maintaining non-quantized KV cache and query. The changes involve adding a new configuration option to enable this feature, alongside the necessary backend logic to process weights and inputs according to the specified quantization modes. This update aims to provide a more optimized preprocessing pipeline for a particular set of quantization requirements.

Highlights

New SFA MLA Prolog v3 Path: Introduced a new SFA-specific mla_prolog_v3 preprocessing path, enabling optimized handling for W8A8 dynamic weights with non-quantized KV cache and query.
Configurability: Added a new configuration flag enable_sfa_prolog_v3 (and its environment variable VLLM_ASCEND_ENABLE_SFA_PROLOG_V3) to control the activation of the new prolog path.
Weight and Quantization Handling: Implemented specific weight processing (_process_weights_for_fused_prolog_v3) and input formatting for mla_prolog_v3, ensuring compatibility with W8A8 dynamic quantization for fused QKV and Q projections.
Temporary TND Cache Handling: Included temporary handling for TND cache mode within the mla_prolog_v3 flow, allowing for prolog execution before existing all-gather and paged-cache writeback.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

Suggested PR Title:

[Attention][Feature] Support SFA preprocessing with mla_prolog_v3

Suggested PR Summary:

### What this PR does / why we need it?
This PR introduces support for SFA preprocessing using `mla_prolog_v3` in the Ascend backend. It adds the configuration option `enable_sfa_prolog_v3` (controlled via the `VLLM_ASCEND_ENABLE_SFA_PROLOG_V3` environment variable) and implements the corresponding weight processing, input formatting, and preprocessing steps in the SFA attention implementation.

Feedback:
An issue was identified in `vllm_ascend/attention/sfa_v1.py` where the private API `torch_npu._npu_reshape_and_cache` is called directly. It is recommended to use the `DeviceOperator.reshape_and_cache` abstraction instead and slice the inputs to `attn_metadata.num_actual_tokens` to ensure compatibility and correct shape alignment.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new environment variable `VLLM_ASCEND_ENABLE_SFA_PROLOG_V3` and a configuration option `enable_sfa_prolog_v3` to enable SFA preprocessing with `mla_prolog_v3`.

### How was this patch tested?
The patch was tested with unit tests in `tests/ut/test_ascend_config.py` verifying the configuration fallback and override behavior.

gemini-code-assist · 2026-06-10T12:37:26Z

+                torch_npu._npu_reshape_and_cache(
+                    key=k_nope,
+                    value=k_pe,
+                    key_cache=kv_cache[0],
+                    value_cache=kv_cache[1],
+                    slot_indices=slot_mapping,
+                )


Calling the private/internal API torch_npu._npu_reshape_and_cache directly is discouraged as it can lead to compatibility issues across different PyTorch/CANN versions. Additionally, passing un-sliced k_nope, k_pe, and slot_mapping can cause shape mismatches or incorrect cache writing if there is padding or if the gathered tensor size does not align with the un-sliced slot mapping.

Please use the established DeviceOperator.reshape_and_cache abstraction instead, and slice the inputs to attn_metadata.num_actual_tokens to ensure shape alignment and correctness, matching the pattern used in _all_gather_and_cache_dsa_cp_kv.

                DeviceOperator.reshape_and_cache(
                    key=k_nope[: attn_metadata.num_actual_tokens],
                    value=k_pe[: attn_metadata.num_actual_tokens],
                    key_cache=kv_cache[0],
                    value_cache=kv_cache[1],
                    slot_mapping=slot_mapping[: attn_metadata.num_actual_tokens],
                )
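
To make the padding concern concrete, here is a toy model in plain Python. It is not the Ascend operator: `reshape_and_cache_sim` is a hypothetical stand-in that scatters entries into a list, and the assumption that padded slot entries point at slot 0 is for illustration only.

```python
def reshape_and_cache_sim(keys, slot_mapping, cache):
    """Toy stand-in for a cache write: store keys[i] at cache[slot_mapping[i]]."""
    if len(keys) != len(slot_mapping):
        raise ValueError("keys and slot_mapping must have the same length")
    for key, slot in zip(keys, slot_mapping):
        cache[slot] = key
    return cache


num_actual_tokens = 3
# Three real tokens followed by two padding rows.
keys = ["k0", "k1", "k2", "pad", "pad"]
# Assumption: padded positions carry slot 0, which a real token also uses.
slot_mapping = [4, 0, 2, 0, 0]

# Unsliced call: padding rows overwrite the real entry in slot 0.
unsliced = reshape_and_cache_sim(keys, slot_mapping, [None] * 6)

# Sliced call: only the first num_actual_tokens rows are written.
sliced = reshape_and_cache_sim(
    keys[:num_actual_tokens], slot_mapping[:num_actual_tokens], [None] * 6
)
```

In this toy setup the unsliced call leaves a padding row in slot 0, while the sliced call keeps the real token there.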

Signed-off-by: ZYang6263 <zy626375@gmail.com>

ZYang6263 mentioned this pull request Jun 10, 2026

[codex] Add SFA MLA prolog v3 path ZYang6263/vllm-ascend#7

Closed

github-actions Bot added module:tests module:core labels Jun 10, 2026

github-actions Bot added the merge-conflicts label Jun 10, 2026

gemini-code-assist Bot reviewed Jun 10, 2026

ZYang6263 changed the title ~~[codex] Add SFA MLA prolog v3 path~~ Add SFA MLA prolog v3 path Jun 26, 2026

Add SFA MLA prolog v3 path

8573d4e

ZYang6263 force-pushed the codex/sfa-mla-prolog-v3 branch from bf002e6 to 4913778 on June 26, 2026 15:30

github-actions Bot added the documentation label and removed the merge-conflicts label Jun 26, 2026

ZYang6263 changed the title ~~Add SFA MLA prolog v3 path~~ [Feature] Add SFA MLA prolog v3 path Jun 26, 2026

Add SFA MLA prolog v3 int8 KV sparse attention

03d36db

ZYang6263 force-pushed the codex/sfa-mla-prolog-v3 branch from 4913778 to 03d36db on June 26, 2026 15:34

Fix SFA QSFA KV cache split sizing

25b7e53

Signed-off-by: ZYang6263 <zy626375@gmail.com>

ZYang6263 force-pushed the codex/sfa-mla-prolog-v3 branch from 6228f48 to 25b7e53 on June 27, 2026 07:37

ZYang6263 wants to merge 3 commits into vllm-project:main from ZYang6263:codex/sfa-mla-prolog-v3
