[Feature] Add SFA MLA prolog v3 path #10294

Draft
ZYang6263 wants to merge 3 commits into vllm-project:main from ZYang6263:codex/sfa-mla-prolog-v3

@ZYang6263 ZYang6263 commented Jun 10, 2026

Collaborator

Summary

Adds an SFA-specific mla_prolog_v3 path and extends it to the packed int8 KV-cache case consumed by torch_npu.npu_kv_quant_sparse_flash_attention.

This path is intentionally independent from enable_fa_quant and the existing mla_v1.py flow. It is enabled with:

VLLM_ASCEND_ENABLE_SFA_PROLOG_V3=1
VLLM_ASCEND_ENABLE_SFA_KV_QUANT_SPARSE_ATTENTION=1
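
As a rough illustration of how the two switches above might be consumed, the sketch below reads them as boolean flags. The helper `env_flag` and the set of accepted truthy strings are assumptions made for this example; the actual parsing in vllm-ascend may differ.

```python
import os

# Assumption: which strings count as "enabled" is chosen for this sketch only.
TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Read an environment variable as a boolean switch (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


# Set the two switches named in the PR description for this process.
os.environ["VLLM_ASCEND_ENABLE_SFA_PROLOG_V3"] = "1"
os.environ["VLLM_ASCEND_ENABLE_SFA_KV_QUANT_SPARSE_ATTENTION"] = "1"

enable_sfa_prolog_v3 = env_flag("VLLM_ASCEND_ENABLE_SFA_PROLOG_V3")
enable_sfa_kv_quant_sparse_attention = env_flag(
    "VLLM_ASCEND_ENABLE_SFA_KV_QUANT_SPARSE_ATTENTION"
)
```

An unset variable falls back to `default`, so both paths stay off unless they are requested explicitly.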

Changes

  • Rebased the PR branch onto current main (ab065ffb) and resolved the sfa_v1.py conflict.
  • Adds enable_sfa_prolog_v3 / VLLM_ASCEND_ENABLE_SFA_PROLOG_V3 config plumbing.
  • Adds enable_sfa_kv_quant_sparse_attention / VLLM_ASCEND_ENABLE_SFA_KV_QUANT_SPARSE_ATTENTION config plumbing.
  • Prepares mla_prolog_v3 weight handles from W8A8 dynamic loaded weights in sfa_v1.py.
  • Adds the non-quantized KV-cache prolog path with kv_cache_quant_mode=0.
  • Adds the packed int8 KV-cache path with kv_cache_quant_mode=3, ckvkr_repo_mode=1, quant_scale_repo_mode=1, and tile_size=128.
  • Extends the MLA sparse KV cache spec and model runner allocation so that kv_cache[0] is the packed int8 cache, kv_cache[1] is an empty kr_cache, and the DSA indexer cache remains in later tuple entries.
  • Routes SFA and SFA-CP attention to torch_npu.npu_kv_quant_sparse_flash_attention when the packed int8 cache path is enabled.
  • Keeps A5 on the existing Sparse C8 FP8 CKV path; the new switch targets non-A5 int8 KV cache.
  • Adds design documentation under docs/source/developer_guide/Design_Documents/sfa_mla_prolog_v3_kv_quant.md.
  • Adds unit coverage for config plumbing, packed KV cache allocation, prolog kwargs, and QSFA operator dispatch.
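
The mode selection described in the list above can be sketched as two small functions. The parameter values are the ones listed in this description; the function names, the `is_a5` argument, and the string labels are hypothetical and are not taken from sfa_v1.py.

```python
from typing import Any, Dict


def prolog_quant_kwargs(packed_int8_kv: bool) -> Dict[str, Any]:
    """Return the quantization-related kwargs for the prolog call (sketch)."""
    if packed_int8_kv:
        # Packed int8 KV-cache path, values as listed in the PR description.
        return {
            "kv_cache_quant_mode": 3,
            "ckvkr_repo_mode": 1,
            "quant_scale_repo_mode": 1,
            "tile_size": 128,
        }
    # Non-quantized KV-cache path.
    return {"kv_cache_quant_mode": 0}


def select_attention_path(enable_kv_quant_sparse_attention: bool, is_a5: bool) -> str:
    """Pick a label for the attention path (labels are illustrative only)."""
    if enable_kv_quant_sparse_attention and not is_a5:
        # Non-A5 devices with the switch on use the packed int8 operator.
        return "npu_kv_quant_sparse_flash_attention"
    # A5 stays on the existing path, as does any device with the switch off.
    return "existing_path"
```

For example, `select_attention_path(True, is_a5=True)` returns `"existing_path"`, which mirrors the statement that the new switch targets only the non-A5 int8 KV cache.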

Validation

  • git diff --check
  • Not run: unit tests, because this local PowerShell environment does not have python, py, or uv on PATH.

Notes

The packed cache dimension is computed as:

kv_lora_rank + qk_rope_head_dim * 2 + (kv_lora_rank / 128) * 4

For the common MLA shape (kv_lora_rank = 512, qk_rope_head_dim = 64) this is 512 + 64 * 2 + (512 / 128) * 4 = 656.
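
The formula above can be checked with a few lines of Python. The function name and the named constants are hypothetical. Reading the factor 2 as bytes per rope element, 128 as the tile size, and 4 as bytes per scale is an interpretation for this sketch, not something this description confirms.

```python
def packed_kv_cache_dim(
    kv_lora_rank: int,
    qk_rope_head_dim: int,
    tile_size: int = 128,        # assumption: the 128 in the formula is the tile size
    rope_elem_bytes: int = 2,    # assumption: the "* 2" is bytes per rope element
    scale_bytes: int = 4,        # assumption: the "* 4" is bytes per per-tile scale
) -> int:
    """Compute the packed cache dimension from the formula in the PR notes."""
    if kv_lora_rank % tile_size != 0:
        raise ValueError("kv_lora_rank must be a multiple of tile_size")
    num_tiles = kv_lora_rank // tile_size
    return kv_lora_rank + qk_rope_head_dim * rope_elem_bytes + num_tiles * scale_bytes


common_dim = packed_kv_cache_dim(kv_lora_rank=512, qk_rope_head_dim=64)
```

With the default arguments the function evaluates exactly the expression given in the notes, so `common_dim` matches the 656 quoted there.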

@github-actions

Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.

@github-actions

Contributor

This pull request has conflicts; please resolve them before we can evaluate it.

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new SFA MLA prolog v3 path, specifically designed to enhance performance for models utilizing W8A8 dynamic weights while maintaining non-quantized KV cache and query. The changes involve adding a new configuration option to enable this feature, alongside the necessary backend logic to process weights and inputs according to the specified quantization modes. This update aims to provide a more optimized preprocessing pipeline for a particular set of quantization requirements.

Highlights

  • New SFA MLA Prolog v3 Path: Introduced a new SFA-specific mla_prolog_v3 preprocessing path, enabling optimized handling for W8A8 dynamic weights with non-quantized KV cache and query.
  • Configurability: Added a new configuration flag enable_sfa_prolog_v3 (and its environment variable VLLM_ASCEND_ENABLE_SFA_PROLOG_V3) to control the activation of the new prolog path.
  • Weight and Quantization Handling: Implemented specific weight processing (_process_weights_for_fused_prolog_v3) and input formatting for mla_prolog_v3, ensuring compatibility with W8A8 dynamic quantization for fused QKV and Q projections.
  • Temporary TND Cache Handling: Included temporary handling for TND cache mode within the mla_prolog_v3 flow, allowing for prolog execution before existing all-gather and paged-cache writeback.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

Suggested PR Title:

[Attention][Feature] Support SFA preprocessing with mla_prolog_v3

Suggested PR Summary:

### What this PR does / why we need it?
This PR introduces support for SFA preprocessing using `mla_prolog_v3` in the Ascend backend. It adds the configuration option `enable_sfa_prolog_v3` (controlled via the `VLLM_ASCEND_ENABLE_SFA_PROLOG_V3` environment variable) and implements the corresponding weight processing, input formatting, and preprocessing steps in the SFA attention implementation.

Feedback:
An issue was identified in `vllm_ascend/attention/sfa_v1.py` where the private API `torch_npu._npu_reshape_and_cache` is called directly. It is recommended to use the `DeviceOperator.reshape_and_cache` abstraction instead and slice the inputs to `attn_metadata.num_actual_tokens` to ensure compatibility and correct shape alignment.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new environment variable `VLLM_ASCEND_ENABLE_SFA_PROLOG_V3` and a configuration option `enable_sfa_prolog_v3` to enable SFA preprocessing with `mla_prolog_v3`.

### How was this patch tested?
The patch was tested with unit tests in `tests/ut/test_ascend_config.py` verifying the configuration fallback and override behavior.

Comment thread vllm_ascend/attention/sfa_v1.py Outdated
Comment on lines +1387 to +1393
torch_npu._npu_reshape_and_cache(
key=k_nope,
value=k_pe,
key_cache=kv_cache[0],
value_cache=kv_cache[1],
slot_indices=slot_mapping,
)

Contributor

high

Calling the private/internal API torch_npu._npu_reshape_and_cache directly is discouraged as it can lead to compatibility issues across different PyTorch/CANN versions. Additionally, passing un-sliced k_nope, k_pe, and slot_mapping can cause shape mismatches or incorrect cache writing if there is padding or if the gathered tensor size does not align with the un-sliced slot mapping.

Please use the established DeviceOperator.reshape_and_cache abstraction instead, and slice the inputs to attn_metadata.num_actual_tokens to ensure shape alignment and correctness, matching the pattern used in _all_gather_and_cache_dsa_cp_kv.

                DeviceOperator.reshape_and_cache(
                    key=k_nope[: attn_metadata.num_actual_tokens],
                    value=k_pe[: attn_metadata.num_actual_tokens],
                    key_cache=kv_cache[0],
                    value_cache=kv_cache[1],
                    slot_mapping=slot_mapping[: attn_metadata.num_actual_tokens],
                )
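
To illustrate why slicing to the actual token count matters, here is a toy model of a slot-indexed cache write using plain Python lists. It is not the DeviceOperator or torch_npu implementation; the function `write_to_cache` and the placeholder slot value 0 for padded entries are assumptions made for this sketch.

```python
from typing import List, Optional


def write_to_cache(
    keys: List[str], slot_mapping: List[int], cache: List[Optional[str]]
) -> None:
    """Toy stand-in for a cache write: store each key at its mapped slot."""
    for key, slot in zip(keys, slot_mapping):
        cache[slot] = key


num_actual_tokens = 3
# Three real tokens followed by two padded rows.
keys = ["k0", "k1", "k2", "pad", "pad"]
# Assumption: padded entries carry a placeholder slot of 0.
slot_mapping = [4, 7, 2, 0, 0]

# Un-sliced write: padded rows overwrite slot 0.
cache_unsliced: List[Optional[str]] = [None] * 8
write_to_cache(keys, slot_mapping, cache_unsliced)

# Sliced write: only the real tokens reach the cache.
cache_sliced: List[Optional[str]] = [None] * 8
write_to_cache(
    keys[:num_actual_tokens], slot_mapping[:num_actual_tokens], cache_sliced
)
```

In this toy setup the un-sliced call leaves a padding row in slot 0, while the sliced call leaves that slot untouched, which is the kind of incorrect cache write the review comment warns about.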

@ZYang6263 ZYang6263 changed the title from "[codex] Add SFA MLA prolog v3 path" to "Add SFA MLA prolog v3 path" on Jun 26, 2026
@ZYang6263 ZYang6263 force-pushed the codex/sfa-mla-prolog-v3 branch from bf002e6 to 4913778 on June 26, 2026 15:30
@github-actions github-actions Bot added the documentation label and removed the merge-conflicts label on Jun 26, 2026
@ZYang6263 ZYang6263 changed the title from "Add SFA MLA prolog v3 path" to "[Feature] Add SFA MLA prolog v3 path" on Jun 26, 2026
@ZYang6263 ZYang6263 force-pushed the codex/sfa-mla-prolog-v3 branch from 4913778 to 03d36db on June 26, 2026 15:34
Signed-off-by: ZYang6263 <zy626375@gmail.com>
@ZYang6263 ZYang6263 force-pushed the codex/sfa-mla-prolog-v3 branch from 6228f48 to 25b7e53 on June 27, 2026 07:37

Labels

documentation, module:core, module:tests
