split attention template via data types #270
Open
xinyu-intel wants to merge 1 commit into vllm-project:main
Pull request overview
This PR refactors the Xe2 XPU attention (chunk prefill + paged decode) kernel instantiation and dispatch to be specialized by explicit Q/KV/O data types, rather than relying on runtime dtype branching inside the instantiated templates.
Changes:
- Introduces `CutlassQKOType` and `aten_to_Cutlass_qko_dtype()` to carry Q/K/O dtypes through the dispatch path.
- Splits the dispatch into a typed implementation (`*_dispatch_typed_impl<..., ElementQ, ElementKV, ElementO, ...>`) plus a runtime dtype switch that calls into those typed instantiations.
- Updates CMake/kernel templates/extern declarations to generate and declare explicit instantiations for the allowed dtype combinations.
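The split described above, a runtime dtype switch that forwards to a typed implementation, can be sketched as follows. This is only an illustration under stated assumptions: `DType`, `QKOType`, the tag structs, and both function names are hypothetical stand-ins for the PR's `CutlassDType`, `CutlassQKOType`, and `*_dispatch_typed_impl`, and the rule that Q and O share a dtype is inferred from the allowed combinations quoted later in the review, not confirmed by the source.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the PR's CutlassDType.
enum class DType { half, bfloat16, float8_e4m3, float8_e5m2 };

// Hypothetical stand-in for the PR's CutlassQKOType.
struct QKOType {
  DType q_type;
  DType k_type;
  DType o_type;
};

// Hypothetical tag types standing in for half_t, bfloat16_t,
// float_e4m3_t and float_e5m2_t.
struct Half { static constexpr const char* name = "half"; };
struct BF16 { static constexpr const char* name = "bf16"; };
struct E4M3 { static constexpr const char* name = "e4m3"; };
struct E5M2 { static constexpr const char* name = "e5m2"; };

// Typed implementation: the element types are template parameters,
// so the body contains no runtime dtype branching.
template <typename ElementQ, typename ElementKV, typename ElementO>
std::string dispatch_typed_impl() {
  return std::string(ElementQ::name) + "|" + ElementKV::name + "|" +
         ElementO::name;
}

// Picks the KV element type once the Q/O element type is fixed.
template <typename ElementQO>
std::string dispatch_kv(DType k_type, DType qo_type) {
  if (k_type == qo_type) {
    return dispatch_typed_impl<ElementQO, ElementQO, ElementQO>();
  }
  if (k_type == DType::float8_e4m3) {
    return dispatch_typed_impl<ElementQO, E4M3, ElementQO>();
  }
  if (k_type == DType::float8_e5m2) {
    return dispatch_typed_impl<ElementQO, E5M2, ElementQO>();
  }
  throw std::invalid_argument("Unsupported KV dtype");
}

// Runtime switch: maps the dtype triplet onto one typed instantiation.
// Assumption: Q and O must share a dtype.
std::string dispatch(const QKOType& t) {
  if (t.q_type != t.o_type) {
    throw std::invalid_argument("Q and O dtypes must match");
  }
  if (t.q_type == DType::half) {
    return dispatch_kv<Half>(t.k_type, DType::half);
  }
  if (t.q_type == DType::bfloat16) {
    return dispatch_kv<BF16>(t.k_type, DType::bfloat16);
  }
  throw std::invalid_argument("Unsupported Q dtype");
}
```

In this sketch every supported triplet reaches exactly one instantiation of `dispatch_typed_impl`, and anything else throws instead of falling through silently.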
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| csrc/xpu/attn/xe_2/paged_decode.hpp | Adds typed dispatch helper and updates runtime dtype dispatch to include O dtype. |
| csrc/xpu/attn/xe_2/paged_decode_xe2.cpp | Switches paged decode to use Q/K/O dtype triplet in runtime dispatch. |
| csrc/xpu/attn/xe_2/paged_decode_utils.hpp | Propagates CutlassQKOType through the decode dispatch helpers. |
| csrc/xpu/attn/xe_2/paged_decode_kernel_template.cpp.in | Updates generated instantiation to target the typed dispatch function and dtype macros. |
| csrc/xpu/attn/xe_2/paged_decode_extern.hpp | Expands extern template declarations across allowed dtype combinations. |
| csrc/xpu/attn/xe_2/paged_decode_configure.cmake | Generates kernel sources across (dtype combo × bool flags × policy variants). |
| csrc/xpu/attn/xe_2/fmha_xe2.cpp | Switches chunk prefill to Q/K/O dtype triplet dispatch. |
| csrc/xpu/attn/xe_2/fmha_utils.hpp | Replaces Q/K dtype pair with Q/K/O dtype triplet utilities/types. |
| csrc/xpu/attn/xe_2/chunk_prefill.hpp | Adds typed dispatch helper and updates runtime dtype dispatch to include O dtype. |
| csrc/xpu/attn/xe_2/chunk_prefill_utils.hpp | Propagates CutlassQKOType through the chunk prefill dispatch helpers. |
| csrc/xpu/attn/xe_2/chunk_prefill_kernel_template.cpp.in | Updates generated instantiation to target the typed dispatch function and dtype macros. |
| csrc/xpu/attn/xe_2/chunk_prefill_extern.hpp | Expands extern template declarations across allowed dtype combinations. |
| csrc/xpu/attn/xe_2/chunk_prefill_configure.cmake | Generates chunk-prefill kernel sources across dtype combinations and bool permutations. |
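The pairing of extern declarations in a header with explicit instantiations in generated sources, as listed in the table above, could look roughly like the sketch below. All names are hypothetical: `typed_impl_bytes` and the two element structs are placeholders chosen for illustration, not the PR's actual functions or CUTLASS types, and both halves are shown in one file only to keep the example self-contained.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the element types named in the PR.
struct half_t { unsigned short bits; };
struct float_e4m3_t { unsigned char bits; };

// Hypothetical typed entry point, parameterized on Q/KV/O element types.
template <typename ElementQ, typename ElementKV, typename ElementO>
std::size_t typed_impl_bytes() {
  return sizeof(ElementQ) + sizeof(ElementKV) + sizeof(ElementO);
}

// What an *_extern.hpp header might carry: explicit instantiation
// declarations, which tell each including translation unit not to
// instantiate these combinations implicitly.
extern template std::size_t typed_impl_bytes<half_t, half_t, half_t>();
extern template std::size_t typed_impl_bytes<half_t, float_e4m3_t, half_t>();

// What a generated .cpp might carry: the matching explicit instantiation
// definitions, one per allowed dtype combination.
template std::size_t typed_impl_bytes<half_t, half_t, half_t>();
template std::size_t typed_impl_bytes<half_t, float_e4m3_t, half_t>();
```

The usual motivation for this layout is that each combination is compiled once, in its own source file, rather than in every file that includes the dispatch header.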
Comment on lines +595 to +599

    } else if (cuQKOType.k_type == CutlassDType::float8_e5m2) {
      return decode_policy_dispatch_typed_impl<
          decode_policy,
          bfloat16_t,
          float_e5m2_t,
Comment on lines +354 to +356

    if (cuQKOType.o_type == CutlassDType::half) {
      if (cuQKOType.q_type == CutlassDType::half) {
        if (cuQKOType.k_type == CutlassDType::half) {
Comment on lines +69 to +77

    # Allowed dtype combinations must match runtime dispatch constraints. Format:
    # Q_TYPE|KV_TYPE|O_TYPE|FILE_TAG
    set(dtype_combo_list
        "half_t|half_t|half_t|h_h_h"
        "half_t|float_e4m3_t|half_t|h_e4_h"
        "half_t|float_e5m2_t|half_t|h_e5_h"
        "bfloat16_t|bfloat16_t|bfloat16_t|b_b_b"
        "bfloat16_t|float_e4m3_t|bfloat16_t|b_e4_b"
        "bfloat16_t|float_e5m2_t|bfloat16_t|b_e5_b")
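Since the comment in the list above requires the CMake combinations to match the runtime dispatch, one way to keep the two in view is to mirror the list on the C++ side. The sketch below is an assumption for illustration only: `DtypeCombo`, `kAllowedCombos`, and `file_tag_for` do not appear in the PR; only the six entries themselves are copied from the quoted CMake list.

```cpp
#include <string>

// One entry of the Q_TYPE|KV_TYPE|O_TYPE|FILE_TAG format.
struct DtypeCombo {
  const char* q_type;
  const char* kv_type;
  const char* o_type;
  const char* file_tag;
};

// Hypothetical C++ mirror of dtype_combo_list from the CMake snippet.
static const DtypeCombo kAllowedCombos[] = {
    {"half_t", "half_t", "half_t", "h_h_h"},
    {"half_t", "float_e4m3_t", "half_t", "h_e4_h"},
    {"half_t", "float_e5m2_t", "half_t", "h_e5_h"},
    {"bfloat16_t", "bfloat16_t", "bfloat16_t", "b_b_b"},
    {"bfloat16_t", "float_e4m3_t", "bfloat16_t", "b_e4_b"},
    {"bfloat16_t", "float_e5m2_t", "bfloat16_t", "b_e5_b"},
};

// Returns the file tag for an allowed combination, or an empty string
// when the combination has no generated instantiation.
std::string file_tag_for(const std::string& q, const std::string& kv,
                         const std::string& o) {
  for (const DtypeCombo& c : kAllowedCombos) {
    if (q == c.q_type && kv == c.kv_type && o == c.o_type) {
      return c.file_tag;
    }
  }
  return "";
}
```

A lookup like this makes a mismatch visible early: a combination the runtime switch accepts but the build never generated returns an empty tag here, whereas in the real build it would presumably surface as a link error.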
Comment on lines +387 to +388

    TORCH_CHECK(false, "Unsupported KV dtype for chunk prefill dispatch");
    }
addd992 to 3dca5b1
Collaborator, Author
@baodii @YizhouZ @jikunshang please review. This change breaks the attention templates down by dtype and will be helpful for further development.
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>