WebGPU: Add indirect dispatch for flash attention graph capture #29236
Draft
feich-ms wants to merge 6 commits into
Pull request overview
This PR improves the WebGPU FlashAttention decode path’s compatibility with graph capture and dynamic sequence lengths by enabling GPU-side computation of indirect dispatch group sizes from seqlen_k (including the kv_empty/shared-KV case that previously could produce dispatch(0)).
Changes:
- Added a dedicated `PrepareIndirectDispatchProgram` shader to compute normalized indirect dispatch dimensions on GPU from `seqlen_k`.
- Expanded `use_seqlen_k` / `use_indirect_dispatch` gating so indirect dispatch is also available when `total_sequence_length_` is unavailable on CPU (including `kv_empty` layers).
- Ensured the `kv_empty` path prepares the indirect dispatch buffer even when `CopyKVCache` is skipped.
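To make the arithmetic in the first bullet concrete, here is a minimal CPU-side sketch of what a "prepare indirect dispatch" step might compute. It is not the code from this PR: `DispatchDims`, `CeilDiv`, `NormalizeDispatch`, `PrepareIndirectDispatch`, the `tile_size` and `num_heads` parameters, the 65535 per-dimension limit, and the simple 2D split are all assumptions for illustration. It also assumes the convention that `seqlen_k` equals the total sequence length minus one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical layout of the three values an indirect dispatch reads.
struct DispatchDims {
  uint32_t x;
  uint32_t y;
  uint32_t z;
};

// Integer ceiling division.
inline uint64_t CeilDiv(uint64_t a, uint64_t b) {
  assert(b > 0);
  return (a + b - 1) / b;
}

// Simplified stand-in for a normalize step (an assumption, not the real
// normalize_dispatch_group_size): spill into y when x would exceed the
// per-dimension limit, and never return zero workgroups.
inline DispatchDims NormalizeDispatch(uint64_t groups, uint32_t max_per_dim) {
  assert(max_per_dim > 0);
  groups = std::max<uint64_t>(groups, 1);
  const uint64_t x = std::min<uint64_t>(groups, max_per_dim);
  const uint64_t y = CeilDiv(groups, x);
  assert(y <= max_per_dim);  // this sketch does not split into z
  return {static_cast<uint32_t>(x), static_cast<uint32_t>(y), 1u};
}

// Derive dispatch dimensions from seqlen_k alone, so the CPU never needs
// to know the total sequence length.
inline DispatchDims PrepareIndirectDispatch(uint32_t seqlen_k,
                                            uint32_t tile_size,
                                            uint32_t num_heads,
                                            uint32_t max_per_dim = 65535) {
  const uint64_t total = uint64_t{seqlen_k} + 1;  // assumed convention
  const uint64_t tiles = CeilDiv(total, tile_size);
  return NormalizeDispatch(tiles * num_heads, max_per_dim);
}
```

With an assumed tile size of 16 and 4 heads, `seqlen_k = 8` yields one tile and dimensions (4, 1, 1), while `seqlen_k = 32` yields three tiles and (12, 1, 1). Because the total is always at least 1, the tile count cannot collapse to zero the way a CPU-side value of 0 could.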
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/contrib_ops/webgpu/bert/flash_attention.h | Declares PrepareIndirectDispatchProgram and its uniform interface. |
| onnxruntime/contrib_ops/webgpu/bert/flash_attention.cc | Implements the new program and wires indirect-dispatch preparation into the kv_empty path and broadened gating logic. |
- Extract AppendNormalizeDispatchShader() helper so CopyKVCacheProgram and PrepareIndirectDispatchProgram share one copy of the tile-count and normalize_dispatch_group_size call instead of duplicating it.
- Express use_indirect_dispatch as use_seqlen_k && (share_buffer || kv_empty) to make the subset relationship explicit and eliminate the repeated condition.
- Tighten use_seqlen_k / use_indirect_dispatch guards from == 0 to <= 0 to handle a negative total_sequence_length_ defensively.
- Add comment on the WGSL template's normalize call pointing back to the C++ helper so the two stay in sync.

Co-Authored-By: Claude <noreply@anthropic.com>
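The gating described in this commit can be sketched as two small predicates. This is an illustration only: the function names `TotalLengthUnknownOnCpu` and `UseIndirectDispatch` are hypothetical, and any other preconditions that feed `use_seqlen_k` in the real code (graph capture, for example) are not modelled here.

```cpp
#include <cstdint>

// The tightened guard: treat zero and negative values alike as
// "total sequence length is not known on the CPU".
constexpr bool TotalLengthUnknownOnCpu(int64_t total_sequence_length) {
  return total_sequence_length <= 0;
}

// Indirect dispatch is a strict subset of the seqlen_k path: it can only
// be on when use_seqlen_k is on, and then only for shared-buffer or
// kv_empty layers.
constexpr bool UseIndirectDispatch(bool use_seqlen_k, bool share_buffer,
                                   bool kv_empty) {
  return use_seqlen_k && (share_buffer || kv_empty);
}

// Compile-time checks of the subset relationship.
static_assert(!UseIndirectDispatch(false, true, true),
              "never enabled without use_seqlen_k");
static_assert(UseIndirectDispatch(true, false, true),
              "kv_empty alone is sufficient once use_seqlen_k holds");
static_assert(!UseIndirectDispatch(true, false, false),
              "needs share_buffer or kv_empty");
```

Writing the second predicate in terms of the first means the `<= 0` guard only has to appear in one place, which is the point of the refactor described above.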
Add two WebGPU GQA tests that exercise PrepareIndirectDispatchProgram:
- WebGPU_SharedKV_IndirectDispatch_Decode: kv_empty + total_sequence_length=0 (decode, past_seq=8), triggers use_seqlen_k=true and use_indirect_dispatch=true via the kv_empty path, cross-checked against CPU reference.
- WebGPU_SharedKV_IndirectDispatch_LargerPast: same path with past_seq=32 to exercise num_total_seq_length_tile > 1 in the tile count arithmetic.

Co-Authored-By: Claude <noreply@anthropic.com>
Replace deprecated bool use_cuda/use_webgpu params with GqaTargetEp::kCpu. Co-Authored-By: Claude <noreply@anthropic.com>
OpTester cannot enable graph capture, so use_indirect_dispatch is never triggered. Rewrite the tests to exercise the kv_empty path directly with a real positive total_sequence_length instead of 0. Co-Authored-By: Claude <noreply@anthropic.com>