
[bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length #2489

Merged
yzh119 merged 4 commits into flashinfer-ai:main from huangzhilin-hzl:bugfix_2486
Feb 20, 2026
Conversation

@huangzhilin-hzl (Contributor) commented Feb 4, 2026

📌 Description

Bugfix for #2486.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Fixed chunk boundary handling in top-k processing so iterations only cover valid elements when row segments extend beyond actual lengths, preventing out-of-range processing and improving stability.
  • Tests

    • Expanded test coverage for variable-length page-table transforms to include larger lengths, validating behavior for extended inputs.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @huangzhilin-hzl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the RadixTopKKernel_Unified CUDA kernel by rectifying the calculation of chunk_end. The fix ensures that the chunk processing boundary is correctly determined by stride instead of length, which previously caused issues in multi-CTA collaboration when max_len exceeded length. This change prevents potential out-of-bounds memory access and ensures the stability and correctness of the Top-K selection algorithm.

Highlights

  • Corrected chunk_end calculation: The chunk_end variable, used to define the boundary of processing chunks within the RadixTopKKernel_Unified CUDA kernel, was incorrectly using length as its upper bound. This has been corrected to use stride instead, addressing a bug where max_len > length could lead to incorrect behavior in multi-CTA collaboration scenarios.


Changelog
  • include/flashinfer/topk.cuh
    • Replaced length with stride in the min function for chunk_end calculation within the RadixTopKKernel_Unified kernel at two locations (lines 892 and 952). This corrects an indexing issue that could occur under specific conditions.
Activity
  • The pull request is currently marked as Work In Progress (WIP).
  • The author, huangzhilin-hzl, has submitted the initial changes.
  • Pre-commit and test checklists are included in the PR description but are currently unchecked, indicating pending actions or review.

@coderabbitai (Contributor) commented Feb 4, 2026

📝 Walkthrough

Adjusted per-CTA chunk-size/boundary logic in RadixTopKKernel_Unified so loops compute and use an actual_chunk_size that is zero when a chunk starts beyond the row length, preventing iterations over invalid elements. Added a test expansion for variable-length page table transforms.

Changes

Cohort / File(s) — Summary

  • TopK kernel boundary guards — include/flashinfer/topk.cuh: Compute actual_chunk_size with a guard ((chunk_start < length) ? (chunk_end - chunk_start) : 0) and use it for the per-CTA loops in both the trivial (k >= length) and main paths, to avoid processing past length.
  • Tests, variable-length coverage — tests/utils/test_topk.py: Expanded test options for the variable-length page-table transform by adding 131072 to the max_len choices to increase coverage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

v0.6.1

Suggested reviewers

  • jiahanc
  • kahyunnam
  • IwakuraRein
  • yzh119

Poem

🐰 A nibble, a hop, I check each stride,
Guarding chunks where stray bits hide.
No extra loops, no out-of-bounds fright,
Rows stay tidy, kernels run right. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive: The description references the related issue but lacks a detailed explanation of what the PR does and why it is needed, and the checklist items are unchecked. Resolution: expand the description to explain the specific problem being fixed and the solution implemented, and verify that the pre-commit and test checklist items are completed.

✅ Passed checks (2 passed)

  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is above the required threshold of 80.00%.
  • Title check — ✅ Passed: The title accurately describes the main bugfix, correcting the chunk_end calculation in multi-CTA collaboration when max_len > length, which is directly reflected in both modified files.


@gemini-code-assist (Contributor) left a comment

Code Review

This pull request attempts to fix a bug in the chunk_end calculation for multi-CTA top-k when the logical length is smaller than the stride. The proposed change from length to stride does prevent an unsigned integer underflow, but it introduces a new issue by causing the kernel to process invalid data. My review provides an alternative solution that correctly handles the chunk boundaries without reading garbage data, addressing the root cause of the underflow bug. I've suggested applying this fix in both places where chunk_end is calculated for consistency and robustness.

Comment thread: include/flashinfer/topk.cuh (Outdated)

      const uint32_t chunk_start = cta_in_group * chunk_size;
    - const uint32_t chunk_end = min(chunk_start + chunk_size, length);
    + const uint32_t chunk_end = min(chunk_start + chunk_size, stride);

Severity: critical

Using stride here fixes a potential unsigned underflow issue but introduces a logical bug. When length < stride, this change causes the kernel to read and process garbage data from the input tensor in the range [length, stride). This can lead to an incorrect pivot selection and wrong top-k results.

The original bug is that actual_chunk_size (on the next line) can underflow if chunk_start >= length. A better way to fix this without processing invalid data is to ensure chunk_end is never less than chunk_start.

I suggest this change to correctly clip the chunk to the valid length while avoiding the underflow:

    const uint32_t chunk_end = max(chunk_start, min(chunk_start + chunk_size, length));

Comment thread: include/flashinfer/topk.cuh (Outdated)

      // k >= vocab_size: return all indices
      const uint32_t chunk_start = cta_in_group * chunk_size;
    - const uint32_t chunk_end = min(chunk_start + chunk_size, length);
    + const uint32_t chunk_end = min(chunk_start + chunk_size, stride);

Severity: medium

While this change is a no-op in Basic mode because length equals stride, it's better to use length for clarity, as the logic is concerned with processing length elements.

The same potential for unsigned underflow in the loop bound chunk_end - chunk_start exists here if length could be less than stride in other contexts. Although it's not an issue in Basic mode currently, for robustness and consistency with the suggested fix at line 952, I recommend applying the same safe pattern here.

        const uint32_t chunk_end = max(chunk_start, min(chunk_start + chunk_size, length));

@huangzhilin-hzl (Contributor, Author)

Hi @yzh119
The PR with the fix is now ready for review. Could you please take a look when you have a moment?

For context, this fix addresses the issue we discussed in #2486

@yzh119 (Collaborator) commented Feb 18, 2026

@flashinfer run

@yzh119 (Collaborator) commented Feb 18, 2026

/bot run

@flashinfer-bot (Collaborator)

GitLab MR !325 has been created, and the CI pipeline #44326765 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator)

[FAILED] Pipeline #44326765: 9/20 passed

yzh119 changed the title from "[WIP][bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length" to "[bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length" on Feb 20, 2026.
@yzh119 (Collaborator) left a comment

The failed UT is not relevant to this change.

@yzh119 yzh119 merged commit ace3d15 into flashinfer-ai:main Feb 20, 2026
18 checks passed
