
[bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length #2489

Merged
yzh119 merged 4 commits into flashinfer-ai:main from huangzhilin-hzl:bugfix_2486
Feb 20, 2026
Conversation

@huangzhilin-hzl (Contributor) commented Feb 4, 2026

📌 Description

Bugfix for #2486.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Fixed chunk boundary handling in top-k processing so iterations only cover valid elements when row segments extend beyond actual lengths, preventing out-of-range processing and improving stability.
  • Tests

    • Expanded test coverage for variable-length page-table transforms to include larger lengths, validating behavior for extended inputs.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @huangzhilin-hzl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in the RadixTopKKernel_Unified CUDA kernel by rectifying the calculation of chunk_end. The fix ensures that the chunk processing boundary is correctly determined by stride instead of length, which previously caused issues in multi-CTA collaboration when max_len exceeded length. This change prevents potential out-of-bounds memory access and ensures the stability and correctness of the Top-K selection algorithm.

Highlights

  • Corrected chunk_end calculation: The chunk_end variable, used to define the boundary of processing chunks within the RadixTopKKernel_Unified CUDA kernel, was incorrectly using length as its upper bound. This has been corrected to use stride instead, addressing a bug where max_len > length could lead to incorrect behavior in multi-CTA collaboration scenarios.


Changelog
  • include/flashinfer/topk.cuh
    • Replaced length with stride in the min function for chunk_end calculation within the RadixTopKKernel_Unified kernel at two locations (lines 892 and 952). This corrects an indexing issue that could occur under specific conditions.
Activity
  • The pull request is currently marked as Work In Progress (WIP).
  • The author, huangzhilin-hzl, has submitted the initial changes.
  • Pre-commit and test checklists are included in the PR description but are currently unchecked, indicating pending actions or review.

@coderabbitai (Contributor) commented Feb 4, 2026

📝 Walkthrough

Adjusted per-CTA chunk-size/boundary logic in RadixTopKKernel_Unified so loops compute and use an actual_chunk_size that is zero when a chunk starts beyond the row length, preventing iterations over invalid elements. Added a test expansion for variable-length page table transforms.

Changes

Cohort / File(s) — Summary

  • TopK kernel boundary guards — include/flashinfer/topk.cuh: Compute actual_chunk_size with a guard ((chunk_start < length) ? (chunk_end - chunk_start) : 0) and use it for the per-CTA loops in both the trivial (k >= length) and main paths, to avoid processing past length.
  • Tests, variable-length coverage — tests/utils/test_topk.py: Expanded test options for the variable-length page-table transform by adding 131072 to the max_len choices to increase coverage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested labels

v0.6.1

Suggested reviewers

  • jiahanc
  • kahyunnam
  • IwakuraRein
  • yzh119

Poem

🐰 A nibble, a hop, I check each stride,
Guarding chunks where stray bits hide.
No extra loops, no out-of-bounds fright,
Rows stay tidy, kernels run right. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive: The description references the related issue but lacks a detailed explanation of what the PR does and why it is needed, and the checklist items are unchecked. Resolution: expand the description to explain the specific problem being fixed and the solution implemented, and verify that the pre-commit and test checklist items are completed.

✅ Passed checks (2 passed)

  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is above the required threshold of 80.00%.
  • Title check — ✅ Passed: The title accurately describes the main bugfix, correcting the chunk_end calculation in multi-CTA collaboration when max_len > length, which is directly reflected in both modified files.


@gemini-code-assist (Contributor) left a comment

Code Review

This pull request attempts to fix a bug in the chunk_end calculation for multi-CTA top-k when the logical length is smaller than the stride. The proposed change from length to stride does prevent an unsigned integer underflow, but it introduces a new issue by causing the kernel to process invalid data. My review provides an alternative solution that correctly handles the chunk boundaries without reading garbage data, addressing the root cause of the underflow bug. I've suggested applying this fix in both places where chunk_end is calculated for consistency and robustness.

Comment thread: include/flashinfer/topk.cuh (Outdated)

      const uint32_t chunk_start = cta_in_group * chunk_size;
    - const uint32_t chunk_end = min(chunk_start + chunk_size, length);
    + const uint32_t chunk_end = min(chunk_start + chunk_size, stride);

Severity: critical

Using stride here fixes a potential unsigned underflow issue but introduces a logical bug. When length < stride, this change causes the kernel to read and process garbage data from the input tensor in the range [length, stride). This can lead to an incorrect pivot selection and wrong top-k results.

The original bug is that actual_chunk_size (on the next line) can underflow if chunk_start >= length. A better way to fix this without processing invalid data is to ensure chunk_end is never less than chunk_start.

I suggest this change to correctly clip the chunk to the valid length while avoiding the underflow:

    const uint32_t chunk_end = max(chunk_start, min(chunk_start + chunk_size, length));

Comment thread: include/flashinfer/topk.cuh (Outdated)

      // k >= vocab_size: return all indices
      const uint32_t chunk_start = cta_in_group * chunk_size;
    - const uint32_t chunk_end = min(chunk_start + chunk_size, length);
    + const uint32_t chunk_end = min(chunk_start + chunk_size, stride);

Severity: medium

While this change is a no-op in Basic mode because length equals stride, it's better to use length for clarity, as the logic is concerned with processing length elements.

The same potential for unsigned underflow in the loop bound chunk_end - chunk_start exists here if length could be less than stride in other contexts. Although it's not an issue in Basic mode currently, for robustness and consistency with the suggested fix at line 952, I recommend applying the same safe pattern here.

        const uint32_t chunk_end = max(chunk_start, min(chunk_start + chunk_size, length));

@huangzhilin-hzl (Contributor, Author)

Hi @yzh119
The PR with the fix is now ready for review. Could you please take a look when you have a moment?

For context, this fix addresses the issue we discussed in #2486

@yzh119 (Collaborator) commented Feb 18, 2026

@flashinfer run

@yzh119 (Collaborator) commented Feb 18, 2026

/bot run

@flashinfer-bot (Collaborator)

GitLab MR !325 has been created, and the CI pipeline #44326765 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator)

[FAILED] Pipeline #44326765: 9/20 passed

yzh119 changed the title from "[WIP][bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length" to "[bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length" on Feb 20, 2026.
@yzh119 (Collaborator) left a comment

The failed UT is not relevant to this change.

@yzh119 yzh119 merged commit ace3d15 into flashinfer-ai:main Feb 20, 2026
18 checks passed
