[bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length #2489
Summary of Changes
Hello @huangzhilin-hzl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical bug in the
📝 Walkthrough
Adjusted per-CTA chunk-size/boundary logic in
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Code Review
This pull request attempts to fix a bug in the chunk_end calculation for multi-CTA top-k when the logical length is smaller than the stride. The proposed change from length to stride does prevent an unsigned integer underflow, but it introduces a new issue by causing the kernel to process invalid data. My review provides an alternative solution that correctly handles the chunk boundaries without reading garbage data, addressing the root cause of the underflow bug. I've suggested applying this fix in both places where chunk_end is calculated for consistency and robustness.
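The concern about reading invalid data can be illustrated with a small host-side sketch. Everything here is hypothetical and only mirrors the index arithmetic under discussion: `chunk_max` and `make_padded_row` are illustrative names, and the assumption is that a row occupies `stride` slots of which only the first `length` hold valid values. Whether the real kernel's padding ever contains stale values is not established in this thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: the largest value one CTA would see when it scans
// the half-open range [chunk_start, chunk_end) of a row.
inline float chunk_max(const std::vector<float>& row, uint32_t chunk_start,
                       uint32_t chunk_end) {
  float best = std::numeric_limits<float>::lowest();
  for (uint32_t i = chunk_start; i < chunk_end; ++i) {
    best = std::max(best, row[i]);
  }
  return best;
}

// Hypothetical row with stride = 8 slots, of which only the first
// length = 5 are valid. The last three slots stand in for stale padding.
inline std::vector<float> make_padded_row() {
  return {0.1f, 0.7f, 0.3f, 0.2f, 0.5f, /* padding: */ 9.0f, 8.0f, 7.0f};
}
```

With `chunk_start = 4` and `chunk_size = 4`, clipping the chunk to `stride` scans slots 4 through 7 and picks up the padding value, whereas clipping to `length` scans only slot 4. A pivot derived from the first result would be based on data outside the logical row.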
  const uint32_t chunk_start = cta_in_group * chunk_size;
- const uint32_t chunk_end = min(chunk_start + chunk_size, length);
+ const uint32_t chunk_end = min(chunk_start + chunk_size, stride);
Using stride here fixes a potential unsigned underflow issue but introduces a logical bug. When length < stride, this change causes the kernel to read and process garbage data from the input tensor in the range [length, stride). This can lead to an incorrect pivot selection and wrong top-k results.
The original bug is that actual_chunk_size (on the next line) can underflow if chunk_start >= length. A better way to fix this without processing invalid data is to ensure chunk_end is never less than chunk_start.
I suggest this change to correctly clip the chunk to the valid length while avoiding the underflow:
const uint32_t chunk_end = max(chunk_start, min(chunk_start + chunk_size, length));
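A minimal host-side sketch of the underflow and the clamped alternative, assuming `actual_chunk_size` is computed as `chunk_end - chunk_start` as described above. The helper names and the concrete numbers (a chunk size of 1024, the CTA at index 3, a logical length of 2000) are hypothetical and chosen only to make the wrap-around visible.

```cpp
#include <algorithm>
#include <cstdint>

// Original pattern: clip the chunk to `length` only.
constexpr uint32_t chunk_end_unclamped(uint32_t chunk_start, uint32_t chunk_size,
                                       uint32_t length) {
  return std::min(chunk_start + chunk_size, length);
}

// Suggested pattern: additionally keep chunk_end >= chunk_start.
constexpr uint32_t chunk_end_clamped(uint32_t chunk_start, uint32_t chunk_size,
                                     uint32_t length) {
  return std::max(chunk_start, std::min(chunk_start + chunk_size, length));
}

// Unsigned subtraction wraps modulo 2^32 when chunk_end < chunk_start.
constexpr uint32_t actual_chunk_size(uint32_t chunk_start, uint32_t chunk_end) {
  return chunk_end - chunk_start;
}

// Hypothetical case: chunk_start = 3 * 1024 = 3072, length = 2000.
// Unclamped: chunk_end = 2000, so 2000 - 3072 wraps to 2^32 - 1072.
static_assert(actual_chunk_size(3072u, chunk_end_unclamped(3072u, 1024u, 2000u)) ==
                  4294966224u,
              "unclamped chunk size wraps around");
// Clamped: chunk_end = 3072, so the CTA gets an empty chunk.
static_assert(actual_chunk_size(3072u, chunk_end_clamped(3072u, 1024u, 2000u)) == 0u,
              "clamped chunk size is zero");
```

For a CTA whose chunk starts inside the valid range, both versions agree: with `chunk_start = 1024` the end is 2000 either way, so the clamp only changes behavior for CTAs that start at or past `length`.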
  // k >= vocab_size: return all indices
  const uint32_t chunk_start = cta_in_group * chunk_size;
- const uint32_t chunk_end = min(chunk_start + chunk_size, length);
+ const uint32_t chunk_end = min(chunk_start + chunk_size, stride);
While this change is a no-op in Basic mode because length equals stride, it's better to use length for clarity, as the logic is concerned with processing length elements.
The same potential for unsigned underflow in the loop bound chunk_end - chunk_start exists here if length could be less than stride in other contexts. Although it's not an issue in Basic mode currently, for robustness and consistency with the suggested fix at line 952, I recommend applying the same safe pattern here.
const uint32_t chunk_end = max(chunk_start, min(chunk_start + chunk_size, length));
@flashinfer run
/bot run
[FAILED] Pipeline #44326765: 9/20 passed
yzh119 left a comment
Failed UT is not relevant.
📌 Description
bugfix to #2486
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
Installed pre-commit by running pip install pre-commit (or used your preferred method).
Installed the hooks with pre-commit install.
Ran the hooks manually with pre-commit run --all-files and fixed any reported issues.
🧪 Tests
Tests are passing (unittest, etc.).
Reviewer Notes