fix: guard batchWarpReduceSum with ENABLE_FP8 to fix compilation without FP8 #2328
Conversation
The `batchWarpReduceSum` function in `reduceKernelUtils.cuh` depends on the `PackType` template, which is only defined when `ENABLE_FP8` is set. This causes compilation errors when including `norm.cuh` without `ENABLE_FP8`. Since `batchWarpReduceSum` is unused (dead code), guard it with `#ifdef ENABLE_FP8` to prevent compilation errors. Fixes #2271. Co-authored-by: Zihao Ye <yzh119@users.noreply.github.com>
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
📜 Recent review details
Configuration used: defaults | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (1): tests/utils/test_norm.py
Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a compilation issue where the `batchWarpReduceSum` function prevents building without `ENABLE_FP8`.
Code Review
This pull request correctly fixes a compilation error that occurs when ENABLE_FP8 is not defined. The batchWarpReduceSum function, which depends on the PackType template, is now conditionally compiled using #ifdef ENABLE_FP8. This change is correctly applied to both csrc/nv_internal/tensorrt_llm/common/reduceKernelUtils.cuh and include/flashinfer/trtllm/common/reduceKernelUtils.cuh, resolving the build failure. The approach is sound and the implementation is correct.
Code Review

Summary: This PR fixes a compilation error when including `norm.cuh` without `ENABLE_FP8` by guarding the unused `batchWarpReduceSum` function.

✅ Recommendation: LGTM - Ready to merge. This is a well-executed, minimal fix that correctly addresses the compilation issue following established codebase patterns.
Actionable comments posted: 0
🧹 Nitpick comments (1)
include/flashinfer/trtllm/common/reduceKernelUtils.cuh (1)
161-184: Consider removing `batchWarpReduceSum` in a follow-up cleanup.

The `#ifdef ENABLE_FP8` guard correctly prevents compilation errors when FP8 support is disabled, since `PackType` is only available under that configuration. The comment clearly documents this dependency.

However, verification confirms that `batchWarpReduceSum` is never called anywhere in the codebase: it appears only 2 times (as identical definitions in two files) with zero function calls. Rather than keeping guarded dead code, remove this unused function entirely unless it is part of a public API contract or planned for future use.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- csrc/nv_internal/tensorrt_llm/common/reduceKernelUtils.cuh
- include/flashinfer/trtllm/common/reduceKernelUtils.cuh
🧰 Additional context used
📓 Path-based instructions (1)
include/**/*.cuh
📄 CodeRabbit inference engine (CLAUDE.md)
include/**/*.cuh: Torch headers MUST NOT be included in files within the `include/` directory - keep framework-agnostic CUDA kernels that accept raw pointers
Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed
For performance-critical hot paths, leave comments explaining special algorithmic choices and potential alternatives for future reviewers
Consult the PTX ISA documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/) for low-level instruction details and new GPU architecture features when writing inline PTX assembly
Files:
include/flashinfer/trtllm/common/reduceKernelUtils.cuh
🧠 Learnings (1)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Consult the PTX ISA documentation (https://docs.nvidia.com/cuda/parallel-thread-execution/) for low-level instruction details and new GPU architecture features when writing inline PTX assembly
Applied to files:
include/flashinfer/trtllm/common/reduceKernelUtils.cuh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Deploy Docs
- GitHub Check: claude-review
🔇 Additional comments (1)
csrc/nv_internal/tensorrt_llm/common/reduceKernelUtils.cuh (1)
161-184: Consistent fix applied correctly.

The `#ifdef ENABLE_FP8` guard is correctly applied here, maintaining consistency with the parallel header file in `include/flashinfer/trtllm/common/reduceKernelUtils.cuh`. The identical implementation ensures both code paths remain synchronized.
      return val;
    }

    #ifdef ENABLE_FP8
@claude write a simple python unittest to verify norm module compiles when ENABLE_FP8 is disabled.
Claude finished @yzh119's task in 2m 3s.
Task: Write unittest to verify norm compilation without FP8 ✅
Completed Tasks:
- Read CLAUDE.md to understand testing patterns
- Examine existing norm tests to understand the pattern
- Understand how ENABLE_FP8 is controlled
- Create a test that verifies norm module compiles without FP8
- Add test to appropriate test file
- Commit and push changes
Test Created:
Added `test_norm_compilation_without_fp8()` to `tests/utils/test_norm.py`. This test:
- Creates a JIT spec for the norm module without the `-DENABLE_FP8` flag
- Attempts to build and load the module via JIT compilation
- Verifies compilation succeeds (which would fail before the fix due to `PackType` being undefined)
How it verifies the fix:
The test directly addresses issue #2271 by compiling the norm module without FP8 support. Before the fix, this would fail with:

    error: incomplete type is not allowed

because `batchWarpReduceSum` in `reduceKernelUtils.cuh` depends on `PackType`, which is only defined when `ENABLE_FP8` is set.

With the fix (guarding `batchWarpReduceSum` with `#ifdef ENABLE_FP8`), the compilation succeeds because the problematic code is excluded when FP8 is disabled.
Commit: f694301
Add test_norm_compilation_without_fp8() to verify that the norm module can be successfully compiled without the ENABLE_FP8 flag. This test ensures that the fix for issue #2271 (guarding batchWarpReduceSum with #ifdef ENABLE_FP8) prevents compilation errors when FP8 support is disabled. The test creates a JIT spec without -DENABLE_FP8 and verifies that compilation succeeds. Without the fix, this would fail with an 'incomplete type' error for PackType. Co-authored-by: Zihao Ye <yzh119@users.noreply.github.com>
Code Review

Thank you for this fix! The PR correctly addresses issue #2271 by guarding the unused `batchWarpReduceSum` function with `#ifdef ENABLE_FP8`.

📝 Test Verification Suggestion: To ensure the test actually catches the regression, you could verify that temporarily removing the `#ifdef ENABLE_FP8` guard makes the new test fail.

✅ Verdict: LGTM - This is a clean, well-tested fix that solves the reported issue without introducing any regressions. The test ensures the problem won't resurface.

Style Compliance: The changes follow FlashInfer conventions from CLAUDE.md.

Great work! 🚀
/bot run |
[SUCCESS] Pipeline #41498105: 17/20 passed
Fixes #2271
The `batchWarpReduceSum` function in `reduceKernelUtils.cuh` depends on the `PackType` template, which is only defined when `ENABLE_FP8` is set. This causes compilation errors when including `norm.cuh` without `ENABLE_FP8`.

Since `batchWarpReduceSum` is unused (dead code), guard it with `#ifdef ENABLE_FP8` to prevent compilation errors.

Changes
- Added `#ifdef ENABLE_FP8` guards around `batchWarpReduceSum` in `include/flashinfer/trtllm/common/reduceKernelUtils.cuh`
- Added `#ifdef ENABLE_FP8` guards around `batchWarpReduceSum` in `csrc/nv_internal/tensorrt_llm/common/reduceKernelUtils.cuh`

🤖 Generated with Claude Code