Fix MXFP4/MXFP8 failures in SM120 FAST_BUILD and expand all_tiles[] #2994
aleozlx merged 32 commits into flashinfer-ai:main from …re/sm121-tile-filtering
Conversation
Added a `torch.cuda.synchronize()` drain after failed tactic probes to clear sticky async CUDA errors (e.g. `cudaErrorIllegalInstruction` from failed TMA WS GEMM probes) before they surface during CUDA graph capture.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
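A minimal sketch of this drain pattern, assuming a hypothetical `run_tactic` callable standing in for the autotuner's kernel probe (this is not the actual FlashInfer code; the `drain` parameter is added here only so the sketch can be exercised without a GPU):

```python
def probe_tactic(run_tactic, drain=None):
    """Probe one candidate tactic; on failure, drain sticky async CUDA errors.

    run_tactic is a hypothetical callable that launches the candidate kernel.
    drain defaults to torch.cuda.synchronize and is injectable for testing.
    """
    if drain is None:
        import torch  # deferred so the sketch runs without torch when drain is given
        drain = torch.cuda.synchronize
    try:
        run_tactic()
        drain()  # surface async launch errors inside the probe itself
        return True
    except RuntimeError:
        # A failed probe (e.g. cudaErrorIllegalInstruction from a TMA WS GEMM)
        # can leave a sticky async error; synchronize again so it is consumed
        # here rather than during a later CUDA graph capture.
        try:
            drain()
        except RuntimeError:
            pass
        return False
```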
- Introduced `get_candidate_configs_sm121` function to handle GEMM configurations for the SM121 architecture, which has a reduced shared memory budget.
- Updated `generate_sm120_grouped_gemm_operations` to accommodate the specific tile size constraints for SM121.
- Enhanced `CompilationContext` to differentiate between SM120 and SM121 in the JIT cache.
- Adjusted kernel generation logic to ensure compatibility with the new architecture.

Signed-off-by: Andrii Skliar <askliar@nvidia.com>
- Implemented `gen_cutlass_fused_moe_sm121_module` to generate modules for the SM121 architecture, ensuring compatibility with its shared memory constraints.
- Updated the `get_cutlass_fused_moe_module` function to handle the new SM121 backend.
- Refactored `get_candidate_configs_sm121` to streamline GEMM configuration retrieval.

This enhances the framework's capability to leverage the SM121 architecture effectively.
- Removed `gen_cutlass_fused_moe_sm121_module` and its references from the codebase, simplifying the architecture support.
- Updated `get_cutlass_fused_moe_module` to handle only SM120 and SM103 backends.
- Adjusted kernel generation logic to ensure compatibility with the remaining architectures.

This change streamlines the code and focuses on maintaining support for the more widely used architectures.
- Introduced a filtering mechanism in `get_candidate_tiles` to exclude tile configurations where both M and N are greater than or equal to 128 for the SM121 architecture, addressing shared memory constraints.
- Updated the return statements for various GEMM types to utilize the new filtering function, ensuring the autotuner does not consider invalid configurations.

This change enhances the efficiency of the autotuner by preventing it from evaluating known-bad configurations for SM121.
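The filter described above can be sketched as follows. The `Tile` struct and function name are hypothetical stand-ins for the CUTLASS tile-config types, not the actual `get_candidate_tiles` code:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a CUTLASS tile config: (M, N) CTA tile dims.
struct Tile {
  int m, n;
};

// Drop tiles whose M and N are both >= 128: on SM121 the reduced shared
// memory budget cannot hold the operand staging for such large CTA tiles.
std::vector<Tile> filter_sm121_tiles(std::vector<Tile> const& tiles) {
  std::vector<Tile> out;
  out.reserve(tiles.size());
  for (auto const& t : tiles) {
    if (t.m >= 128 && t.n >= 128) continue;  // known-bad on SM121
    out.push_back(t);
  }
  return out;
}
```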
- Introduced a `skipped_count` variable to track the number of unsupported tactics during profiling in the `AutoTuner` class.
- Added logging to inform users when tactics are skipped, enhancing visibility into the autotuning process.

This change improves the debugging experience by providing insights into the profiling process and potential issues with unsupported tactics.
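The bookkeeping reads roughly like this sketch; `tactics`, `is_supported`, and `measure` are hypothetical stand-ins for the AutoTuner's internals, not the real FlashInfer API:

```python
import logging

logger = logging.getLogger("autotuner")

def profile_tactics(tactics, is_supported, measure):
    """Profile candidate tactics, counting and logging skipped (unsupported) ones."""
    timings = {}
    skipped_count = 0
    for tactic in tactics:
        if not is_supported(tactic):
            skipped_count += 1
            continue
        timings[tactic] = measure(tactic)
    if skipped_count:
        # Surface how many tactics were skipped so profiling gaps are visible.
        logger.warning("Skipped %d unsupported tactic(s) during profiling",
                       skipped_count)
    return timings
```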
…ility

- Updated the `TileShape` structure to include a third dimension `k` for various tile configurations, ensuring accurate representation of tile shapes in the `get_cta_shape_for_config` function.
- Removed the filtering mechanism for large tile configurations specific to SM121, simplifying the candidate tile retrieval process across different GEMM types.
- Adjusted return statements in `get_candidate_tiles` to directly return valid configurations without filtering, enhancing the efficiency of the autotuner.

This change improves the flexibility and accuracy of tile shape configurations in CUTLASS, facilitating better performance across supported architectures.
…imits

- Introduced a new function `tile_fits_smem` to evaluate if tile configurations fit within the shared memory limits for different architectures, improving memory management.
- Updated `get_candidate_configs_sm120` to include the new shared memory fitting logic, ensuring only valid configurations are considered for SM120.
- Adjusted the `get_candidate_configs` function to utilize the shared memory fitting check, enhancing the robustness of GEMM configuration retrieval.

This change optimizes the autotuning process by preventing the selection of configurations that exceed shared memory constraints, leading to better performance across supported architectures.
- Updated the shared memory limit for SM 8.6, 8.9, and 12.x from 102400 bytes to 101376 bytes to reflect accurate constraints.
- Cleaned up logging statements in `get_candidate_configs_sm120` and `get_candidate_configs` for better readability and consistency.

This change ensures that the shared memory calculations are precise, enhancing the reliability of GEMM configuration evaluations.
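A fit check of this kind can be sketched as below. The per-stage staging formula (an M×K tile of A plus a K×N tile of B per pipeline stage) and the function signature are assumptions for illustration, not the removed `tile_fits_smem` implementation:

```cpp
#include <cassert>
#include <cstddef>

// Does a multistage pipeline's operand staging fit the per-CTA shared
// memory budget? Assumes each stage buffers an M x K tile of A and a
// K x N tile of B, each with elem_bytes bytes per element.
bool tile_fits_smem(int m, int n, int k, int stages, int elem_bytes,
                    std::size_t smem_limit_bytes) {
  std::size_t per_stage =
      static_cast<std::size_t>(m) * k * elem_bytes +
      static_cast<std::size_t>(k) * n * elem_bytes;
  return per_stage * stages <= smem_limit_bytes;
}
```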
…heuristic.cpp Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…re/sm121-tile-filtering
This commit removes the `tile_fits_smem` function from the `cutlass_heuristic.cpp` file, which was responsible for checking if a given tile and stage pair fit within the device shared memory limit. The logic for this check has been deemed unnecessary for the current candidate configuration functions. Additionally, the `compilation_context.py` file has been updated to clarify the suffix handling for SM 12.x architectures, ensuring that each variant is correctly represented. The `autotuner.py` file has also been modified to include error handling for CUDA operations, improving robustness during profiling. Overall, these changes streamline the code and enhance error management.
This commit modifies the tile configuration in the `cutlass_heuristic.cpp` file by changing the structure of the `all_tiles` array. The K dimension has been removed from the configuration, simplifying the tile representation to only include the tile enumeration and dimensions M and N. This change streamlines the candidate configuration process for SM120 GEMM operations.
This commit updates the `cutlass_heuristic.cpp` file by removing the K dimension from the TileShape structure, streamlining the tile configuration for various CutlassTileConfig cases. The changes enhance the clarity and efficiency of the candidate configuration process for GEMM operations, particularly for SM120, by focusing solely on the M and N dimensions.
This commit further simplifies the tile shape configuration in `cutlass_heuristic.cpp` by removing unnecessary return statements and streamlining the candidate configuration logic. The changes enhance code clarity and maintain the focus on M and N dimensions, aligning with previous refactoring efforts to optimize GEMM operations for SM120.
…heuristic.cpp This commit updates the candidate configuration logic in `cutlass_heuristic.cpp` for SM120 by introducing additional tile shapes based on the `GROUPED_GEMM` configuration. The changes provide a more comprehensive set of configurations, improving flexibility and performance for GEMM operations. The logic now distinguishes between grouped and non-grouped configurations, ensuring appropriate tile shapes are returned based on the input parameters.
…lass_heuristic.cpp This commit eliminates outdated candidate configurations related to the `FP4_ONLY` and `GROUPED_GEMM` settings in `cutlass_heuristic.cpp`. The removal streamlines the candidate configuration logic, focusing on relevant tile shapes and enhancing code clarity for GEMM operations on SM120.
🧹 Nitpick comments (1)
csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
618-622: Pre-size `result` for the fixed tile table. Tiny cleanup: reserve capacity before the push loop to avoid reallocations.
♻️ Suggested tweak
```diff
 std::vector<CutlassGemmConfig> result;
+result.reserve(std::size(all_tiles));
 for (auto tile : all_tiles) {
   result.push_back(CutlassGemmConfig{tile, MainloopScheduleType::AUTO,
                                      EpilogueScheduleType::AUTO,
                                      ClusterShape::ClusterShape_1x1x1});
 }
```
📒 Files selected for processing (2)
- csrc/nv_internal/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- flashinfer/compilation_context.py
🚧 Files skipped from review as they are similar to previous changes (1)
- flashinfer/compilation_context.py
/bot run

tests look good

public CI seemed cancelled for some reason. restarted and waiting for auto merge

wait, seems the pre-commit check has failed. pls address that by re-running pre-commit
…re/sm121-tile-filtering
Head branch was pushed to by a user without write access
🧹 Nitpick comments (1)
flashinfer/comm/allreduce.py (1)
729-729: Consider exposing `routed_scaling_factor` as a function parameter. This is hardcoded to `None` with no way for callers to pass a different value, unlike other optional MOE Finalize parameters (`expert_scale_factor`, `shared_expert_output`) which are exposed in the function signature. If this is intentional (feature not yet ready), a brief comment would help clarify.
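The suggested change amounts to a signature sketch like the following; the function name, parameters, and body are hypothetical stand-ins for the `flashinfer/comm/allreduce.py` call site, not the actual API:

```python
def moe_finalize_allreduce(
    hidden_states,
    expert_scale_factor=None,
    shared_expert_output=None,
    routed_scaling_factor=None,  # newly exposed instead of a hardcoded None
):
    """Expose routed_scaling_factor alongside the other optional MOE
    Finalize parameters and forward it to the underlying call."""
    params = {
        "expert_scale_factor": expert_scale_factor,
        "shared_expert_output": shared_expert_output,
        # forwarded, replacing the previous hardcoded routed_scaling_factor=None
        "routed_scaling_factor": routed_scaling_factor,
    }
    return params
```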
📒 Files selected for processing (4)
- flashinfer/aot.py
- flashinfer/autotuner.py
- flashinfer/comm/allreduce.py
- flashinfer/jit/core.py
✅ Files skipped from review due to trivial changes (3)
- flashinfer/jit/core.py
- flashinfer/aot.py
- flashinfer/autotuner.py
Head branch was pushed to by a user without write access
@aleozlx I have looked more into the pre-commit changes - those are also on main. I will do a separate PR.
Force-pushed from 862593c to 216802d
/bot run
Problem
MXFP4 and MXFP8 GEMM operations were failing on SM120 because:
leaving the autotuner with no viable candidate in some cases
Fix
workloads
Summary by CodeRabbit
Performance & Optimizations
Build/Compatibility
Chores
Bug Fixes
Co-authored-by: samuellees <lsam@nvidia.com>