[CI]【Hackathon 10th Spring No.45-part2】Add SM75/SM80 compile guards for cutlass and MoE tail ops #7504
Open
bobby-cloudforge wants to merge 1 commit into PaddlePaddle:develop
Thanks for your contribution!
PaddlePaddle-bot
left a comment
🤖 AI Code Review |
2026-04-20 14:03 CST
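📋 Review Summary
PR overview: Adds SM75+/SM80+ conditional compilation guards to 12 pybind11 op registrations in `cpp_extensions.cc` that were previously missed, fixing undefined symbol link errors on lower-compute-capability GPUs such as T4/V100.
Change scope: `custom_ops/gpu_ops/cpp_extensions.cc` (+8 lines, -13 lines net)
Impact tags: OP CI
Issues
No blocking issues found.
✅ Verification items
| Check | Result |
|---|---|
| Guard macros consistent with the existing pattern (`ENABLE_SM75_EXT_OPS` / `ENABLE_SM80_EXT_OPS`) | ✅ Match the macros defined in PR #6488 |
| `#if*` / `#endif` pairing balanced | ✅ 18/18 balanced |
| Python call-layer compatibility | ✅ All guarded ops are brought in through dynamic imports at the Python layer, so low-SM hardware does not hit undefined symbols |
| 5 SM75+ ops fully covered | ✅ `cutlass_scaled_mm`, `cutlass_scaled_mm_azp`, `static_scaled_fp8_quant`, `dynamic_scaled_fp8_quant`, `dynamic_per_token_scaled_fp8_quant` |
| 7 SM80+ ops fully covered | ✅ `prefill_permute_to_masked_gemm`, `depermute_prefill_combine`, `radix_topk_ragged_transform`, `dsk_attn_write_cache`, `indexer_k_quant_and_cache`, `cp_gather_indexer_k_quant_cache`, `per_token_group_fp8_quant` |
Overall assessment
The change is clearly scoped and logically correct, and it is an effective follow-up to PR #6488. It only inserts conditional compilation guards into the pybind11 module registration block and does not change any kernel logic or runtime behavior. On SM80+ hardware, behavior is identical to before the change.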
Motivation
PR #6488 added conditional compilation guards to `cpp_extensions.cc` for most SM80+ and SM75+ ops, enabling FastDeploy to compile and run on T4 (SM75) and V100 (SM70) hardware. However, 12 ops added in later commits (after the PR #6488 branch point) were not covered by those guards:

- 5 SM75+ ops (`cutlass_scaled_mm`, `cutlass_scaled_mm_azp`, `static_scaled_fp8_quant`, `dynamic_scaled_fp8_quant`, `dynamic_per_token_scaled_fp8_quant`): only compiled when `ENABLE_SCALED_MM_C2X` / `ENABLE_SM75_EXT_OPS` is set (SM75+), yet registered unconditionally in the module
- 7 SM80+ tail ops (`prefill_permute_to_masked_gemm`, `depermute_prefill_combine`, `radix_topk_ragged_transform`, `dsk_attn_write_cache`, `indexer_k_quant_and_cache`, `cp_gather_indexer_k_quant_cache`, `per_token_group_fp8_quant`): from DSA/MoE kernel sources only compiled at SM80+, but missing guards in the pybind module

Without these guards, `python setup.py build_ext` fails with undefined symbol link errors on T4/V100 targets.

Modifications
Single file changed: `custom_ops/gpu_ops/cpp_extensions.cc` (+8 lines, -13 lines net)

Block 1: SM75+ ops (wraps 5 ops with `#ifdef ENABLE_SM75_EXT_OPS` / `#endif`):

| Op | Source file |
|---|---|
| `cutlass_scaled_mm` | `w8a8/scaled_mm_entry.cu` |
| `cutlass_scaled_mm_azp` | `w8a8/scaled_mm_entry.cu` |
| `static_scaled_fp8_quant` | `quantization/common.cu` |
| `dynamic_scaled_fp8_quant` | `quantization/common.cu` |
| `dynamic_per_token_scaled_fp8_quant` | `quantization/common.cu` |

Block 2: SM80+ tail ops (wraps 7 ops with `#ifdef ENABLE_SM80_EXT_OPS` / `#endif`):

| Op | Source file |
|---|---|
| `prefill_permute_to_masked_gemm` | `moe/prefill_permute_to_masked_gemm.cu` |
| `depermute_prefill_combine` | `moe/depermute_prefill_combine.cu` |
| `radix_topk_ragged_transform` | `sparse_indexer/indexer_topk.cu` |
| `dsk_attn_write_cache` | `append_attn/ds_mla_cache_kernel.cu` |
| `indexer_k_quant_and_cache` | `append_attn/ds_mla_cache_kernel.cu` |
| `cp_gather_indexer_k_quant_cache` | `append_attn/ds_mla_cache_kernel.cu` |
| `per_token_group_fp8_quant` | `append_attn/ds_mla_cache_kernel.cu` |

Guard balance after this PR: `#if*`=18, `#endif`=18 (balanced).

Usage or Command
Guard activation is driven by the compile macros already defined in `setup_ops.py` by PR #6488 (`ENABLE_SM75_EXT_OPS` at cc≥75, `ENABLE_SM80_EXT_OPS` at cc≥80). No changes to `setup_ops.py` are needed in this PR.

Accuracy Tests
This PR only modifies conditional compilation guards in the pybind11 module registration block. No kernel logic, numerical computation, or runtime dispatch is altered. On SM80+ hardware (where all macros are defined), the full op set remains registered and behavior is identical to before this PR.
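
To make the hardware-dependent behavior concrete, the mapping from compute capability to guard macros, and from there to how many of the 12 newly guarded ops stay registered, can be sketched as below. This is only an illustration and not the logic in `setup_ops.py`: the names `GUARDS`, `guard_macros`, and `guarded_ops_registered` are hypothetical, and only the thresholds (cc≥75, cc≥80) and the op counts (5 and 7) come from this PR description.

```python
# Thresholds and op counts are taken from the PR description;
# the table layout and function names are illustrative assumptions.
GUARDS = [
    # (macro name, minimum compute capability, number of ops this PR guards)
    ("ENABLE_SM75_EXT_OPS", 75, 5),
    ("ENABLE_SM80_EXT_OPS", 80, 7),
]


def guard_macros(cc: int) -> list[str]:
    """Return the guard macros that would be defined for compute capability `cc`."""
    return [macro for macro, min_cc, _ in GUARDS if cc >= min_cc]


def guarded_ops_registered(cc: int) -> int:
    """Count how many of the 12 newly guarded ops remain registered at `cc`."""
    return sum(n_ops for _, min_cc, n_ops in GUARDS if cc >= min_cc)


for name, cc in [("V100", 70), ("T4", 75), ("SM80+", 80)]:
    print(f"{name} (cc={cc}): macros={guard_macros(cc)}, "
          f"guarded ops registered={guarded_ops_registered(cc)}")
```

Under these assumptions, an SM70 build registers none of the 12 ops, an SM75 build registers the 5 SM75+ ops, and an SM80+ build registers all 12, which matches the statement that SM80+ behavior is unchanged.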
Checklist
- `#ifdef` / `#endif` guard insertion
- Guard macros (`ENABLE_SM75_EXT_OPS`, `ENABLE_SM80_EXT_OPS`)
- Guard balance (`#if*`=18, `#endif`=18)
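
The guard balance figure quoted in this PR (`#if*`=18, `#endif`=18) could be reproduced with a small counting script along the following lines. This is a sketch, not the tool the author used: `guard_balance` is a hypothetical helper, and the C++ inside `SAMPLE` (including the bound function names `CutlassScaledMm` and `PerTokenGroupFp8Quant`) is made up purely to show the guard pattern.

```python
import re

# Matches #if, #ifdef, and #ifndef (but not #elif or #endif).
OPEN = re.compile(r"^\s*#\s*if(n?def)?\b")
# Matches #endif.
CLOSE = re.compile(r"^\s*#\s*endif\b")


def guard_balance(source: str) -> tuple[int, int]:
    """Return (number of #if* directives, number of #endif directives)."""
    opens = closes = 0
    for line in source.splitlines():
        if CLOSE.match(line):
            closes += 1
        elif OPEN.match(line):
            opens += 1
    return opens, closes


# Hypothetical excerpt: the m.def targets are placeholders, not the real symbols.
SAMPLE = """
#ifdef ENABLE_SM75_EXT_OPS
  m.def("cutlass_scaled_mm", &CutlassScaledMm, "cutlass_scaled_mm");
#endif
#ifdef ENABLE_SM80_EXT_OPS
  m.def("per_token_group_fp8_quant", &PerTokenGroupFp8Quant, "per_token_group_fp8_quant");
#endif
"""

print(guard_balance(SAMPLE))  # prints (2, 2)
```

Run against the contents of `custom_ops/gpu_ops/cpp_extensions.cc` after this PR, the same function would be expected to return `(18, 18)` if the counts stated above hold. Equal counts are a necessary check only; they do not prove that each `#endif` closes the intended block.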