[CI]【Hackathon 10th Spring No.45-part2】Add SM75/SM80 compile guards for cutlass and MoE tail ops #7504

Open
bobby-cloudforge wants to merge 1 commit into PaddlePaddle:develop from CloudForge-Solutions:task/h10-45-sm-tier-compile-guards1

@bobby-cloudforge bobby-cloudforge commented Apr 20, 2026

Motivation

PR #6488 added conditional compilation guards to cpp_extensions.cc for most SM80+ and SM75+ ops, enabling FastDeploy to compile and run on T4 (SM75) and V100 (SM70) hardware. However, 12 ops added in later commits (post-PR #6488 branch point) were not covered by those guards:

  • 5 SM75+ ops (cutlass_scaled_mm, cutlass_scaled_mm_azp, static_scaled_fp8_quant, dynamic_scaled_fp8_quant, dynamic_per_token_scaled_fp8_quant) — only compiled when ENABLE_SCALED_MM_C2X / ENABLE_SM75_EXT_OPS is set (SM75+), yet registered unconditionally in the module
  • 7 SM80+ ops (prefill_permute_to_masked_gemm, depermute_prefill_combine, radix_topk_ragged_transform, dsk_attn_write_cache, indexer_k_quant_and_cache, cp_gather_indexer_k_quant_cache, per_token_group_fp8_quant) — from DSA/MoE kernel sources only compiled at SM80+, but missing guards in the pybind module

Without these guards, python setup.py build_ext fails with undefined symbol link errors on T4/V100 targets.

Modifications

Single file changed: custom_ops/gpu_ops/cpp_extensions.cc (+8 lines, -13 lines net)

Block 1 — SM75+ ops (wraps 5 ops with #ifdef ENABLE_SM75_EXT_OPS / #endif):

| Op | Source |
|---|---|
| cutlass_scaled_mm | w8a8/scaled_mm_entry.cu |
| cutlass_scaled_mm_azp | w8a8/scaled_mm_entry.cu |
| static_scaled_fp8_quant | quantization/common.cu |
| dynamic_scaled_fp8_quant | quantization/common.cu |
| dynamic_per_token_scaled_fp8_quant | quantization/common.cu |

Block 2 — SM80+ tail ops (wraps 7 ops with #ifdef ENABLE_SM80_EXT_OPS / #endif):

| Op | Source |
|---|---|
| prefill_permute_to_masked_gemm | moe/prefill_permute_to_masked_gemm.cu |
| depermute_prefill_combine | moe/depermute_prefill_combine.cu |
| radix_topk_ragged_transform | sparse_indexer/indexer_topk.cu |
| dsk_attn_write_cache | append_attn/ds_mla_cache_kernel.cu |
| indexer_k_quant_and_cache | append_attn/ds_mla_cache_kernel.cu |
| cp_gather_indexer_k_quant_cache | append_attn/ds_mla_cache_kernel.cu |
| per_token_group_fp8_quant | append_attn/ds_mla_cache_kernel.cu |

Guard balance after this PR: #if*=18, #endif=18 — balanced.
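
A balance count like the one above can be reproduced with a few lines of Python. The following is a minimal sketch, not the script used for this PR: the function name `guard_balance` and the sample source text are illustrative, and the check is purely line-based, so it does not skip directives that appear inside comments or string literals.

```python
import re
from typing import Tuple

# Opening directives: #if, #ifdef, #ifndef (optionally with spaces after '#').
_OPEN = re.compile(r"^\s*#\s*(if|ifdef|ifndef)\b")
# Closing directive: #endif.
_CLOSE = re.compile(r"^\s*#\s*endif\b")


def guard_balance(source: str) -> Tuple[int, int]:
    """Count opening and closing preprocessor guards in C/C++ source text."""
    opens = closes = 0
    for line in source.splitlines():
        if _OPEN.match(line):
            opens += 1
        elif _CLOSE.match(line):
            closes += 1
    return opens, closes


# Illustrative snippet only; it is not the real contents of cpp_extensions.cc.
SAMPLE = """
#ifdef ENABLE_SM75_EXT_OPS
  m.def("cutlass_scaled_mm", &CutlassScaledMm);
#endif
#ifdef ENABLE_SM80_EXT_OPS
  m.def("per_token_group_fp8_quant", &PerTokenGroupFp8Quant);
#endif
"""

opens, closes = guard_balance(SAMPLE)
print(f"#if*={opens}, #endif={closes}, balanced={opens == closes}")
```

To run it against the real file, read `custom_ops/gpu_ops/cpp_extensions.cc` and pass its text to `guard_balance`.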

Usage or Command

# Build for V100 (SM70) — SM75 and SM80 op groups excluded
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

# Build for T4 (SM75) — SM80 op group excluded, SM75 quant ops included
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

# Build for A100+ (SM80+) — all op groups included
CUDA_VISIBLE_DEVICES=0 python setup.py build_ext --inplace

Guard activation is driven by the compile macros already defined in setup_ops.py by PR #6488 (ENABLE_SM75_EXT_OPS at cc≥75, ENABLE_SM80_EXT_OPS at cc≥80). No changes to setup_ops.py are needed in this PR.
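
The threshold rule described above can be pictured as a small mapping from compute capability to compiler defines. This is a sketch under stated assumptions: only the two macro names and their cc thresholds come from the PR text, while `ext_op_defines`, the `-D` flag spelling, and the table layout are hypothetical and are not taken from setup_ops.py.

```python
from typing import List

# (minimum compute capability, macro) pairs as quoted in the PR description.
# The table layout itself is an assumption made for illustration.
_EXT_OP_MACROS = [
    (75, "ENABLE_SM75_EXT_OPS"),
    (80, "ENABLE_SM80_EXT_OPS"),
]


def ext_op_defines(cc: int) -> List[str]:
    """Return a -D flag for every op group whose minimum capability is met."""
    return [f"-D{macro}" for min_cc, macro in _EXT_OP_MACROS if cc >= min_cc]


for gpu, cc in [("V100", 70), ("T4", 75), ("A100", 80)]:
    print(gpu, cc, ext_op_defines(cc))
```

Under this rule a V100 build defines neither macro, a T4 build defines only the SM75 macro, and an A100 build defines both, which matches the three build scenarios listed under Usage.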

Accuracy Tests

This PR only modifies conditional compilation guards in the pybind11 module registration block. No kernel logic, numerical computation, or runtime dispatch is altered. On SM80+ hardware (where all macros are defined), the full op set remains registered and behavior is identical to before this PR.

Checklist

paddle-bot bot commented Apr 20, 2026

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Apr 20, 2026
@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 AI Code Review | 2026-04-20 14:03 CST

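📋 Review Summary

PR overview: Adds SM75+/SM80+ conditional compilation guards to 12 pybind11 op registrations in cpp_extensions.cc that were previously missed, fixing undefined symbol link errors on lower-compute-capability GPUs such as T4 and V100.
Scope of change: custom_ops/gpu_ops/cpp_extensions.cc (+8 lines, -13 lines net)
Impact tags: OP, CI

Issues

No blocking issues found.

✅ Verified items

| Check | Result |
|---|---|
| Guard macros follow the existing pattern (ENABLE_SM75_EXT_OPS / ENABLE_SM80_EXT_OPS) | ✅ Match the macros defined by PR #6488 |
| #if* / #endif pairing is balanced | ✅ 18/18 balanced |
| Python call-layer compatibility | ✅ All guarded ops are brought in through dynamic imports at the Python layer, so no undefined symbol is triggered on low-SM hardware |
| All 5 SM75+ ops are covered | cutlass_scaled_mm, cutlass_scaled_mm_azp, static_scaled_fp8_quant, dynamic_scaled_fp8_quant, dynamic_per_token_scaled_fp8_quant |
| All 7 SM80+ ops are covered | prefill_permute_to_masked_gemm, depermute_prefill_combine, radix_topk_ragged_transform, dsk_attn_write_cache, indexer_k_quant_and_cache, cp_gather_indexer_k_quant_cache, per_token_group_fp8_quant |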
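
The Python-layer compatibility check relies on guarded ops being looked up dynamically instead of being imported unconditionally at module load. The sketch below shows one way such a lookup can degrade gracefully. It is an assumption-laden illustration, not FastDeploy's actual import code: `resolve_op`, `require_op`, and the `t4_build` stand-in object are all hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional


def resolve_op(ops_module: Any, name: str) -> Optional[Callable]:
    """Return the op if this build registered it, otherwise None."""
    return getattr(ops_module, name, None)


def require_op(ops_module: Any, name: str, min_sm: int) -> Callable:
    """Return the op, or raise a clear error when the build excluded it."""
    op = resolve_op(ops_module, name)
    if op is None:
        raise RuntimeError(
            f"{name} is not registered in this build (requires SM{min_sm}+)"
        )
    return op


# Stand-in for an extension module built for T4 (SM75): the SM75 op is
# registered, the SM80 op is absent. A real build would expose compiled ops.
t4_build = SimpleNamespace(cutlass_scaled_mm=lambda *args: "called")

print(resolve_op(t4_build, "cutlass_scaled_mm") is not None)
print(resolve_op(t4_build, "per_token_group_fp8_quant") is None)
```

With a lookup of this kind, a missing op surfaces as a Python-level `None` or a descriptive exception at call time, not as a failure when the extension module is loaded.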

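Overall assessment

The change is clearly scoped and logically correct, and it is an effective complement to PR #6488. It only inserts conditional compilation guards into the pybind11 module registration block and does not change any kernel logic or runtime behavior. On SM80+ hardware, behavior is identical to before the change.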
