[ROCm] HIP BF16 adaptation: register BF16 layer_norm, bypass MIOpen BF16 softmax, skip cuDNN-only conv2d fusion passes on HIP #78711
Open

austin1997 wants to merge 3 commits into PaddlePaddle:develop
Conversation
Add phi::bfloat16 to the layer_norm / layer_norm_grad kernel registrations under PADDLE_WITH_HIP so the existing templated implementation is exposed for BF16 inputs on ROCm. Matches the FLOAT16 treatment of the mean/variance output dtype (promoted to FLOAT32 for numerical stability). Unblocks BF16 inference of the PaddleOCR-VL-1.5 SigLIP-style vision encoder on MI300X (gfx942), which previously required PaddleX to keep the whole visual + mlp_AR subgraph in FP32 via _keep_in_fp32_modules.
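The FP32 promotion of the normalization statistics can be illustrated with a small NumPy sketch. This is not Paddle code: BF16 is emulated here by truncating the FP32 mantissa, and all function names are illustrative.

```python
import numpy as np

def to_bf16(x):
    # Emulate bfloat16 storage by truncating an FP32 mantissa to 8 bits
    # (round-toward-zero; real hardware rounds to nearest even).
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

def layer_norm_bf16(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis with BF16 inputs/outputs but FP32
    statistics, mirroring the mean/variance dtype promotion in this PR."""
    xb = to_bf16(x).astype(np.float32)       # BF16 input, widened for the math
    mean = xb.mean(axis=-1, keepdims=True)   # kept in FP32
    var = xb.var(axis=-1, keepdims=True)     # kept in FP32
    y = (xb - mean) / np.sqrt(var + eps) * gamma + beta
    return to_bf16(y), mean, var             # output in BF16, stats in FP32

np.random.seed(0)
x = np.random.randn(4, 768).astype(np.float32)
y, mean, var = layer_norm_bf16(x, 1.0, 0.0)
```

Keeping the statistics in FP32 matters because a BF16 accumulator has only ~8 bits of mantissa, so the mean/variance reduction over hundreds of elements would lose precision long before the elementwise normalization does.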
MIOpen (as of ROCm 7.x) returns MIOPEN_STATUS_NOT_IMPLEMENTED for miopenSoftmaxForward_V2 with miopenBFloat16, so the gpudnn softmax path cannot be used for BF16 on HIP. When the input dim exceeds the warp softmax cap, route BF16 through the existing matrix softmax kernel instead of letting the call fall into the MIOpen branch. Also gate the CUDNN_VERSION < 8100 BF16 fallback specialization on !defined(PADDLE_WITH_HIP) — that branch dispatched into MIOpen too and would trip the same NOT_IMPLEMENTED failure on ROCm.
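The routing decision described above can be sketched in Python (a model of the dispatch logic, not the actual CUDA/HIP driver; the cap value and function names are illustrative):

```python
import numpy as np

WARP_SOFTMAX_MAX_DIM = 1024  # illustrative cap; the real threshold lives in softmax_gpudnn.h

def stable_softmax(x, axis=-1):
    # The matrix-softmax computation: subtract the row max before
    # exponentiating so large logits cannot overflow.
    m = x.max(axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=axis, keepdims=True)

def dispatch_softmax(x, dim, *, is_hip, is_bf16):
    """Model of the routing added in SoftmaxForwardCUDAKernelDriverImpl:
    BF16 on HIP must never reach the MIOpen branch."""
    if dim <= WARP_SOFTMAX_MAX_DIM:
        return "warp", stable_softmax(x)
    if is_hip and is_bf16:
        # stands in for LaunchKeMatrixSoftmaxForwardKernel
        return "matrix", stable_softmax(x)
    # cuDNN on CUDA, MIOpen on HIP; unreachable for BF16-on-HIP above
    return "gpudnn", stable_softmax(x)

path, probs = dispatch_softmax(np.random.randn(2, 2048), 2048,
                               is_hip=True, is_bf16=True)
```

The key property is that the `is_hip and is_bf16` branch is checked before the gpudnn fallback, so the `MIOPEN_STATUS_NOT_IMPLEMENTED` path is never entered for BF16 inputs.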
conv2d_add_fuse_pass and conv2d_add_act_fuse_pass rewrite conv2d+add[+act]
into the fused_conv2d_add_act op, which has only a cuDNN GPUDNN kernel.
On ROCm the rewrite succeeds but kernel dispatch later fails because no
HIP kernel is registered, so PaddleX currently works around this by
calling config.delete_pass("conv2d_add_act_fuse_pass") and
config.delete_pass("conv2d_add_fuse_pass") under paddle.is_compiled_with_rocm()
in paddlex/inference/models/runners/paddle_static/runner.py.
Gate both the pass registration (REGISTER_IR_PASS / USE_PIR_PASS) and the
pass-builder inclusion on PADDLE_WITH_CUDA so the rewrite never runs on
HIP builds, making the PaddleX delete_pass calls unnecessary.
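The intended effect of the `#ifdef` gating can be modeled in a few lines of Python (a sketch, not Paddle source; the non-conv pass name is a placeholder):

```python
# On a build without CUDA, the cuDNN-only fusion passes are simply never
# registered, so downstream code has nothing to delete.
CUDNN_ONLY_PASSES = ["conv2d_add_fuse_pass", "conv2d_add_act_fuse_pass"]

def build_gpu_passes(compiled_with_cuda: bool) -> list[str]:
    passes = ["example_cleanup_pass"]  # placeholder for the rest of kPirGpuPasses
    if compiled_with_cuda:             # stands in for #ifdef PADDLE_WITH_CUDA
        passes += CUDNN_ONLY_PASSES
    return passes

hip_passes = build_gpu_passes(compiled_with_cuda=False)
cuda_passes = build_gpu_passes(compiled_with_cuda=True)
```

Registering conditionally at compile time is preferable to deleting at runtime: it removes the window where the rewrite has already produced `fused_conv2d_add_act` but no kernel exists to dispatch to.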
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
This was referenced Apr 18, 2026
PR Category
Environment Adaptation
PR Types
New features
Description
Adapts the Paddle framework's BF16 precision type on the ROCm/HIP backend, so that VLMs with SigLIP-style vision encoders such as PaddleOCR-VL can run native BF16 inference on AMD GPUs without falling back the vision subgraph to FP32. Fixes the three remaining gaps described in #78710 and complements the already-merged #78587 (HIP BF16 conv kernel registration). The three changes are mutually independent, and CUDA behavior is completely unchanged.
1) Register BF16 kernels for `layer_norm` / `layer_norm_grad` on HIP

The `PADDLE_WITH_HIP` registration blocks in `paddle/phi/kernels/gpu/layer_norm_kernel.cu` and `layer_norm_grad_kernel.cu` previously covered only `float` and `phi::float16`. The templated implementations (`LayerNormKernel<T, GPUContext>` / `LayerNormGradKernel<T, GPUContext>`) already support `phi::bfloat16`; only the registration was missing. This PR adds the registration and, following the existing FP16 treatment, promotes the `mean` / `variance` output dtype to FP32 for numerical stability.

2) Route BF16 softmax through the matrix kernel, bypassing MIOpen

On ROCm 7.x, `miopenSoftmaxForward_V2` returns `MIOPEN_STATUS_NOT_IMPLEMENTED` for `miopenBFloat16`. In `paddle/phi/kernels/gpudnn/softmax_gpudnn.h`, once `dim` exceeds the warp-softmax threshold the code calls MIOpen by default, which fails immediately at runtime. This PR adds a `PADDLE_WITH_HIP` + `std::is_same<T, phi::bfloat16>` check inside `SoftmaxForwardCUDAKernelDriverImpl` that forces BF16 inputs through the existing `LaunchKeMatrixSoftmaxForwardKernel`. It also guards the `CUDNN_VERSION < 8100` BF16 fallback specialization with `!defined(PADDLE_WITH_HIP)`, so that branch cannot fall into MIOpen on ROCm either.

3) Stop registering `conv2d_add_fuse_pass` / `conv2d_add_act_fuse_pass` on HIP

These two PIR passes rewrite `conv2d + add[+ act]` into `fused_conv2d_add_act`, which has only a cuDNN GPUDNN kernel. This is a different operator from the `conv2d` / `conv3d` kernels registered by #78587, so #78587 does not affect this path. On HIP wheels the rewrite succeeds but kernel dispatch fails at execution time, and PaddleX must call `delete_pass(...)` manually in its runner to make the model run. This PR wraps `REGISTER_IR_PASS`, `USE_PIR_PASS`, and the two pass references in the `kPirGpuPasses` list with `#ifdef PADDLE_WITH_CUDA`. CUDA behavior is completely unchanged; on HIP the passes no longer exist, so the PaddleX `delete_pass` workaround becomes a no-op.

Testing and verification

Single-operator level: `legacy_test/test_layer_norm_op.py` and `test_softmax_op.py` already contain BF16 cases and can be reused directly after a HIP build.

End-to-end: PaddleOCR-VL-1.5 runs full BF16 inference on `test_ocr.png` on an AMD MI300X (gfx942) with ROCm 7.2; the output text is semantically identical to the FP32-fallback path. The full benchmark is in the appendix below.

All three changes are guarded by `#ifdef PADDLE_WITH_HIP` / `#ifdef PADDLE_WITH_CUDA`; no CUDA codepath is touched and CUDA behavior is fully preserved.

Companion PaddleX cleanup

The PaddleX-side `_keep_in_fp32_modules = ["visual", "mlp_AR"]` and the 4 `delete_pass` workarounds in `runner.py` are removed in sync by PaddlePaddle/PaddleX#5096 (partially overlapping with PaddlePaddle/PaddleX#5077).

Appendix: BF16 end-to-end benchmark (excerpted from BF16_BENCHMARK.md)
PaddleOCR-VL-1.5 on ROCm — FP32-fallback vs Native-BF16
End-to-end benchmark of the BF16 framework adaptation task described in TASK.md. It compares the status-quo PaddleX path (`_keep_in_fp32_modules = ["visual", "mlp_AR"]` forces the vision tower and multimodal projector to run in FP32 on ROCm) against the same pipeline with that list cleared at runtime, so the entire model (vision encoder, projector, and LLM decoder) runs natively in BF16.

Environment

- Paddle: `0.0.0` dev build against `/opt/rocm`, on branch `rocm7-dev` (HEAD `f2887a57dd`), with the 4 uncommitted HIP BF16 fixes applied and compiled into the installed wheel
- PaddleX: `PaddleX/` (release/3.5)
- Pipeline config: `PaddleOCR-VL-native.yaml` (batch_size=4096, native genai backend)
- Input: `test_ocr.png` (Chinese boarding-pass photo)
- Timing: `paddle.device.synchronize()` bracketing `pipeline.predict(...)`, `time.perf_counter` wall-clock

Runs are gated by monkey-patching `PaddleOCRVLForConditionalGeneration._keep_in_fp32_modules` before `create_pipeline(...)`; there are no source edits in PaddleX or Paddle beyond the four already-uncommitted framework diffs.

End-to-end wall-clock
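The synchronize-bracketed timing described above amounts to the following harness (a sketch: on GPU, `paddle.device.synchronize` would be passed as `synchronize`, and `fn` would wrap `pipeline.predict(...)`; here a CPU-only stand-in is timed):

```python
import time

def timed_runs(fn, *, synchronize=lambda: None, warmup=1, runs=3):
    """Wall-clock timing with a device synchronize bracketing each call,
    so queued GPU work cannot leak across the measured interval."""
    for _ in range(warmup):       # warm-up pass excluded from the samples
        fn()
    samples = []
    for _ in range(runs):
        synchronize()             # drain pending work before starting the clock
        t0 = time.perf_counter()
        fn()
        synchronize()             # make sure the measured call has finished
        samples.append(time.perf_counter() - t0)
    return samples

samples = timed_runs(lambda: sum(range(100_000)))
```

Without the trailing synchronize, an asynchronous backend would report only the kernel-launch time, not the actual execution time.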
Output equivalence
Text content is semantically identical between modes. A character-level diff shows differences only around `"TAIYUAN"` and `"身份识别ID NO."`, and the BF16 rendering there actually matches the original boarding-pass layout better. The remaining content is identical token-for-token; the BF16 output is 417 chars vs 416 for FP32 (one extra newline).

Verdict: BF16 is at least as correct as FP32-fallback on this document.
Per-op kernel-level breakdown (`rocprofv3 --kernel-trace --stats`)

Totals are aggregated across the 4 timed invocations per mode (3 runs + 1 warm-up).
The core finding: native BF16 eliminates 884 ms of FP32 GEMM work per batch (from 18 756 calls down to 1 316). Even after reallocating some of that work to BF16 GEMMs (+511 ms), the net GEMM savings are ~370 ms. Cast/Copy savings (~55 ms) come from no longer round-tripping the vision-encoder activations through FP32 at the boundary.
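The net-savings figure quoted above follows directly from the two deltas (a trivial sanity computation over the numbers in this section):

```python
# GEMM time deltas per batch, in milliseconds (figures quoted above).
fp32_gemm_removed_ms = 884  # FP32 GEMM work eliminated in native-BF16 mode
bf16_gemm_added_ms = 511    # work reallocated to BF16 GEMMs
net_gemm_savings_ms = fp32_gemm_removed_ms - bf16_gemm_added_ms  # ~370 ms net
```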
Top-15 kernels by (FP32 − BF16) delta

- `Cijk_Ailk_Bljk_SB_MT32x64x64_MI16x16x4x1_…`
- `Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_…`
- `phi::funcs::VectorizedBroadcastKernel<Add…>`
- `Cijk_Alik_Bljk_SB_MT64x32x64_MI16x16x4x1_…`
- `Cijk_Ailk_Bljk_SB_MT128x128x32_MI16x16x4x1_…`
- `Cijk_Ailk_Bljk_SB_MT64x64x32_MI16x16x4x1_…`
- `Eigen::internal::EigenMetaKernel<…>`
- `phi::funcs::VectorizedElementwiseKernel<float>`
- `Cijk_Ailk_Bljk_SB_MT16x16x64_MI16x16x4x1_…`
- `__amd_rocclr_copyBuffer`
- `phi::funcs::VectorizedBroadcastKernel<Add…>`
- `phi::funcs::VectorizedBroadcastKernel<Mul…>`
- `phi::funcs::VectorizedElementwiseKernel<float>`
- `phi::UnaryElementwiseKernel<ScaleFunctor<float>>`
- `phi::funcs::LayerNormForward<float, float, 512, …>`

`SB` = single-precision inputs (FP32); `BBS` = BF16 inputs. The entire column of FP32 GEMMs disappears in native-BF16 mode, replaced by the corresponding BBS variants (captured in the "GEMM (BF16 input)" row above).
The three framework changes (BF16 `layer_norm` registration, BF16 softmax routed through the matrix kernel since MIOpen's `miopenBFloat16` support is `NOT_IMPLEMENTED`, and skipping the cuDNN-only `conv2d_add[_act]_fuse_pass` on ROCm) are sufficient for PaddleOCR-VL-1.5 to run natively in BF16 on MI300X with no FP32 fallbacks in the vision encoder or multimodal projector, matching the FP32-fallback path's output quality while reducing GPU-kernel time by 11%.

The wall-clock improvement on the full pipeline is smaller (4%) because layout detection, tokenisation, and CPU-side postprocessing dominate the end-to-end budget. For LLM-heavy workloads where the VLM decoder is the bottleneck, the kernel-level savings should translate more directly into throughput.
Reproducing
Does this change numerical precision?

No. All three changes are ROCm/HIP-specific:

- the `layer_norm` BF16 registration follows the existing FP16 strategy of promoting mean / variance to FP32;
- skipping the two fusion passes on HIP is equivalent to what PaddleX previously achieved by calling `delete_pass`.

CUDA behavior is fully preserved.
Closes #78710.