
feat(HIP): register bfloat16 kernels for conv2d/conv3d/depthwise_conv2d on ROCm#78587

Merged
luotao1 merged 3 commits into PaddlePaddle:develop from fchange:feat/hip-bf16-conv-kernel
Apr 13, 2026

Conversation

@fchange
Contributor

@fchange fchange commented Apr 4, 2026

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

This PR registers phi::bfloat16 convolution kernels for the HIP/ROCm backend so that BF16 vision models can run on AMD GPUs without hitting "kernel not registered" errors on convolution ops.

Changes in this PR:

  • Add phi::bfloat16 to HIP kernel registration for conv2d, conv3d, and depthwise_conv2d in paddle/phi/kernels/gpudnn/conv_kernel.cu
  • Add phi::bfloat16 to HIP kernel registration for conv2d_grad, conv3d_grad, conv2d_double_grad, conv3d_double_grad, and depthwise_conv2d_double_grad in paddle/phi/kernels/gpudnn/conv_grad_kernel.cu
  • Add a HIP BF16 regression test focused on minimal convolution smoke coverage
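For context, phi kernels are registered per backend via the `PD_REGISTER_KERNEL` macro, and this change extends the dtype list compiled under the HIP branch. A sketch of the shape of the change (the kernel symbol `phi::ConvCudnnKernel` and the exact dtype spellings are assumptions; the authoritative argument lists are in the PR diff):

```cpp
// Illustrative sketch of the HIP registration in
// paddle/phi/kernels/gpudnn/conv_kernel.cu; names are not copied from the diff.
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d,               // op name
                   GPUDNN,               // backend (MIOpen on ROCm)
                   ALL_LAYOUT,
                   phi::ConvCudnnKernel,
                   float,
                   phi::dtype::float16,
                   phi::dtype::bfloat16) {}  // <- the newly registered dtype
#endif
```

The grad, double-grad, and depthwise variants follow the same pattern: add the BF16 dtype to each macro's trailing type list.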

Follow-up adjustment:

  • Narrow test_hip_bf16_conv_kernel.py to minimal conv2d and grouped-conv smoke tests
  • Remove larger BF16 operator-chain coverage from this PR so Linux-DCU validates kernel registration only and does not gate on unrelated BF16 op behavior

Manual verification used for the original change:

  • Reproduced the pre-fix error: kernel conv2d not registered for BF16 on HIP
  • Verified the patched Paddle build allows BF16 convolution to execute on AMD GPU
  • Verified PaddleX PaddleOCR-VL native backend inference can proceed past the BF16 convolution registration failure on AMD GPU

Does this introduce precision changes?

@paddle-bot

paddle-bot bot commented Apr 4, 2026

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@CLAassistant

CLAassistant commented Apr 4, 2026

CLA assistant check
All committers have signed the CLA.

@paddle-bot paddle-bot bot added the contributor External developers label Apr 4, 2026
fchange added a commit to fchange/PaddleX that referenced this pull request Apr 4, 2026
Remove _keep_in_fp32_modules = ["visual", "mlp_AR"] from
PaddleOCRVLForConditionalGeneration. This workaround was added to
avoid MIOpen BF16 convolution bugs on ROCm 7.0 by forcing the visual
encoder to FP32, which doubled VRAM usage and reduced throughput.

The Paddle framework now registers BF16 conv kernels for HIP backend,
making this workaround unnecessary.

See: PaddlePaddle/Paddle#78587

Signed-off-by: fchange

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
yongqiangma previously approved these changes Apr 7, 2026
Contributor

@yongqiangma yongqiangma left a comment


LGTM

…2d on HIP

Add phi::bfloat16 to PD_REGISTER_KERNEL macros for HIP (ROCm) backend:
- conv2d, conv3d, depthwise_conv2d (forward)
- conv2d_grad, conv3d_grad (backward)
- conv2d_double_grad, conv3d_double_grad, depthwise_conv2d_double_grad

This enables BF16 precision inference for vision encoders (e.g., SigLIP
in PaddleOCR-VL) on AMD GPUs. Previously only float and float16 were
registered for HIP, causing RuntimeError when BF16 models attempted
convolution operations.

Also adds test_hip_bf16_conv_kernel.py to verify BF16 conv kernel
registration on HIP/ROCm platforms.

Fixes: conv2d BF16 kernel not registered on HIP
Signed-off-by: fchange

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
@luotao1
Contributor

luotao1 commented Apr 9, 2026

The other CI failures can be ignored; the DCU test needs to be fixed.

The ToCudnnDataType function in miopen_desc.h was missing a case for
DataType::BFLOAT16, causing it to fall through and return miopenFloat
(FP32) instead of miopenBFloat16 for BF16 tensors.

This led to MIOPEN using FP32 tensor descriptors for BF16 data, which
produced NaN output in conv2d/conv3d operations and caused the DCU CI
test_hip_bf16_conv_kernel to fail.

Fixes: test_hip_bf16_conv_kernel failure on Linux-DCU CI.
@fchange
Contributor Author

fchange commented Apr 11, 2026

> The other CI failures can be ignored; the DCU test needs to be fixed.

This has been fixed.

@luotao1 luotao1 closed this Apr 13, 2026
@luotao1 luotao1 reopened this Apr 13, 2026
Contributor

@yongqiangma yongqiangma left a comment


LGTM

@luotao1 luotao1 merged commit 7ead32d into PaddlePaddle:develop Apr 13, 2026
114 of 116 checks passed
@luotao1
Contributor

luotao1 commented Apr 13, 2026

hi, @fchange

  • Thank you very much for contributing to PaddlePaddle. We run an organization called PFCC, the PaddlePaddle open-source contributors club; only developers who have had code merged into Paddle may join. The club holds a regular meeting every two weeks (attend as interested) and occasionally hosts offline meetups; see the https://github.com/luotao1 profile page for details.
  • If you are interested in PFCC, please send an email to ext_paddle_oss@baidu.com and we will invite you to join.

