Issue Description
Problem
On AMD GPU + ROCm (HIP backend), convolution kernels (conv2d, conv3d, depthwise_conv2d) are not registered for bfloat16 precision. This causes a runtime error when attempting to run BF16 models with convolution operations:
RuntimeError: (NotFound) The kernel with key (GPU, Undefined(AnyLayout), bfloat16) of kernel `conv2d` is not registered and fail to fallback to CPU one. Selected wrong DataType `bfloat16`. Paddle support following DataTypes: float64, float32.
[Hint: Expected kernel_iter != iter->second.end(), but received kernel_iter == iter->second.end().] (at /work/Paddle/paddle/phi/core/kernel_factory.cc:380)
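The error comes from PHI's kernel factory failing to find a (backend, layout, dtype) entry for the requested kernel and then failing to fall back to a CPU registration. A toy Python model of that lookup (illustrative only; the class and method names here are hypothetical, not Paddle's actual C++ `KernelFactory` in `paddle/phi/core/kernel_factory.cc`) shows why the error lists only the registered dtypes:

```python
# Toy model of a PHI-style kernel registry keyed by (backend, layout, dtype).
# Illustrative only: Paddle's real kernel factory lives in C++.

class KernelNotFound(RuntimeError):
    pass

class ToyKernelFactory:
    def __init__(self):
        # kernel name -> {(backend, layout, dtype): implementation}
        self._kernels = {}

    def register(self, name, backend, layout, dtype, impl):
        self._kernels.setdefault(name, {})[(backend, layout, dtype)] = impl

    def select(self, name, backend, layout, dtype):
        per_op = self._kernels.get(name, {})
        # Try the requested backend first, then the CPU fallback.
        for be in (backend, "CPU"):
            impl = per_op.get((be, layout, dtype))
            if impl is not None:
                return impl
        supported = sorted({key[2] for key in per_op})
        raise KernelNotFound(
            f"kernel `{name}` not registered for ({backend}, {layout}, {dtype}) "
            f"and no CPU fallback; registered dtypes: {', '.join(supported)}")

factory = ToyKernelFactory()
# Mirrors the HIP registrations: only float/float16 on GPU, fp32/fp64 on CPU.
factory.register("conv2d", "GPU", "ALL_LAYOUT", "float32", lambda: "gpu_fp32")
factory.register("conv2d", "GPU", "ALL_LAYOUT", "float16", lambda: "gpu_fp16")
factory.register("conv2d", "CPU", "ALL_LAYOUT", "float32", lambda: "cpu_fp32")
factory.register("conv2d", "CPU", "ALL_LAYOUT", "float64", lambda: "cpu_fp64")

try:
    factory.select("conv2d", "GPU", "ALL_LAYOUT", "bfloat16")
except KernelNotFound as e:
    print(e)  # no GPU entry and no CPU fallback for bfloat16
```

Registering bfloat16 on the GPU backend (the proposed fix) makes the first lookup succeed, which is why no CPU-side change is needed.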
Root Cause
In paddle/phi/kernels/gpudnn/conv_kernel.cu, the HIP branch of PD_REGISTER_KERNEL only registers float and phi::float16:
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d, GPUDNN, ALL_LAYOUT, phi::ConvCudnnKernel, float, phi::float16) {}
PD_REGISTER_KERNEL(conv3d, GPUDNN, ALL_LAYOUT, phi::Conv3DCudnnKernel, float, phi::float16) {}
PD_REGISTER_KERNEL(depthwise_conv2d, GPUDNN, ALL_LAYOUT, phi::DepthwiseConvCudnnKernel, float, phi::float16) {}
#else
// CUDA branch already includes phi::bfloat16 for CUDNN_VERSION_MIN(8, 1, 0)
#endif
The same issue exists in paddle/phi/kernels/gpudnn/conv_grad_kernel.cu for backward kernels (conv2d_grad, conv3d_grad, etc.).
At the end of conv_kernel.cu there's even a comment: // todo register bfloat16.
Impact
This significantly impacts BF16 inference on AMD GPUs:
- PaddleOCR-VL-1.5: the vision encoder (SigLIP) uses conv2d, so the model cannot run in BF16 on AMD GPUs
- Memory overhead: FP32 fallback doubles VRAM usage compared to BF16
- Performance loss: Cannot leverage AMD GPU BF16 compute throughput (MI300X has native BF16 Tensor Core support)
- Ecosystem: Affects all LLMs and multimodal models using BF16 on ROCm
Reproduction
Environment: AMD GPU (MI300X), ROCm 7.0.51, PaddlePaddle 3.4.0.dev20260123
import paddle
print("ROCm:", paddle.is_compiled_with_rocm()) # True
paddle.set_device("gpu")
x = paddle.randn([1, 3, 64, 64]).astype("bfloat16")
w = paddle.randn([8, 3, 3, 3]).astype("bfloat16")
y = paddle.nn.functional.conv2d(x, w) # RuntimeError: kernel not registered
Current Workaround
Downstream projects (e.g., PaddleX) work around this by:
- Returning False from is_bfloat16_available() on ROCm in paddlex/inference/utils/misc.py
- Forcing the entire visual encoder to FP32 via _keep_in_fp32_modules = ["visual", "mlp_AR"]
- Disabling conv2d fusion passes (conv2d_add_act_fuse_pass, conv2d_add_fuse_pass)
Proposed Fix
Add phi::bfloat16 to the HIP kernel registration macros in:
paddle/phi/kernels/gpudnn/conv_kernel.cu: conv2d, conv3d, depthwise_conv2d
paddle/phi/kernels/gpudnn/conv_grad_kernel.cu: conv2d_grad, conv3d_grad, conv2d_double_grad, conv3d_double_grad, depthwise_conv2d_double_grad
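Concretely, the HIP branch of the forward registrations would mirror the CUDA branch by appending phi::bfloat16 to each macro (a sketch of the intended change, not a tested patch; the conv_grad_kernel.cu registrations follow the same pattern):

```cpp
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d, GPUDNN, ALL_LAYOUT, phi::ConvCudnnKernel, float, phi::float16, phi::bfloat16) {}
PD_REGISTER_KERNEL(conv3d, GPUDNN, ALL_LAYOUT, phi::Conv3DCudnnKernel, float, phi::float16, phi::bfloat16) {}
PD_REGISTER_KERNEL(depthwise_conv2d, GPUDNN, ALL_LAYOUT, phi::DepthwiseConvCudnnKernel, float, phi::float16, phi::bfloat16) {}
#endif
```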
The underlying MIOpen infrastructure already supports BF16 for convolutions (miopenBFloat16 is mapped in miopen_helper.h, and miopenConvolutionForward is called in the HIP path).
Note
Per MIOpen datatypes documentation, miopenBFloat16 is fully supported for convolutions, tensor set, and tensor copy. Non-convolution BF16 ops (softmax, pooling, batchnorm, etc.) may need separate fallback handling but are out of scope for this fix.