
HIP/ROCm: conv2d/conv3d/depthwise_conv2d kernels missing bfloat16 registration #78586

@fchange


Problem

On AMD GPU + ROCm (HIP backend), convolution kernels (conv2d, conv3d, depthwise_conv2d) are not registered for bfloat16 precision. This causes a runtime error when attempting to run BF16 models with convolution operations:

RuntimeError: (NotFound) The kernel with key (GPU, Undefined(AnyLayout), bfloat16) of kernel `conv2d` is not registered and fail to fallback to CPU one. Selected wrong DataType `bfloat16`. Paddle support following DataTypes: float64, float32.
  [Hint: Expected kernel_iter != iter->second.end(), but received kernel_iter == iter->second.end().] (at /work/Paddle/paddle/phi/core/kernel_factory.cc:380)

Root Cause

In paddle/phi/kernels/gpudnn/conv_kernel.cu, the HIP branch of PD_REGISTER_KERNEL only registers float and phi::float16:

#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d, GPUDNN, ALL_LAYOUT, phi::ConvCudnnKernel, float, phi::float16) {}
PD_REGISTER_KERNEL(conv3d, GPUDNN, ALL_LAYOUT, phi::Conv3DCudnnKernel, float, phi::float16) {}
PD_REGISTER_KERNEL(depthwise_conv2d, GPUDNN, ALL_LAYOUT, phi::DepthwiseConvCudnnKernel, float, phi::float16) {}
#else
// The CUDA branch additionally registers phi::bfloat16 when CUDNN_VERSION_MIN(8, 1, 0)
#endif

The same issue exists in paddle/phi/kernels/gpudnn/conv_grad_kernel.cu for backward kernels (conv2d_grad, conv3d_grad, etc.).

At the end of conv_kernel.cu there's even a comment: // todo register bfloat16.

Impact

This significantly impacts BF16 inference on AMD GPUs:

  • PaddleOCR-VL-1.5: The vision encoder (SigLIP) uses conv2d; cannot run in BF16 on AMD GPU
  • Memory overhead: FP32 fallback doubles VRAM usage compared to BF16
  • Performance loss: Cannot leverage AMD GPU BF16 compute throughput (MI300X has native BF16 Matrix Core support)
  • Ecosystem: Affects all LLMs and multimodal models using BF16 on ROCm

Reproduction

Environment: AMD GPU (MI300X), ROCm 7.0.51, PaddlePaddle 3.4.0.dev20260123

import paddle
print("ROCm:", paddle.is_compiled_with_rocm())  # True

paddle.set_device("gpu")
x = paddle.randn([1, 3, 64, 64]).astype("bfloat16")
w = paddle.randn([8, 3, 3, 3]).astype("bfloat16")
y = paddle.nn.functional.conv2d(x, w)  # RuntimeError: kernel not registered

Current Workaround

Downstream projects (e.g., PaddleX) work around this by:

  1. Returning False from is_bfloat16_available() on ROCm in paddlex/inference/utils/misc.py
  2. Forcing the entire visual encoder to FP32 via _keep_in_fp32_modules = ["visual", "mlp_AR"]
  3. Disabling conv2d fusion passes (conv2d_add_act_fuse_pass, conv2d_add_fuse_pass)

Proposed Fix

Add phi::bfloat16 to the HIP kernel registration macros in:

  • paddle/phi/kernels/gpudnn/conv_kernel.cu: conv2d, conv3d, depthwise_conv2d
  • paddle/phi/kernels/gpudnn/conv_grad_kernel.cu: conv2d_grad, conv3d_grad, conv2d_double_grad, conv3d_double_grad, depthwise_conv2d_double_grad

The underlying MIOpen infrastructure already supports BF16 for convolutions (miopenBFloat16 is mapped in miopen_helper.h, and miopenConvolutionForward is called in the HIP path).
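For reference, a sketch of what the HIP branch in conv_kernel.cu could look like after the fix, mirroring the existing CUDA registrations (the conv_grad_kernel.cu registrations would change analogously):

```cpp
// Sketch of the proposed HIP registrations in
// paddle/phi/kernels/gpudnn/conv_kernel.cu: phi::bfloat16 is added to each
// macro; MIOpen maps it to miopenBFloat16 via miopen_helper.h.
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d, GPUDNN, ALL_LAYOUT, phi::ConvCudnnKernel,
                   float, phi::float16, phi::bfloat16) {}
PD_REGISTER_KERNEL(conv3d, GPUDNN, ALL_LAYOUT, phi::Conv3DCudnnKernel,
                   float, phi::float16, phi::bfloat16) {}
PD_REGISTER_KERNEL(depthwise_conv2d, GPUDNN, ALL_LAYOUT,
                   phi::DepthwiseConvCudnnKernel,
                   float, phi::float16, phi::bfloat16) {}
#endif
```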

Note

Per MIOpen datatypes documentation, miopenBFloat16 is fully supported for convolutions, tensor set, and tensor copy. Non-convolution BF16 ops (softmax, pooling, batchnorm, etc.) may need separate fallback handling but are out of scope for this fix.
