Issue Description
Problem
On AMD GPU + ROCm (HIP backend), convolution kernels (conv2d, conv3d, depthwise_conv2d) are not registered for bfloat16 precision. This causes a runtime error when attempting to run BF16 models with convolution operations:
RuntimeError: (NotFound) The kernel with key (GPU, Undefined(AnyLayout), bfloat16) of kernel `conv2d` is not registered and fail to fallback to CPU one. Selected wrong DataType `bfloat16`. Paddle support following DataTypes: float64, float32.
[Hint: Expected kernel_iter != iter->second.end(), but received kernel_iter == iter->second.end().] (at /work/Paddle/paddle/phi/core/kernel_factory.cc:380)
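The error comes from PHI's kernel factory failing to find a (backend, layout, dtype) entry for the requested kernel and then failing to fall back to a CPU registration. A toy Python model of that lookup (illustrative only; the class and method names here are hypothetical, not Paddle's actual C++ `KernelFactory` in `paddle/phi/core/kernel_factory.cc`) shows why the error lists only the registered dtypes:

```python
# Toy model of a PHI-style kernel registry keyed by (backend, layout, dtype).
# Illustrative only: Paddle's real kernel factory lives in C++.

class KernelNotFound(RuntimeError):
    pass

class ToyKernelFactory:
    def __init__(self):
        # kernel name -> {(backend, layout, dtype): implementation}
        self._kernels = {}

    def register(self, name, backend, layout, dtype, impl):
        self._kernels.setdefault(name, {})[(backend, layout, dtype)] = impl

    def select(self, name, backend, layout, dtype):
        per_op = self._kernels.get(name, {})
        # Try the requested backend first, then the CPU fallback.
        for be in (backend, "CPU"):
            impl = per_op.get((be, layout, dtype))
            if impl is not None:
                return impl
        supported = sorted({key[2] for key in per_op})
        raise KernelNotFound(
            f"kernel `{name}` not registered for ({backend}, {layout}, {dtype}) "
            f"and no CPU fallback; registered dtypes: {', '.join(supported)}")

factory = ToyKernelFactory()
# Mirrors the HIP registrations: only float/float16 on GPU, fp32/fp64 on CPU.
factory.register("conv2d", "GPU", "ALL_LAYOUT", "float32", lambda: "gpu_fp32")
factory.register("conv2d", "GPU", "ALL_LAYOUT", "float16", lambda: "gpu_fp16")
factory.register("conv2d", "CPU", "ALL_LAYOUT", "float32", lambda: "cpu_fp32")
factory.register("conv2d", "CPU", "ALL_LAYOUT", "float64", lambda: "cpu_fp64")

try:
    factory.select("conv2d", "GPU", "ALL_LAYOUT", "bfloat16")
except KernelNotFound as e:
    print(e)  # no GPU entry and no CPU fallback for bfloat16
```

Registering bfloat16 on the GPU backend (the proposed fix) makes the first lookup succeed, which is why no CPU-side change is needed.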
Root Cause
In paddle/phi/kernels/gpudnn/conv_kernel.cu, the HIP branch of PD_REGISTER_KERNEL only registers float and phi::float16:
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d, GPUDNN, ALL_LAYOUT, phi::ConvCudnnKernel, float, phi::float16) {}
PD_REGISTER_KERNEL(conv3d, GPUDNN, ALL_LAYOUT, phi::Conv3DCudnnKernel, float, phi::float16) {}
PD_REGISTER_KERNEL(depthwise_conv2d, GPUDNN, ALL_LAYOUT, phi::DepthwiseConvCudnnKernel, float, phi::float16) {}
#else
// CUDA branch already includes phi::bfloat16 for CUDNN_VERSION_MIN(8, 1, 0)
#endif
The same issue exists in paddle/phi/kernels/gpudnn/conv_grad_kernel.cu for backward kernels (conv2d_grad, conv3d_grad, etc.).
At the end of conv_kernel.cu there's even a comment: // todo register bfloat16.
Impact
This significantly impacts BF16 inference on AMD GPUs:
- PaddleOCR-VL-1.5: the vision encoder (SigLIP) uses conv2d, so the model cannot run in BF16 on AMD GPUs
- Memory overhead: FP32 fallback doubles VRAM usage compared to BF16
- Performance loss: Cannot leverage AMD GPU BF16 compute throughput (MI300X has native BF16 Tensor Core support)
- Ecosystem: Affects all LLMs and multimodal models using BF16 on ROCm
Reproduction
Environment: AMD GPU (MI300X), ROCm 7.0.51, PaddlePaddle 3.4.0.dev20260123
import paddle
print("ROCm:", paddle.is_compiled_with_rocm()) # True
paddle.set_device("gpu")
x = paddle.randn([1, 3, 64, 64]).astype("bfloat16")
w = paddle.randn([8, 3, 3, 3]).astype("bfloat16")
y = paddle.nn.functional.conv2d(x, w) # RuntimeError: kernel not registered
Current Workaround
Downstream projects (e.g., PaddleX) work around this by:
- Returning False from is_bfloat16_available() on ROCm in paddlex/inference/utils/misc.py
- Forcing the entire visual encoder to FP32 via _keep_in_fp32_modules = ["visual", "mlp_AR"]
- Disabling conv2d fusion passes (conv2d_add_act_fuse_pass, conv2d_add_fuse_pass)
Proposed Fix
Add phi::bfloat16 to the HIP kernel registration macros in:
paddle/phi/kernels/gpudnn/conv_kernel.cu: conv2d, conv3d, depthwise_conv2d
paddle/phi/kernels/gpudnn/conv_grad_kernel.cu: conv2d_grad, conv3d_grad, conv2d_double_grad, conv3d_double_grad, depthwise_conv2d_double_grad
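Concretely, the HIP branch of the forward registrations would mirror the CUDA branch by appending phi::bfloat16 to each macro (a sketch of the intended change, not a tested patch; the conv_grad_kernel.cu registrations follow the same pattern):

```cpp
#ifdef PADDLE_WITH_HIP
PD_REGISTER_KERNEL(conv2d, GPUDNN, ALL_LAYOUT, phi::ConvCudnnKernel, float, phi::float16, phi::bfloat16) {}
PD_REGISTER_KERNEL(conv3d, GPUDNN, ALL_LAYOUT, phi::Conv3DCudnnKernel, float, phi::float16, phi::bfloat16) {}
PD_REGISTER_KERNEL(depthwise_conv2d, GPUDNN, ALL_LAYOUT, phi::DepthwiseConvCudnnKernel, float, phi::float16, phi::bfloat16) {}
#endif
```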
The underlying MIOpen infrastructure already supports BF16 for convolutions (miopenBFloat16 is mapped in miopen_helper.h, and miopenConvolutionForward is called in the HIP path).
Note
Per MIOpen datatypes documentation, miopenBFloat16 is fully supported for convolutions, tensor set, and tensor copy. Non-convolution BF16 ops (softmax, pooling, batchnorm, etc.) may need separate fallback handling but are out of scope for this fix.