[FEAT] Add bf16 gemm #1974

@vadiklyutiy

Description

FI has fp8 and fp4 gemm implementations, but there is no bf16 one.

The original issue was found in vLLM and is described in vllm-project/vllm#27173.

In short, torch.nn.functional.linear is not optimal for small batch sizes; the PyTorch team said that it simply calls cuBLAS.

It makes sense to support bf16 gemm and tune across cuBLAS, CUTLASS, cuDNN, and the internal FI implementation, as is done for the fp8 and fp4 cases.
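The per-shape backend tuning described above can be sketched as a small dispatcher that times each candidate backend for a given (M, N, K) shape and caches the winner. This is a minimal illustration of the idea, not the actual FlashInfer autotuning API; the backend names and the `autotune` helper are placeholders.

```python
import time
from typing import Callable, Dict, Tuple

Shape = Tuple[int, int, int]  # (M, N, K) of the gemm

def autotune(backends: Dict[str, Callable[[Shape], None]],
             shape: Shape,
             cache: Dict[Shape, str],
             iters: int = 3) -> str:
    """Time each backend on `shape` and cache the fastest one.

    Hypothetical sketch: real tuning would run warmup iterations,
    synchronize the GPU, and key the cache per dtype as well.
    """
    if shape in cache:          # reuse a previous tuning decision
        return cache[shape]
    best_name, best_time = None, float("inf")
    for name, run in backends.items():
        start = time.perf_counter()
        for _ in range(iters):
            run(shape)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    cache[shape] = best_name
    return best_name
```

A caller would register one callable per backend (cuBLAS, CUTLASS, cuDNN, internal) and route each bf16 gemm shape through the cached winner, so small-batch shapes can pick a kernel that beats the plain cuBLAS path.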

The performance results and the measurement script are in vllm-project/vllm#27173.
