FI has fp8 and fp4 gemm implementations, but there is no bf16 one.
The original issue was found in vLLM and is described in vllm-project/vllm#27173.
In short, `torch.nn.functional.linear` is not optimal for small batch sizes; the PyTorch team said it just calls cuBLAS.
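For reference, a minimal timing sketch of the kind of measurement involved. Shapes here are illustrative only; the actual script and numbers are in vllm-project/vllm#27173:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: small M (batch), as in decode-time gemms.
M, K, N = 4, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")

def bench_ms(fn, iters=100, warmup=10):
    """Average GPU time per call in milliseconds, using CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print(f"F.linear (bf16, M={M}): {bench_ms(lambda: F.linear(x, w)):.4f} ms")
```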
It makes sense to support bf16 gemm and tune across cuBLAS, CUTLASS, cuDNN, and the internal FI implementation, as is done for the fp8 and fp4 cases.
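A hedged sketch of what such per-shape tuning could look like. None of these names are real FI APIs, and both candidates below route to cuBLAS today, so this only demonstrates the dispatch mechanics; real backend slots would be cuBLAS, CUTLASS, cuDNN, and an internal kernel:

```python
import torch

# Per-shape backend cache, keyed by (M, N, K).
_best: dict[tuple[int, int, int], str] = {}

# Placeholder candidates standing in for real backends.
CANDIDATES = {
    "matmul": lambda x, w: x @ w.t(),
    "linear": torch.nn.functional.linear,
}

def _time_ms(fn, x, w, iters=50, warmup=5):
    # Time one candidate with CUDA events after a short warmup.
    for _ in range(warmup):
        fn(x, w)
    s = torch.cuda.Event(enable_timing=True)
    e = torch.cuda.Event(enable_timing=True)
    s.record()
    for _ in range(iters):
        fn(x, w)
    e.record()
    torch.cuda.synchronize()
    return s.elapsed_time(e) / iters

def tuned_bf16_gemm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Pick the fastest candidate for this shape once, then reuse it."""
    key = (x.shape[0], w.shape[0], w.shape[1])
    if key not in _best:
        _best[key] = min(CANDIDATES, key=lambda n: _time_ms(CANDIDATES[n], x, w))
    return CANDIDATES[_best[key]](x, w)
```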
Performance results and the measurement script are in vllm-project/vllm#27173.