FI has fp8 and fp4 gemm implementations, but there is no bf16 one.
The original issue was found in vLLM and is described in vllm-project/vllm#27173.
In short, `torch.nn.functional.linear` is not optimal for small batch sizes; the PyTorch team said it just calls cuBLAS.
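For reference, a minimal timing sketch of the kind of measurement involved. Shapes here are illustrative only; the actual script and numbers are in vllm-project/vllm#27173:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: small M (batch), as in decode-time gemms.
M, K, N = 4, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
w = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")

def bench_ms(fn, iters=100, warmup=10):
    """Average GPU time per call in milliseconds, using CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

print(f"F.linear (bf16, M={M}): {bench_ms(lambda: F.linear(x, w)):.4f} ms")
```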
It makes sense to support bf16 gemm and tune across cuBLAS, CUTLASS, cuDNN, and the internal FI implementation, as is done for the fp8 and fp4 cases.
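A hedged sketch of what such per-shape tuning could look like. None of these names are real FI APIs, and both candidates below route to cuBLAS today, so this only demonstrates the dispatch mechanics; real backend slots would be cuBLAS, CUTLASS, cuDNN, and an internal kernel:

```python
import torch

# Per-shape backend cache, keyed by (M, N, K).
_best: dict[tuple[int, int, int], str] = {}

# Placeholder candidates standing in for real backends.
CANDIDATES = {
    "matmul": lambda x, w: x @ w.t(),
    "linear": torch.nn.functional.linear,
}

def _time_ms(fn, x, w, iters=50, warmup=5):
    # Time one candidate with CUDA events after a short warmup.
    for _ in range(warmup):
        fn(x, w)
    s = torch.cuda.Event(enable_timing=True)
    e = torch.cuda.Event(enable_timing=True)
    s.record()
    for _ in range(iters):
        fn(x, w)
    e.record()
    torch.cuda.synchronize()
    return s.elapsed_time(e) / iters

def tuned_bf16_gemm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Pick the fastest candidate for this shape once, then reuse it."""
    key = (x.shape[0], w.shape[0], w.shape[1])
    if key not in _best:
        _best[key] = min(CANDIDATES, key=lambda n: _time_ms(CANDIDATES[n], x, w))
    return CANDIDATES[_best[key]](x, w)
```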
Performance results and the measurement script are in vllm-project/vllm#27173.