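We should align it with bmm, mm, conv, etc. Currently, a bf16 matmul does not accumulate in f32, which is almost certainly incorrect: bf16 keeps only 8 significant bits, so a bf16 accumulator loses precision quickly as the reduction dimension grows.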
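
To illustrate why the accumulator dtype matters, here is a minimal pure-Python sketch. It is not the code under review: `to_bf16`, `to_f32`, `dot_bf16_acc`, and `dot_f32_acc` are hypothetical helpers that emulate bf16 rounding (round-to-nearest-even on the top 16 bits of an f32) for a single dot product, assuming finite inputs well away from overflow.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bf16: keep the top 16 bits of the f32 encoding,
    rounding to nearest even. Assumes finite, non-overflowing input."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_bf16_acc(a, b):
    """Dot product with the accumulator rounded to bf16 after every step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_bf16(acc + to_bf16(x * y))
    return acc

def dot_f32_acc(a, b):
    """Dot product accumulated in f32, rounded to bf16 only at the end."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + x * y)
    return to_bf16(acc)

K = 512
a = [to_bf16(1.0)] * K
b = [to_bf16(1.0)] * K

bf16_result = dot_bf16_acc(a, b)
f32_result = dot_f32_acc(a, b)
print(bf16_result, f32_result)
```

With 512 ones, the bf16 accumulator stalls at 256.0 (256 + 1 is a tie that rounds back to 256), while the f32 accumulator gives the exact 512.0. A real matmul kernel differs in reduction order and tiling, but the same effect applies along the K dimension.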