[MLAS Feature Request] MatMulNBits faster implementation for fp16 input dtype #27251

Description

@qti-ashimaj

Describe the feature request

Please add support for the HQNBIT_CompInt8 computation path in MatMulNBits for the fp16 input datatype. Currently, MatMulNBits is much slower (~6x) for fp16 than for fp32 on CPU.

Describe scenario use case

For the FP16 input dtype, MatMulNBits always falls back to HQNBIT_CompFp16, because no implementation exists for the HQNBIT_CompInt8 compute type.
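A minimal benchmark sketch of the fp16-vs-fp32 gap, assuming the `com.microsoft` MatMulNBits contrib-op spec (4-bit weights packed as a `[N, K/block_size, block_size/2]` uint8 tensor, `accuracy_level=4` requesting int8 compute). The shapes, block size, and iteration counts below are illustrative, not from the issue:

```python
import time

import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper
import onnxruntime as ort

# Hypothetical GEMM shape and 4-bit block-quantization parameters.
M, K, N = 1, 4096, 4096
BITS, BLOCK_SIZE = 4, 32
n_blocks = K // BLOCK_SIZE
blob_size = BLOCK_SIZE * BITS // 8  # bytes of packed weights per block


def make_model(elem_type: int, np_dtype) -> onnx.ModelProto:
    """Build a one-node MatMulNBits model with random packed weights."""
    rng = np.random.default_rng(0)
    a = helper.make_tensor_value_info("A", elem_type, [M, K])
    y = helper.make_tensor_value_info("Y", elem_type, [M, N])
    b = numpy_helper.from_array(
        rng.integers(0, 256, (N, n_blocks, blob_size), dtype=np.uint8), "B")
    scales = numpy_helper.from_array(
        rng.random(N * n_blocks).astype(np_dtype), "scales")
    node = helper.make_node(
        "MatMulNBits", ["A", "B", "scales"], ["Y"], domain="com.microsoft",
        K=K, N=N, bits=BITS, block_size=BLOCK_SIZE,
        accuracy_level=4)  # 4 requests int8 compute (CompInt8)
    graph = helper.make_graph([node], "matmulnbits_bench", [a], [y], [b, scales])
    return helper.make_model(graph, opset_imports=[
        helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)])


def bench(model: onnx.ModelProto, np_dtype) -> float:
    """Average per-run latency on the CPU execution provider."""
    sess = ort.InferenceSession(model.SerializeToString(),
                                providers=["CPUExecutionProvider"])
    x = np.random.rand(M, K).astype(np_dtype)
    for _ in range(10):  # warm-up
        sess.run(None, {"A": x})
    t0 = time.perf_counter()
    for _ in range(100):
        sess.run(None, {"A": x})
    return (time.perf_counter() - t0) / 100


t32 = bench(make_model(TensorProto.FLOAT, np.float32), np.float32)
t16 = bench(make_model(TensorProto.FLOAT16, np.float16), np.float16)
print(f"fp32: {t32 * 1e3:.2f} ms/run, fp16: {t16 * 1e3:.2f} ms/run "
      f"({t16 / t32:.1f}x slower)")
```

With fp32 input, `accuracy_level=4` should take the int8 compute path, while with fp16 input it falls back to HQNBIT_CompFp16 as described above, so the printed ratio reflects the gap this request is about.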
