Describe the feature request
Please add support for an HQNBIT_CompInt8 computation path for the fp16 input data type of MatMulNBits. Currently, MatMulNBits on CPU is roughly 6x slower with fp16 inputs than with fp32.
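For context, here is a minimal benchmark sketch along the lines of what I use to reproduce the gap on CPU. The shapes, block_size, and the accuracy_level=4 request are illustrative, and the packed-weight layout follows my reading of the MatMulNBits contrib-op docs:

```python
import time

import numpy as np
import onnxruntime as ort
from onnx import TensorProto, helper


def build_model(elem_type, np_dtype, acc_level, M=1, K=4096, N=4096, block_size=32):
    """Single-node MatMulNBits model with random 4-bit-packed weights."""
    n_blocks = K // block_size
    # 4-bit weights pack two values per byte: (N, n_blocks, block_size // 2) bytes.
    b_data = np.random.randint(0, 256, size=(N, n_blocks, block_size // 2), dtype=np.uint8)
    scales = np.random.rand(N * n_blocks).astype(np_dtype)  # scales share A's dtype
    node = helper.make_node(
        "MatMulNBits", ["A", "B", "scales"], ["Y"], domain="com.microsoft",
        K=K, N=N, bits=4, block_size=block_size,
        accuracy_level=acc_level,  # 4 requests int8 compute where a kernel exists
    )
    graph = helper.make_graph(
        [node], "matmulnbits_bench",
        [helper.make_tensor_value_info("A", elem_type, [M, K])],
        [helper.make_tensor_value_info("Y", elem_type, [M, N])],
        initializer=[
            helper.make_tensor("B", TensorProto.UINT8, b_data.shape, b_data.tobytes(), raw=True),
            helper.make_tensor("scales", elem_type, [scales.size], scales.tobytes(), raw=True),
        ],
    )
    return helper.make_model(graph, opset_imports=[
        helper.make_opsetid("", 21), helper.make_opsetid("com.microsoft", 1)])


def bench(elem_type, np_dtype, acc_level=4, iters=100):
    model = build_model(elem_type, np_dtype, acc_level)
    sess = ort.InferenceSession(model.SerializeToString(),
                                providers=["CPUExecutionProvider"])
    a = np.random.rand(1, 4096).astype(np_dtype)
    sess.run(None, {"A": a})  # warm up
    start = time.perf_counter()
    for _ in range(iters):
        sess.run(None, {"A": a})
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    for name, elem, np_dt in [("fp32", TensorProto.FLOAT, np.float32),
                              ("fp16", TensorProto.FLOAT16, np.float16)]:
        print(f"{name}: {bench(elem, np_dt) * 1e3:.3f} ms/iter")
```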
Describe scenario use case
With an FP16 input dtype, the MatMulNBits computation always falls back to HQNBIT_CompFp16, because there is no implementation of the HQNBIT_CompInt8 compute type.
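One way to observe the fallback from Python, reusing the build_model/bench helpers from the sketch above: if fp16 always lands on HQNBIT_CompFp16, then requesting int8 compute via accuracy_level should not change the fp16 timing, while for fp32 it can engage the int8 path (SQNBIT_CompInt8, as I understand the MLAS naming) and run noticeably faster:

```python
# Reuses build_model/bench from the sketch above. If fp16 falls back to
# HQNBIT_CompFp16 regardless of the request, both fp16 timings should match.
import numpy as np
from onnx import TensorProto

for name, elem, np_dt in [("fp32", TensorProto.FLOAT, np.float32),
                          ("fp16", TensorProto.FLOAT16, np.float16)]:
    for acc in (0, 4):  # 0 = unset, 4 = request int8 compute
        t = bench(elem, np_dt, acc_level=acc)
        print(f"{name}, accuracy_level={acc}: {t * 1e3:.3f} ms/iter")
```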