Performance difference for Quantized Matmul between v2.6.3 and v3.x #4663

@SriAlavandar

Description

I am benchmarking quantized matmul with benchdnn on oneDNN v2.6.3 and oneDNN v3.10.2. When running with a single thread (OMP_NUM_THREADS=1), v2.6.3 performs 10-12% better than v3.10.2.
Data type combinations tested: u8:s8:u8 and u8:s8:f32

Here are the commands I am using for this experiment:
v2.6.3:
numactl --physcpubind=0 --interleave=0 tests/benchdnn/benchdnn --matmul --mode=P --cfg=u8s8u8 --bia_dt=f32 --stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 --attr-oscale=per_oc:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt
v3.10.2:
numactl --physcpubind=0 --interleave=0 tests/benchdnn/benchdnn --matmul --mode=P --dt=u8:s8:u8 --bia-dt=f32 --stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 --attr-scales=src:common:1.5+wei:per_oc+dst:common:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt

Here are some sample shape and data type combinations and the timings measured for each:

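| M | K | N | Dtype | oneDNN v2.6.3 | oneDNN v3.10.2 | v2.6.3 / v3.10.2 |
|---|---|---|-------|---------------|----------------|------------------|
| 200 | 13 | 512 | u8:s8:u8 | 0.011 | 0.012 | 0.92 |
| 200 | 512 | 256 | u8:s8:u8 | 0.054 | 0.056 | 0.96 |
| 200 | 256 | 128 | u8:s8:f32 | 0.014 | 0.015 | 0.93 |
| 300 | 13 | 512 | u8:s8:u8 | 0.016 | 0.018 | 0.89 |
| 300 | 512 | 256 | u8:s8:u8 | 0.081 | 0.083 | 0.98 |
| 300 | 256 | 128 | u8:s8:f32 | 0.021 | 0.022 | 0.95 |
| 400 | 13 | 512 | u8:s8:u8 | 0.021 | 0.024 | 0.88 |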
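
For reference, the ratio column can be recomputed from the two timing columns. The following is a minimal sketch that uses only the rounded values from the table above; the names `rows`, `extra_time_pct`, and `results` are illustrative and not part of benchdnn. Because the inputs are already rounded to three decimals, the recomputed ratios may differ from the table in the last digit.

```python
# Timings copied from the table above:
# (M, K, N, dtype, time on v2.6.3, time on v3.10.2)
rows = [
    (200, 13, 512, "u8:s8:u8", 0.011, 0.012),
    (200, 512, 256, "u8:s8:u8", 0.054, 0.056),
    (200, 256, 128, "u8:s8:f32", 0.014, 0.015),
    (300, 13, 512, "u8:s8:u8", 0.016, 0.018),
    (300, 512, 256, "u8:s8:u8", 0.081, 0.083),
    (300, 256, 128, "u8:s8:f32", 0.021, 0.022),
    (400, 13, 512, "u8:s8:u8", 0.021, 0.024),
]


def extra_time_pct(t_old, t_new):
    """Percent of additional time t_new takes relative to t_old."""
    return (t_new / t_old - 1.0) * 100.0


results = []
for m, k, n, dtype, t_v2, t_v3 in rows:
    ratio = t_v2 / t_v3  # below 1.0 means v2.6.3 is faster
    extra = extra_time_pct(t_v2, t_v3)
    results.append((m, k, n, dtype, ratio, extra))
    print(f"{m}x{k}x{n} {dtype}: ratio={ratio:.3f}, v3.10.2 extra time={extra:.1f}%")
```

A ratio below 1.0 means v2.6.3 took less time for that shape. Note that the ratio and the extra-time percentage are different measures of the same gap, so they should not be compared directly.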

Here are the sample logs of these two experiments for reference:
v2.6.3:
onednn_verbose,exec,cpu,matmul,brg:avx512_core_vnni,undef,src_u8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f8:zpm2 bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,attr-oscale:2 attr-zero-points:src:0:1+dst:0:1 attr-post-ops:eltwise_relu ,,300x13:13x512:300x512,0.0158691

v3.10.2:
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8::blocked:ab::f0 wei:s8:ap:blocked:BA16a64b4a::f8:zpm2 bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_relu,,300x13:13x512,0.0180664
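
To compare many such log lines without copying numbers by hand, the execution time can be pulled from the end of each line. This is a rough sketch that assumes the last comma-separated field is the execution time (reported in milliseconds by oneDNN verbose mode), which holds for both lines above; the helper name `exec_time_ms` is hypothetical.

```python
# The two verbose lines quoted above, copied verbatim.
LINE_V2 = (
    "onednn_verbose,exec,cpu,matmul,brg:avx512_core_vnni,undef,"
    "src_u8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f8:zpm2 "
    "bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,"
    "attr-oscale:2 attr-zero-points:src:0:1+dst:0:1 "
    "attr-post-ops:eltwise_relu ,,300x13:13x512:300x512,0.0158691"
)
LINE_V3 = (
    "onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,"
    "src:u8::blocked:ab::f0 wei:s8:ap:blocked:BA16a64b4a::f8:zpm2 "
    "bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,"
    "attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 "
    "attr-zero-points:src0:0:s32+dst:0:s32 "
    "attr-post-ops:eltwise_relu,,300x13:13x512,0.0180664"
)


def exec_time_ms(verbose_line):
    """Return the trailing field of a verbose line as a float.

    Assumption: the execution time is the last comma-separated field.
    """
    return float(verbose_line.strip().rsplit(",", 1)[1])


t_v2 = exec_time_ms(LINE_V2)
t_v3 = exec_time_ms(LINE_V3)
print(f"v2.6.3: {t_v2} ms, v3.10.2: {t_v3} ms, ratio: {t_v2 / t_v3:.3f}")
```

Splitting from the right avoids depending on the number of leading fields, which differs between the v2.6.3 and v3.10.2 formats.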

Questions:
a. Is this expected behavior due to internal changes between v2.6 and v3.x?
