Description
I am benchmarking OneDNN v2.6.3 against OneDNN v3.10.2 using benchdnn. When running with a single thread (OMP_NUM_THREADS=1), v2.6.3 is up to 10-12% faster than v3.10.2; the gap varies with the problem shape.
Data type combinations tested: u8:s8:u8 and u8:s8:f32
Here are the commands I am using for this experiment:
v2.6.3:
numactl --physcpubind=0 --interleave=0 tests/benchdnn/benchdnn --matmul --mode=P --cfg=u8s8u8 --bia_dt=f32 --stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 --attr-oscale=per_oc:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt
v3.10.2:
numactl --physcpubind=0 --interleave=0 tests/benchdnn/benchdnn --matmul --mode=P --dt=u8:s8:u8 --bia-dt=f32 --stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 --attr-scales=src:common:1.5+wei:per_oc+dst:common:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt
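Since the option syntax changed between the two benchdnn versions, it can help to list exactly which flags differ between the two runs. The following is a minimal sketch, not part of benchdnn: it only does a textual comparison of the two command lines above (the `numactl` prefix is omitted because it is identical), and treating `_` and `-` in option names as equivalent is an assumption made purely for the comparison.

```python
import shlex

# benchdnn command lines copied from above (numactl prefix omitted).
V2_CMD = (
    "tests/benchdnn/benchdnn --matmul --mode=P --cfg=u8s8u8 --bia_dt=f32 "
    "--stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 "
    "--attr-zero-points=src:common:1+dst:common:1 --attr-oscale=per_oc:2.5 "
    "--attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt"
)
V3_CMD = (
    "tests/benchdnn/benchdnn --matmul --mode=P --dt=u8:s8:u8 --bia-dt=f32 "
    "--stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 "
    "--attr-zero-points=src:common:1+dst:common:1 "
    "--attr-scales=src:common:1.5+wei:per_oc+dst:common:2.5 "
    "--attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt"
)


def parse_flags(cmd):
    """Return {option_name: value} for every --option[=value] token."""
    flags = {}
    for token in shlex.split(cmd):
        if token.startswith("--"):
            key, _, value = token[2:].partition("=")
            # Assumption: '_' and '-' in option names mean the same thing.
            flags[key.replace("_", "-")] = value
    return flags


v2_flags = parse_flags(V2_CMD)
v3_flags = parse_flags(V3_CMD)

# Options that appear in only one command, or appear in both with different values.
only_v2 = {k: v for k, v in v2_flags.items() if v3_flags.get(k) != v}
only_v3 = {k: v for k, v in v3_flags.items() if v2_flags.get(k) != v}

print("v2.6.3 only: ", only_v2)
print("v3.10.2 only:", only_v3)
```

The sketch reports only the data type option (`--cfg` vs `--dt`) and the scaling attribute (`--attr-oscale` vs `--attr-scales`) as different. Whether the two scaling configurations are computationally equivalent is not something this comparison can answer.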
Here are some sample shapes and the times measured for each version:
| M | K | N | Dtype | OneDNN v2.6.3 time (ms) | OneDNN v3.10.2 time (ms) | v2.6.3 / v3.10.2 |
|---|---|---|---|---|---|---|
| 200 | 13 | 512 | u8:s8:u8 | 0.011 | 0.012 | 0.92 |
| 200 | 512 | 256 | u8:s8:u8 | 0.054 | 0.056 | 0.96 |
| 200 | 256 | 128 | u8:s8:f32 | 0.014 | 0.015 | 0.93 |
| 300 | 13 | 512 | u8:s8:u8 | 0.016 | 0.018 | 0.89 |
| 300 | 512 | 256 | u8:s8:u8 | 0.081 | 0.083 | 0.98 |
| 300 | 256 | 128 | u8:s8:f32 | 0.021 | 0.022 | 0.95 |
| 400 | 13 | 512 | u8:s8:u8 | 0.021 | 0.024 | 0.88 |
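The ratio column can be turned into a per-shape slowdown percentage. Below is a small sketch that recomputes the ratio and the relative slowdown of v3.10.2 from the times in the table. The times are rounded to three decimals in the table, so the percentages are only approximate, and the helper name `slowdown_pct` is just for illustration.

```python
# Rows copied from the table above: (M, K, N, dtype, v2.6.3 time, v3.10.2 time)
rows = [
    (200, 13, 512, "u8:s8:u8", 0.011, 0.012),
    (200, 512, 256, "u8:s8:u8", 0.054, 0.056),
    (200, 256, 128, "u8:s8:f32", 0.014, 0.015),
    (300, 13, 512, "u8:s8:u8", 0.016, 0.018),
    (300, 512, 256, "u8:s8:u8", 0.081, 0.083),
    (300, 256, 128, "u8:s8:f32", 0.021, 0.022),
    (400, 13, 512, "u8:s8:u8", 0.021, 0.024),
]


def slowdown_pct(t_old, t_new):
    """Percent by which t_new exceeds t_old (positive means t_new is slower)."""
    return (t_new - t_old) / t_old * 100.0


for m, k, n, dtype, t_v2, t_v3 in rows:
    ratio = t_v2 / t_v3
    pct = slowdown_pct(t_v2, t_v3)
    print(f"{m}x{k}x{n} {dtype}: ratio={ratio:.2f}, v3.10.2 slower by {pct:.1f}%")
```

In these samples the largest gaps are on the K=13 shapes, while the K=512 shapes differ by only a few percent.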
Here are the sample logs of these two experiments for reference:
v2.6.3:
onednn_verbose,exec,cpu,matmul,brg:avx512_core_vnni,undef,src_u8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f8:zpm2 bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,attr-oscale:2 attr-zero-points:src:0:1+dst:0:1 attr-post-ops:eltwise_relu ,,300x13:13x512:300x512,0.0158691
v3.10.2:
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8::blocked:ab::f0 wei:s8:ap:blocked:BA16a64b4a::f8:zpm2 bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_relu,,300x13:13x512,0.0180664
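To compare such logs across many shapes, the execution time and implementation name can be pulled out of each verbose line. This is a minimal sketch based only on the two sample lines above: it assumes the time is always the last comma-separated field (reported in milliseconds) and that the implementation name directly follows the `matmul` field, which holds for both lines here even though the v3.10.2 format has extra leading fields.

```python
# The two verbose lines copied from the logs above.
V2_LINE = (
    "onednn_verbose,exec,cpu,matmul,brg:avx512_core_vnni,undef,"
    "src_u8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f8:zpm2 "
    "bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,"
    "attr-oscale:2 attr-zero-points:src:0:1+dst:0:1 attr-post-ops:eltwise_relu ,,"
    "300x13:13x512:300x512,0.0158691"
)
V3_LINE = (
    "onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,"
    "src:u8::blocked:ab::f0 wei:s8:ap:blocked:BA16a64b4a::f8:zpm2 "
    "bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,"
    "attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 "
    "attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_relu,,"
    "300x13:13x512,0.0180664"
)


def exec_time_ms(line):
    """Execution time: the last comma-separated field of a verbose line."""
    return float(line.rsplit(",", 1)[1])


def impl_name(line):
    """Implementation name: the field right after the primitive kind 'matmul'."""
    fields = line.split(",")
    return fields[fields.index("matmul") + 1]


for label, line in (("v2.6.3", V2_LINE), ("v3.10.2", V3_LINE)):
    print(f"{label}: impl={impl_name(line)}, time={exec_time_ms(line)} ms")

print(f"v2.6.3 / v3.10.2 = {exec_time_ms(V2_LINE) / exec_time_ms(V3_LINE):.2f}")
```

Both lines report an `avx512_core_vnni` brgemm-based matmul implementation with the same `BA16a64b4a` weights layout, so the difference in these samples is not explained by a different ISA or weights format being selected.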
Questions:
a. Is this expected behavior due to internal changes between v2.6 and v3.x?