Performance difference for Quantized Matmul between v2.6.3 and v3.x #4663

I am benchmarking OneDNN v2.6.3 against OneDNN v3.10.2 with benchdnn. When running with a single thread (**OMP_NUM_THREADS=1**), v2.6.3 is up to 10-12% faster than v3.10.2 for quantized matmul.
Data type combinations tested: u8:s8:u8 and u8:s8:f32.

Here are the commands I am using for this experiment:
**_v2.6.3:_**
`numactl --physcpubind=0 --interleave=0 tests/benchdnn/benchdnn --matmul --mode=P --cfg=u8s8u8 --bia_dt=f32 --stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 --attr-oscale=per_oc:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt`
**_v3.10.2:_**
`numactl --physcpubind=0 --interleave=0 tests/benchdnn/benchdnn --matmul --mode=P --dt=u8:s8:u8 --bia-dt=f32 --stag=ab --wtag=any --dtag=ab --fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 --attr-scales=src:common:1.5+wei:per_oc+dst:common:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt`
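The two command lines are not option-for-option identical, so it may help to list exactly where they differ before comparing timings. The sketch below is only an illustration: the helper names `parse_flags` and `diff_flags` are hypothetical, and it assumes that `--bia_dt` and `--bia-dt` are two spellings of the same option. It does not run benchdnn; it only compares the argument strings copied from above.

```python
import shlex

# Arguments copied from the two commands above (numactl prefix and binary path omitted).
V2_ARGS = (
    "--matmul --mode=P --cfg=u8s8u8 --bia_dt=f32 --stag=ab --wtag=any --dtag=ab "
    "--fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 "
    "--attr-oscale=per_oc:2.5 --attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt"
)
V3_ARGS = (
    "--matmul --mode=P --dt=u8:s8:u8 --bia-dt=f32 --stag=ab --wtag=any --dtag=ab "
    "--fix-times-per-prb=10000 --attr-zero-points=src:common:1+dst:common:1 "
    "--attr-scales=src:common:1.5+wei:per_oc+dst:common:2.5 "
    "--attr-post-ops='eltwise_relu' --batch=input_relu_u8.txt"
)


def parse_flags(args):
    """Map each '--name=value' option to its value; bare flags map to ''."""
    flags = {}
    for token in shlex.split(args):
        name, _, value = token.lstrip("-").partition("=")
        # Assumption: '_' and '-' spellings refer to the same option.
        flags[name.replace("_", "-")] = value
    return flags


def diff_flags(old, new):
    """Return {name: (old_value, new_value)} for options that differ (None = absent)."""
    names = sorted(set(old) | set(new))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


differences = diff_flags(parse_flags(V2_ARGS), parse_flags(V3_ARGS))
for name, (old, new) in differences.items():
    print(f"{name}: v2.6.3={old!r} v3.10.2={new!r}")
```

Besides the renamed data type option, the only difference is the scaling attribute: the v2.6.3 run uses a single per-output-channel `--attr-oscale`, while the v3.10.2 run sets separate source, weights, and destination scales through `--attr-scales`. Whether this contributes to the gap is not established here.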

Below are some sample problem sizes and the execution times measured for each version:

M | K | N | Dtype | OneDNN v2.6.3 (ms) | OneDNN v3.10.2 (ms) | v2.6.3 / v3.10.2
-- | -- | -- | -- | -- | -- | --
200 | 13 | 512 | u8:s8:u8 | 0.011 | 0.012 | 0.92
200 | 512 | 256 | u8:s8:u8 | 0.054 | 0.056 | 0.96
200 | 256 | 128 | u8:s8:f32 | 0.014 | 0.015 | 0.93
300 | 13 | 512 | u8:s8:u8 | 0.016 | 0.018 | 0.89
300 | 512 | 256 | u8:s8:u8 | 0.081 | 0.083 | 0.98
300 | 256 | 128 | u8:s8:f32 | 0.021 | 0.022 | 0.95
400 | 13 | 512 | u8:s8:u8 | 0.021 | 0.024 | 0.88
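For clarity on how the last column relates to a slowdown percentage, here is a minimal sketch that recomputes the ratio and the relative increase in time from the table values. The helper `slowdown_percent` is a hypothetical name, and because the table times are rounded to three decimals, the percentages are only approximate.

```python
# (M, K, N, dtype, v2.6.3 time, v3.10.2 time, reported ratio) copied from the table above.
ROWS = [
    (200, 13, 512, "u8:s8:u8", 0.011, 0.012, 0.92),
    (200, 512, 256, "u8:s8:u8", 0.054, 0.056, 0.96),
    (200, 256, 128, "u8:s8:f32", 0.014, 0.015, 0.93),
    (300, 13, 512, "u8:s8:u8", 0.016, 0.018, 0.89),
    (300, 512, 256, "u8:s8:u8", 0.081, 0.083, 0.98),
    (300, 256, 128, "u8:s8:f32", 0.021, 0.022, 0.95),
    (400, 13, 512, "u8:s8:u8", 0.021, 0.024, 0.88),
]


def slowdown_percent(t_old, t_new):
    """Percent by which the newer time exceeds the older time."""
    return (t_new - t_old) / t_old * 100.0


for m, k, n, dtype, t_v2, t_v3, _reported in ROWS:
    ratio = t_v2 / t_v3  # below 1.0 means v2.6.3 took less time
    print(f"{m}x{k}x{n} {dtype}: ratio={ratio:.2f}, "
          f"v3.10.2 slower by {slowdown_percent(t_v2, t_v3):.1f}%")
```

The largest gaps appear for the K=13 shapes, while the K=512 shapes differ by only a few percent.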


Here are the sample logs of these two experiments for reference:
**_v2.6.3:_**
`onednn_verbose,exec,cpu,matmul,brg:avx512_core_vnni,undef,src_u8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f8:zpm2 bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,attr-oscale:2 attr-zero-points:src:0:1+dst:0:1 attr-post-ops:eltwise_relu ,,300x13:13x512:300x512,0.0158691`

**_v3.10.2:_**
`onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:u8::blocked:ab::f0 wei:s8:ap:blocked:BA16a64b4a::f8:zpm2 bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_relu,,300x13:13x512,0.0180664`
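The two versions emit slightly different verbose layouts (v3.10.2 adds the `v1,primitive` fields), which makes side-by-side comparison by eye error-prone. The following sketch parses both sample lines. It assumes the fields after `exec` are engine, primitive kind, and implementation name, and that the final field is the execution time in milliseconds, as in the two lines shown above; `VerboseRecord` and `parse_verbose_line` are hypothetical names, not part of oneDNN.

```python
from typing import NamedTuple


class VerboseRecord(NamedTuple):
    primitive: str
    implementation: str
    problem: str
    time_ms: float


def parse_verbose_line(line):
    """Extract a few fields from one onednn_verbose 'exec' line."""
    fields = line.strip().strip("`").split(",")
    if fields[0] != "onednn_verbose" or "exec" not in fields:
        raise ValueError("not an onednn_verbose exec line")
    i = fields.index("exec")
    # Assumed layout after 'exec': engine, primitive kind, implementation, ...
    return VerboseRecord(
        primitive=fields[i + 2],
        implementation=fields[i + 3],
        problem=fields[-2],          # problem descriptor, second to last field
        time_ms=float(fields[-1]),   # execution time, last field
    )


V2_LINE = ("onednn_verbose,exec,cpu,matmul,brg:avx512_core_vnni,undef,"
           "src_u8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f8:zpm2 "
           "bia_f32::blocked:ab:f0_mask2 dst_u8::blocked:ab:f0,"
           "attr-oscale:2 attr-zero-points:src:0:1+dst:0:1 attr-post-ops:eltwise_relu ,,"
           "300x13:13x512:300x512,0.0158691")
V3_LINE = ("onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_vnni,undef,"
           "src:u8::blocked:ab::f0 wei:s8:ap:blocked:BA16a64b4a::f8:zpm2 "
           "bia:f32:a:blocked:ab::f0_mask2 dst:u8::blocked:ab::f0,"
           "attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32 "
           "attr-post-ops:eltwise_relu,,300x13:13x512,0.0180664")

v2 = parse_verbose_line(V2_LINE)
v3 = parse_verbose_line(V3_LINE)
print(v2.implementation, v2.time_ms)
print(v3.implementation, v3.time_ms)
print(f"relative increase: {(v3.time_ms - v2.time_ms) / v2.time_ms * 100:.1f}%")
```

Both lines report a brgemm-based AVX-512 VNNI implementation with the same blocked weights layout, so the samples do not point to a different kernel family being selected.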

Questions:
a. Is this slowdown expected behavior, caused by internal changes between v2.6 and v3.x?
