Releases: uxlfoundation/oneDNN
v3.11
Performance Optimizations
Intel 64/AMD64 Processors
- Improved `fp32` matmul performance with `fp4` compressed weights.
- Improved `fp32` matmul performance for cases when one of the tensors has a trivial dimension on processors with Intel AVX-512 instruction set support.
Intel Graphics
- Improved `fp16`/`bf16` matmul performance for large tensor cases on Intel Graphics for Intel Core Ultra processor Series 3 (formerly Panther Lake).
- Improved matmul performance for cases with 4-byte alignment on Intel GPUs based on Xe2 architecture.
- Improved performance of `fp16`/`bf16` matmul with `mxfp4` weights.
- Improved convolution performance with host-side scalar scales and zero points.
- Improved matmul performance for LLM inference workloads on Intel GPUs based on Xe2/Xe3 architectures.
- Improved
f32SDPA performance for small head sizes.
AArch64 Processors
- Improved performance of `bf16` matmul.
- Improved performance of `bf16`/`int8` convolutions.
- Improved matmul performance for cases when one of the tensors has a trivial dimension.
- Improved performance of `s8`/`u8` eltwise post-ops on Arm(R) Neoverse(TM) V1 processors.
- Improved `f16` and `bf16` eltwise performance with `abs`, `relu`, `square`, `sqrt`, `clip`, and `clip_v2` algorithms.
- Improved eltwise `exp` algorithm performance on Arm(R) Neoverse(TM) N1 processors.
- Improved reorder primitive performance.
RISC-V Processors
- Improved `f32` matmul, inner product, convolution, softmax, batch normalization, layer normalization, and group normalization primitives performance.
- Improved eltwise and binary primitives performance.
- Improved `f32` and `fp16` pooling primitive performance.
- Improved `fp32` to `u8` reorder primitive performance.
Functionality
Functional API
- Introduced destination tensor dynamic quantization in the matmul primitive following the Open Compute Microscaling (MX) formats specification. See the MXFP8 matmul tutorial for a quick introduction to MX capabilities in oneDNN (a hedged code sketch also follows this list).
- Introduced support for the NVFP4 quantization scheme. The changes include support for `fp8_e4m3` grouped scales and dynamic quantization support for the destination tensor with the NVFP4-specific formula for scales computation.
- Introduced support for dropout as a primitive attribute for matmul, softmax, and eltwise primitives.
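The grouped-scale machinery behind the MX and NVFP4 schemes is driven through the usual quantization attributes. Below is a minimal, hedged C++ sketch of a matmul with `f8_e4m3` tensors and `e8m0` weight scales grouped along the reduction dimension; the shapes, group size, and the assumption that this exact data type and attribute combination is implemented for a given engine and build are illustrative only, so refer to the MXFP8 matmul tutorial shipped with the release for the authoritative example.

```cpp
// Hedged sketch: matmul with MX-style grouped e8m0 weight scales.
// Shapes, group size, and the scale data type used here are illustrative.
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim M = 64, K = 128, N = 32, G = 32; // G: scale group size along K

    // fp8 source and weights, f32 destination.
    memory::desc src_md({M, K}, memory::data_type::f8_e4m3, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::f8_e4m3, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // One e8m0 scale per G x 1 block of the K x N weights tensor.
    primitive_attr attr;
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1),
            /*groups=*/{G, 1}, memory::data_type::e8m0);

    // Primitive creation throws if this combination is not implemented
    // for the selected engine and build.
    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);

    memory src_m(src_md, eng), wei_m(wei_md, eng), dst_m(dst_md, eng);
    memory scl_m({{K / G, N}, memory::data_type::e8m0, memory::format_tag::ab}, eng);

    matmul(pd).execute(strm, {{DNNL_ARG_SRC, src_m}, {DNNL_ARG_WEIGHTS, wei_m},
            {DNNL_ARG_DST, dst_m},
            {DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS, scl_m}});
    strm.wait();
    return 0;
}
```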
Graph API
- Introduced support for RMS Normalization operation.
- Introduced support for output gradient of attention mask for SDPA and GQA training.
Intel Graphics
- Introduced support for convolution with `u8` weights.
- Introduced support for 2D grouped scales in `fp8` and dual zero points in matmul.
- Extended support for 5D and 6D tensors in matmul with post-ops.
Intel 64/AMD64 Processors
- Introduced support for different data types of source and destination in pooling forward propagation.
AArch64 Processors
- Added limited support for the BRGEMM Microkernel API.
- Added limited support for Windows on Arm builds with MSVC.
Usability
Common
- Extended quantization attributes documentation to cover all quantization schemes supported by the library.
- Added matmul fp8 quantization example demonstrating use of matmul primitive with `fp8` source, destination, and weights.
- Enabled `ONEDNN_ENABLE_GRAPH_DUMP` build knob by default.
Intel 64/AMD64 Processors
- Extended oneDNN threadpool runtime with an option to support asynchronous execution and updated all CPU implementations accordingly. This extension makes oneDNN compatible with OpenXLA "thunk" runtime.
- Introduced `ONEDNN_SAFE_RBP` build knob that instructs x64 implementations to preserve the value of the `rbp` register for tools that rely on stack unwinding. This option may have visible performance impact on some workloads.
AArch64 Processors
- Fixed a potential overflow on AArch64 builds with Arm Compute Library.
- Significantly reduced memory consumption of convolution primitive with large spatial filters during primitive creation.
Intel Graphics
- Removed build time dependency on OpenCL runtime in SYCL build configuration.
Validation
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive (a hedged migration sketch follows).
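For reference, here is a minimal, hedged sketch of what the migration can look like: a `dnnl::sgemm`-style C = A * B (f32, row-major) expressed through the matmul primitive. Alpha/beta scaling from the GEMM API would map to scales and a sum post-op and is not shown; shapes and layouts are illustrative.

```cpp
// Hedged migration sketch: replacing a dnnl::sgemm call with the matmul
// primitive. A is M x K, B is K x N, C is M x N, all f32 and row-major.
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

void gemm_via_matmul(const float *A, const float *B, float *C,
        memory::dim M, memory::dim N, memory::dim K) {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    // Wrap the user buffers without copying (CPU engine, native runtime).
    memory a_mem(a_md, eng, const_cast<float *>(A));
    memory b_mem(b_md, eng, const_cast<float *>(B));
    memory c_mem(c_md, eng, C);

    auto pd = matmul::primitive_desc(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, a_mem},
            {DNNL_ARG_WEIGHTS, b_mem}, {DNNL_ARG_DST, c_mem}});
    strm.wait();
}
```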
Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24, Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, czekun @ZackyLake, Deeksha Kasture @kasturedeeksha, Fadi Arafeh @fadara01, Gassan Salama @gassan-arm, Henry Gardiner @henry-gar, @jstachowintel, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Murray Steele @murste01, Narendra Bagria @narenbagria, Joseph Kuo @PershingSquare, @pmanczak, @vishwascm, Yejing Lai @Yejing-Lai, 夏卓昭 @xiazhuozhao
v3.11-rc
Performance Optimizations
Intel 64/AMD64 Processors
- Improved `fp32` matmul performance with `fp4` compressed weights.
- Improved `fp32` matmul performance for cases when one of the tensors has a trivial dimension on processors with Intel AVX-512 instruction set support.
Intel Graphics
- Improved `fp16`/`bf16` matmul performance for large tensor cases on Intel Arc graphics for Intel Core Ultra processor series 3 (formerly Panther Lake).
- Improved matmul performance for cases with 4-byte alignment on Intel GPUs based on Xe2 architecture.
- Improved performance of `fp16`/`bf16` matmul with `mxfp4` weights.
- Improved convolution performance with host-side scalar scales and zero points.
AArch64 Processors
- Improved performance of `s8`/`u8` eltwise post-ops on Arm(R) Neoverse(TM) V1 processors.
- Improved `f16` and `bf16` eltwise performance for `abs`, `relu`, `square`, `sqrt`, `clip`, and `clip_v2`.
- Improved `exp` eltwise performance on Arm(R) Neoverse(TM) N1 processors.
- Improved reorder primitive performance.
- Added matmul optimizations for GEMVs.
- Improved performance of `bf16` matmul.
- Improved performance of `bf16`/`int8` convolutions.
- Convolutions with large spatial filters now consume much less memory during primitive setup.
RISC-V Processors
- Improved eltwise and binary primitives performance.
- Improved `f32` GEMM performance.
- Improved `f32` matmul, softmax, convolution and inner product primitives performance.
- Improved `f32` batch, group and layer normalization primitives performance.
- Improved `f32` and `fp16` pooling primitive performance.
- Improved reorder (`fp32` to `u8`) primitive performance.
Functionality
Functional API
- Introduced destination tensor dynamic quantization in the matmul primitive following the Open Compute Microscaling (MX) formats specification. See the MXFP8 matmul tutorial for a quick introduction to MX capabilities in oneDNN.
- Introduced support for the NVFP4 quantization scheme. The changes include support for `fp8_e4m3` grouped scales and dynamic quantization support for the destination tensor with the NVFP4-specific formula for scales computation.
- Introduced support for dropout as a primitive attribute for matmul, softmax, and eltwise primitives.
Graph API
- Introduced support for RMS Normalization operation.
- Introduced support for output gradient of attention mask for SDPA and GQA training.
Intel Graphics
- Introduced support for convolution with `u8` weights.
- Introduced support for 2D grouped scales in `fp8` matmul.
Intel 64/AMD64 Processors
- Introduced support for different data types of source and destination in pooling forward propagation.
AArch64 Processors
- Added limited support for the BRGEMM Microkernel API.
- Added limited support for Windows on Arm builds with MSVC.
Usability
- Extended quantization attributes documentation to cover all quantization schemes supported by the library.
- Added matmul fp8 quantization example demonstrating use of matmul primitive with `fp8` source, destination, and weights.
- Extended oneDNN threadpool runtime with an option to support asynchronous execution and updated all CPU implementations accordingly. This extension makes oneDNN compatible with OpenXLA "thunk" runtime.
- Extended information about primitive execution available in VTune(TM) Profiler with the same level of detail as reported by oneDNN verbose mode. This feature requires VTune Profiler 2025.7 or later.
- Introduced `ONEDNN_SAFE_RBP` build knob that instructs x64 implementations to preserve the value of the `rbp` register for tools that rely on stack unwinding. This option may have visible performance impact on some workloads.
- Removed build time dependency on OpenCL runtime in SYCL build configuration.
- `ONEDNN_ENABLE_GRAPH_DUMP` build knob is enabled by default.
- Fixed a potential overflow on AArch64 builds with Arm Compute Library.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24, Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, czekun @ZackyLake, Deeksha Kasture @kasturedeeksha, Fadi Arafeh @fadara01, Gassan Salama @gassan-arm, Henry Gardiner @henry-gar, @jstachowintel, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Murray Steele @murste01, Narendra Bagria @narenbagria, Joseph Kuo @PershingSquare, @pmanczak, @vishwascm, Yejing Lai @Yejing-Lai, 夏卓昭 @xiazhuozhao.
v3.10.2
This is a patch release containing the following changes to v3.10.1:
- Fixed a memory leak in Graph API related to host scalars use (0441245)
- Fixed `f16` matmul performance regression with `int4` weights on Intel Arc graphics for Intel Core Ultra processors (Series 3) (789711c, a160247)
- Fixed `bf16` matmul performance regression on Intel Xeon processors with Intel AMX instruction set support (c29ec26)
- Changed register allocation in BRGEMM kernel to avoid register conflicts and improve code safety (95d651b)
- Fixed a crash related to incorrect caching of `int8` convolution primitive on Intel GPUs (28ccca4, 0bc8060)
- Fixed a bug preventing correct detection of Intel AVX 10.2 instruction set on Intel Xeon processors (568171c)
v3.10.1
This is a patch release containing the following changes to v3.10:
- Fixed an issue with reorder primitive returning `unimplemented` for cases when only one scale mask is defined on AArch64 processors (be92457)
- Fixed sporadic correctness issue in `fp32` matmul on Intel GPUs based on Xe2 architecture (b4a761c)
- Fixed correctness issue in `fp16`/`bf16` matmul on Intel GPUs based on Xe3 architecture (48c114b)
- Fixed performance regression in `bf16` convolution weight gradient on Intel Arc Graphics B-series (3b6665b)
- Improved convolution performance on AArch64 processors with SVE128 support (808227d)
- Fixed regression in matmul primitive creation time on Intel GPUs (599ecb5)
- Fixed potential overflow for matmul, convolution and inner product primitives with Arm Compute Library (be12d8c)
- Fixed convolution performance regression on Intel Arc Graphics B-series (7e27159)
v3.10
Performance Optimizations
Intel Architecture Processors
- Improved performance on future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (a hedged opt-in sketch follows this list).
- Improved performance on future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved performance of matmul primitive on processors with Intel AMX support.
- Improved performance of `f32` matmul primitive for GEMV cases on processors with Intel AVX2 instruction set support.
- Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
- Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX2 and Intel AVX512 instruction set support.
- Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX512, Intel DL Boost and bfloat16 instruction set support.
- Improved performance of `int8` convolution primitive when using zero points.
- Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
- Improved performance of `f32` and `bf16` convolution primitive with `int8` destination.
- Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using OpenMP runtime.
- Improved performance of subgraphs containing sequence of multiple binary ops with Graph API.
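For the opt-in mentioned above, here is a hedged C++ sketch of how an application can set the dispatch limit programmatically before any oneDNN call and then check what the library reports. It assumes `ONEDNN_MAX_CPU_ISA` is read once before the first primitive is created, and it prints the effective ISA as a raw integer because enum values for future ISAs may not be present in installed headers.

```cpp
// Hedged opt-in sketch: enable not-yet-default ISA dispatch via
// ONEDNN_MAX_CPU_ISA. The variable must be set before the first oneDNN
// call in the process; setting it later has no effect.
#include <cstdio>
#include <cstdlib>
#include <oneapi/dnnl/dnnl.hpp>

int main() {
#ifndef _WIN32
    setenv("ONEDNN_MAX_CPU_ISA", "AVX10_2_512_AMX_2", /*overwrite=*/1);
#else
    _putenv_s("ONEDNN_MAX_CPU_ISA", "AVX10_2_512_AMX_2");
#endif
    // Query what the library actually dispatches to on this machine.
    dnnl::cpu_isa isa = dnnl::get_effective_cpu_isa();
    std::printf("effective cpu isa id: %d\n", static_cast<int>(isa));
    return 0;
}
```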
Intel Graphics Products
- Improved GEMM performance for small batch size on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly Alchemist) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
- Improved `int8` matmul performance with `int4` weights and per-tensor zero-points.
- Improved `bf16` matmul performance with `fp8` weights.
- Graph API optimizations:
  - Improved Scaled Dot Product Attention (SDPA) subgraph performance for inference when relaxed accumulation mode is enabled on Intel Core Ultra processors (formerly Meteor Lake).
  - Improved SDPA and GQA subgraphs performance when using host-side scalars.
  - Improved performance of GQA subgraph for 2nd token scenarios.
  - Improved performance of subgraphs containing sequence of multiple binary ops.
  - Improved performance of Grouped Query Attention (GQA) subgraphs for training forward and backward propagation.
AArch64-based Processors
- Improved reorder primitive performance.
- Improved `bf16` convolutions performance.
- Improved convolutions performance on CPUs with 128-bit SVE support.
- Improved eltwise primitive performance on Arm(R) Neoverse(TM) N1 processor.
Functionality
Functional API
- Introduced host-side scalar memory objects. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul and convolution primitives on Intel GPUs.
- Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve performance in case of `int8` activations and `int8` weights with zero-point.
Graph API
- Introduced `host_scalar` property for logical tensors. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported to define attention scale, sequence length, and the negative infinity value in SDPA/GQA subgraphs.
- Introduced accumulation mode attribute support in `Matmul` op. This attribute allows relaxing `fp32` accumulation requirements to achieve performance benefits on some platforms.
Intel Graphics Products
- Introduced support for `fp4` weights in matmul primitive.
- Introduced support for weight scales and zero-points with group size 16 in matmul with compressed weights.
Intel Architecture Processors
- Introduced `fp4` weights support for `fp32` matmul and convolution for future Intel Xeon processors with Intel AVX10.2 instruction set support.
Usability
- Extended diagnostics available in verbose mode for primitive descriptor creation issues.
- Extended dispatch diagnostics in verbose mode output for primitives implementations on Intel GPUs.
Known Limitations
- Convolution primitive may require excessive amount of scratchpad memory for shapes with large input width value on Intel CPUs.
- `bf16` convolution primitive has a performance regression on Intel Arc B-series graphics.
- Reduction primitive may produce incorrect results for tensors exceeding 4 GB on Intel Arc graphics (formerly DG2) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
- Concat primitive may produce incorrect results for certain shapes on Intel Arc A-series GPUs.
- `fp16` matmul primitive has a performance regression on Intel GPUs based on Xe2 architecture.
- `f32` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
- `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max series.
- `bf16` layer normalization backpropagation may produce incorrect results on Intel Data Center GPU Max Series.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Breaking Changes
AArch64-based Processors
- Bumped the minimum required Arm(R) Compute Library version to 52.4.0
Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24,
Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301,
Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117,
Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw,
Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva,
Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc,
@focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0,
@zhangjian29, and @xiazhuozhao.
v3.8.2
This is a patch release containing the following changes to v3.8.1:
- Fixed performance regression for `f32` convolution primitive on processors with Intel AVX-512 instruction set support (5f3af68)
- Introduced support for `f16` destination in `int8` matmul and `int8` inner product on x64 CPUs (53fd12a, 22e252c, f5b2d7f, e4e2f1c)
- Improved RNN primitive performance on processors with Intel AVX2 instruction set support (71e5d81, eb27db2, dd4e627, ff134e0, 5a86c1f, e9395ae)
- Improved `fp32` matmul performance on processors with Intel AVX-512 instruction set support (1119339)
- Fixed segmentation fault in `f32` binary primitive with broadcast on x64 processors (2082e98)
- Fixed correctness issue in `f64` convolution weight gradient with bias on Intel Arc GPUs (a00bfab)
- Updated `spdlog` component to version 1.15.3 (dbb3629)
- Fixed potential undefined behavior in convolution on Intel GPUs (5ac3e31)
- Fixed segmentation fault in convolution implementation with trivial filter on Intel CPUs (908c5fc, f0a0eee)
- Fixed segmentation fault in `f16` convolution with odd dimensions on processors with Intel AVX10.1 instruction set support (78d6835)
- Improved convolution primitive descriptor creation time on x64 processors (e9c5366, fd9dc58, f1d038e)
- Fixed performance regression in `f16` matmul with `int4` weights on Intel Arc Graphics B-series (38d761b)
- Improved `bf16` matmul performance on processors with Intel AMX instruction set support (0887aec)
- Fixed correctness issue in `f32` RNN primitive on processors with Intel AMX instruction set support (460a014)
v3.10-rc
Performance Optimizations
Intel Architecture Processors
- Improved performance on future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2`.
- Improved performance on future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved performance of matmul primitive on processors with Intel AMX support.
- Improved performance of `f32` matmul primitive for GEMV cases on processors with Intel AVX2 instruction set support.
- Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
- Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX2 and Intel AVX512 instruction set support.
- Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX512, Intel DL Boost and bfloat16 instruction set support.
- Improved performance of `int8` convolution primitive when using zero points.
- Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
- Improved performance of `f32` and `bf16` convolution primitive with `int8` destination.
- Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using OpenMP runtime.
- Improved performance of subgraphs containing sequence of multiple binary ops with Graph API.
Intel Graphics Products
- Improved GEMM performance for small batch size on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly DG2) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
- Improved `int8` matmul performance with `int4` weights and per-tensor zero-points.
- Improved `bf16` matmul performance with `fp8` weights.
- Graph API optimizations:
  - Improved Scaled Dot Product Attention (SDPA) subgraph performance for inference when relaxed accumulation mode is enabled on Intel Core Ultra processors (formerly Meteor Lake).
  - Improved SDPA and GQA subgraphs performance when using host-side scalars.
  - Improved performance of GQA subgraph for 2nd token scenarios.
  - Improved performance of subgraphs containing sequence of multiple binary ops.
  - Improved performance of Grouped Query Attention (GQA) subgraphs for training forward and backward propagation.
AArch64-based Processors
- Improved performance of reorder primitive.
- Improved performance of `bf16` convolutions.
- Improved performance of convolutions on 128-bit SVE platforms.
- Improved performance of eltwise on Arm(R) Neoverse(TM) N1.
Functionality
Functional API
- Introduced host-side scalar memory objects. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul and convolution primitives on Intel GPUs.
- Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve performance in case of `int8` activations and `int8` weights with zero-point.
Graph API
- Introduced `host_scalar` property for logical tensors. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported to define attention scale, sequence length, and the negative infinity value in SDPA/GQA subgraphs.
- Introduced accumulation mode attribute support in `Matmul` op. This attribute allows relaxing `fp32` accumulation requirements to achieve performance benefits on some platforms.
Intel Graphics Products
- Introduced support for `fp4` weights in matmul primitive.
- Introduced support for grouped quantization with group size 16 in matmul with `int8` compressed weights.
- Introduced support for group size 16 `int8` decompressed weights with regular weights decompression.
Intel Architecture Processors
- Introduced `fp4` weights support for `fp32` matmul and convolution for future Intel Xeon processors with Intel AVX10.2 instruction set support.
Usability
- Extended diagnostics available in verbose mode for primitive descriptor creation issues.
- Extended dispatch diagnostics in verbose mode output for primitives implementations on Intel GPUs.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Breaking Changes
AArch64-based Processors
- Bumped the minimum required Arm(R) Compute Library version to 52.4.0
Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24,
Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301, Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117, Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva, Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc, @focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0, @zhangjian29, and @xiazhuozhao.
v3.9.2
This is a patch release containing the following changes to v3.9.1:
- Fixed correctness issue in `int8` convolution on processors with Intel AVX2 and Intel DL Boost instruction set support (a7c4079, 78e781f)
- Fixed performance regression for `f32` convolution primitive on processors with Intel AVX-512 instruction set support (74f23b4)
f32convolution primitive on processors with Intel AVX-512 instruction set support (74f23b4) - Fixed performance regression for RNN primitive with LBR GRU cell type on Intel Arc GPUs (ae2844e)
- Fixed performance regression for `int8` convolution primitive when using zero points (dbb8484)
- Fixed segmentation fault in matmul primitive when using `ONEDNN_VERBOSE=all` (7310aa2)
ONEDNN_VERBOSE=all(7310aa2) - Fixed correctness issue in multi-dimensional matmul primitive on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids) (642d18b)
- Reduced problem size in `test_sdpa_decomp` test (9bff06e)
- Restricted `test_sdpa_decomp` and `test_mqa_decomp` tests to `OMP` or `THREADPOOL` CPU runtimes (3cd9170)
- Fixed illegal instruction issue in pooling primitive on processors with Intel SSE4.1 support (d907c47)
- Fixed segmentation fault issue in `f16` backward convolution primitive on processors with Intel AVX2 with Intel DL Boost with float16 and bfloat16 support (50cc228, fcc7e5e)
- Restored support for `int8` matmul with `per_oc` scales and zero points on Intel Arc GPUs (1a5a454, 04c22c9)
v3.9.1
This is a patch release containing the following changes to v3.9:
- Reduced sizes in Graph API SDPA examples (257d689)
- Fixed correctness issue in `bf16` depthwise convolution with `bf16` bias on AArch64 CPUs (218b41d)
- Changed Intel GPU data alignment check from error to warning (5c5008a)
- Improved `bf16` matmul performance on processors with Intel AMX instruction set support (54b6354, 30c4d8d)
- Fixed PowerPC64 build by adding `-mcpu=power10` and `-mmma` flags (02ca915)
- Introduced support for `f16` destination in `int8` matmul and `int8` inner product on x64 CPUs (a62ed6b, 53c0a66, 0750043, 4f0f068)
- Introduced support for `per_tensor` zero-points in `int8` matmul on Intel GPUs (db8e8ff, f783164, 4d458df, 80453a0, 7f90d50, a2200e2)
- Fixed correctness issue in `int8` reorder for cases with compensation on x64 CPUs (771ca54)
v3.9
Performance Optimizations
Intel Architecture Processors
- Introduced initial support for future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2`.
- Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved initialization time for convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
- Improved performance of `fp8` convolution primitive with scales and `bf16` output.
- Improved performance of matmul primitive with post-ops on processors with Intel AMX support.
- Improved performance of RNN primitive for LBR_GRU and VANILLA_LSTM cell types on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with implicit causal mask.
  - Grouped Query Attention (GQA) flavor specific for GEMMA models.
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved matmul performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved RNN primitive performance with LBR_GRU cell type.
- Improved `int8` convolution performance with plain weights and trivial filter.
- Improved convolution performance with `NCHW` activations with 1x1 filter and unit strides.
- Improved `fp32` softmax performance.
- Improved performance of reorder when used with USM host memory.
- Improved performance of the following subgraphs with Graph API:
  - `fp32` SDPA with implicit causal mask.
  - `fp16` SDPA on Intel GPUs without Intel XMX cores.
AArch64-based Processors
- Improved `int8` convolution performance.
- Improved `bf16` depthwise convolution performance.
- Improved `f16` matmul performance with Arm Compute Library (ACL).
Functionality
Functional API
- Introduced Root Mean Square Normalization (RMSNorm) mode for layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs (a hedged usage sketch follows this list).
- Sparse memory objects and sparse matmul are promoted to production status.
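A minimal sketch of the RMSNorm mode through the layer normalization primitive follows. The `normalization_flags::rms_norm` value used here is an assumption about how the mode is exposed (check `normalization_flags` in `dnnl.hpp` shipped with the release for the exact flag name); shapes and epsilon are illustrative.

```cpp
// Hedged sketch: layer normalization in RMSNorm mode. The rms_norm flag name
// is assumed here; verify it against normalization_flags in dnnl.hpp.
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim N = 8, C = 512; // tokens x channels
    memory::desc data_md({N, C}, memory::data_type::f32, memory::format_tag::ab);

    // RMSNorm: no mean subtraction or shift, only scaling by the learned gamma.
    auto flags = normalization_flags::rms_norm | normalization_flags::use_scale;

    auto pd = layer_normalization_forward::primitive_desc(eng,
            prop_kind::forward_inference, data_md, data_md, 1e-6f, flags);

    memory src(data_md, eng), dst(data_md, eng);
    memory scale({{C}, memory::data_type::f32, memory::format_tag::a}, eng);

    layer_normalization_forward(pd).execute(strm,
            {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}, {DNNL_ARG_SCALE, scale}});
    strm.wait();
    return 0;
}
```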
Graph API
- Introduced support for tanh approximation in `GELU` operation.
- Extended Graph API `Softmax` operation to support optional `stats` output.
- Introduced fusion support for SDPA training forward and backward propagation.
- Introduced fusion support for SDPA with bottom-right implicit causal mask.
- Introduced `make_scalar_tensor()` API for engine-agnostic scalar tensor creation.
Microkernel API
- Introduced support for `fp8` data type.
Intel Architecture Processors
- Introduced support for select algorithm in binary post-op.
- Introduced source, destination, and weight scales support in `fp8` convolution and deconvolution primitives.
Intel Graphics Products
- Introduced support for select algorithm in binary primitive.
Generic GPU Vendor
- Introduced support for RNN Vanilla backward propagation.
Usability
- Enabled build with `-Wundef` compiler flag.
- [Experimental] Introduced support for kernel compilation with SYCL kernel compiler extension.
Validation
- Improved benchdnn performance by optimizing input data filling and testing results comparison steps.
- Improved benchdnn graph driver performance mode by adding a CPU memory pool for the allocator.
Known Limitations
- Group normalization with `normalization_flags::use_scale` specified produces incorrect results for backward propagation kind in oneDNN v3.9 and earlier.
- Binary primitive with certain shapes and Graph API SDPA with bottom right causal mask may hang with SYCL debug runtime on Windows.
- `fp8` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
- `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max series.
- `bf16` pooling with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max series.
- `bf16`/`fp16` matmul with large inner dimension has a performance regression on Intel Data Center GPU Max Series.
- `bf16`/`fp16` convolution with `NCHW` activations has a performance regression on Intel Data Center GPU Max Series.
- Softmax with non-trivial strides and blocked format may produce incorrect results.
- `bf16` layer normalization backpropagation may produce incorrect results on Intel Data Center GPU Max Series.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.