The 0.5 release of ExecuTorch accompanies the release of PyTorch 2.6 and includes various updates and improvements to ExecuTorch's backend delegates, as well as slight improvements to the Python and C++ APIs. Most notably, dim order is now enabled by default in ExecuTorch export. For more details, please see this post.
On the Llama model support front, an eager runner has been added to the Llama example to allow running inference in eager mode; additionally, support for AttentionSink has been added for eager mode execution.
API Changes
- Introduced a C++ `TensorAccessor` class for ExecuTorch tensors, based on PyTorch's `TensorAccessor` class
- Introduced a Python `save(path: str)` method on `ExecutorchProgramManager` to reduce the boilerplate code required to serialize to a `.pte` file (see the sketch after this list)
- Introduced the C++ `PlatformMemoryAllocator` class to allow kernel authors to provide their own memory allocation implementation
- Introduced the `num_instructions()` function on the C++ `Method` class
- Enabled direct serialization of `uint16` types in ExecuTorch programs
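For reference, a minimal sketch of the new `save` method using the standard export flow; the toy module and output file name are hypothetical:

```python
import torch
from executorch.exir import to_edge


class AddOne(torch.nn.Module):  # hypothetical toy module
    def forward(self, x):
        return x + 1


# Standard export-to-ExecuTorch flow.
exported = torch.export.export(AddOne(), (torch.randn(4),))
program_manager = to_edge(exported).to_executorch()

# Previously this required writing program_manager.buffer to a file
# by hand; the new save() method does it in one call.
program_manager.save("add_one.pte")
```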
Build
- ExecuTorch nightly binaries are now built only for Python 3.10, 3.11, and 3.12
- Introduced nightly builds for Apple platforms, which are listed here
- Added support for NumPy 2
Backends
Arm
- Added support for the following operators: 1D convolution, Tanh activation, `select`, 2D max pooling, `upsample_nearest2d`, `cat`/`stack`, `rshift`, `concat`, `log_softmax`, `var`, `layer_norm`
- Improved support for reduction operators
- Extended softmax to handle `dim < 0`
- Added support for `keep_dims == True` for the `mean` and `var` operators
- Enabled reporting of Ethos-U PMU hardware counters in the Arm delegate executor
- Added support for multiple TOSA specs
- Added model evaluation functionality to the AOT compiler
Cadence
- Migrated most of the graph-level compiler from its internal Meta location to the OSS location
- The Cadence OSS flow now uses ~50 graph-level optimization passes
- Various improvements to the export workflow for Cadence chips
- Expanded operator support to include 33 ATen operators and 11 quantized operators
- Integrated multiple optimized kernels for HiFi and Fusion chips, resulting in large performance gains (double digit percent to orders of magnitude)
- Enabled `mobilenet_v2` and `resnet50` as e2e tests
CoreML
- Added the option to specify which CoreML compute unit to use in the Llama model export script
- Fixed a compilation crash on iOS <16
- Added support for dim order
Qualcomm
- Enabled batch prefill for Llama with the weight sharing feature
- Various improvements to Llama model support for both prefill and decode, including sha, `static_llama` (KV cache as I/O), graph break reduction, and more
- Added an example for the `wav2letter` model
- Added support for the `retinanet_fpn` model
- Added support for the SA8295 SoC
- Added support for QAT
- Added support for dim order
- Added the `DrawGraph` utility for graph visualization
MediaTek
- Integrated the MediaTek backend in the Android Llama application
- Added support for dim order
MPS
- Added support for dim order
Vulkan
- Improved support for Llama model architectures in the Vulkan backend:
  - Added an implementation of the fused SDPA + KV cache update operator
  - Added an implementation of rotary embeddings
- Various improvements to compute shader latency and memory footprint, such as:
  - Introduced support for push constants in compute shaders, used to pass in tensor metadata (e.g. sizes)
  - Switched from `VK_IMAGE_TILING_OPTIMAL` to `VK_IMAGE_TILING_LINEAR` as the default texture tiling setting, which greatly reduces the memory footprint of image textures used to store tensors
  - Reduced register pressure in compute shaders by using lower-precision integer types to store texture positions and tensor indices
- Added an export pass to automatically insert transition ops that switch to a different optimal or required storage type or memory layout between operators in the export graph
XNNPACK
- Updated the XNNPACK version to commit hash `1ed874e65`, which includes the newest KleidiAI blockwise kernels, giving around a 20% performance improvement on Llama prefill
- Support for delegating models quantized via `torchao`'s `quantize_` API
- New XNNPACK Partitioner, with configurable settings that allow users greater control over how ops are partitioned
- Support for `to_edge_transform_and_lower`; leveraging this API with the partitioner provides more stable lowerings (see the sketch after this list)
- Allowed `addmm` and `mm` to call dynamic fp32 kernels
- Fixes to partitioning of unsupported operators
- Updated the `cpuinfo` dependency to resolve intermittent faults on UNISOC-based phones
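A rough sketch of using the partitioner with `to_edge_transform_and_lower`; the toy module, input shapes, and output file name are hypothetical:

```python
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


class MatMul(torch.nn.Module):  # hypothetical toy module
    def forward(self, x, y):
        return torch.mm(x, y)


exported = torch.export.export(MatMul(), (torch.randn(8, 8), torch.randn(8, 8)))

# The partitioner runs during lowering: subgraphs XNNPACK claims are
# delegated to it, and everything else stays on the portable kernels.
program_manager = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner()],
).to_executorch()
program_manager.save("matmul_xnnpack.pte")
```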
Devtools
- Added a public benchmark dashboard, offering insights into ExecuTorch model performance trends, commit-to-commit comparisons, and anomaly detection. Onboarded Llama 3.2 1B to track performance with SpinQuant, QLoRA, and CoreML ANE.
- Added support for `uint16` in the devtools Inspector (see the sketch after this list)
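A minimal sketch of loading profiling data into the Inspector; the file names are hypothetical, and the ETDump and ETRecord are assumed to have been generated beforehand via the usual devtools flow:

```python
from executorch.devtools import Inspector

# Correlates runtime profiling data (an ETDump) with ahead-of-time
# export metadata (an ETRecord) and prints per-operator statistics;
# uint16 tensor data in the ETDump is now handled as well.
inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")
inspector.print_data_tabular()
```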
Llama Model Support
- Swapped TorchTune attention with custom export-friendly ExecuTorch attention
- Added the `llama3_2_vision` text decoder as a TorchTune exportable model
- Added a React Native LLaMA app for iOS devices
- Added support for the `bfloat16` dtype in the LLM runner binary and the `export_llama` script
- Added support for AttentionSink in the Llama example
- Added TorchAO MPS low bit operators to the Llama runner
- Added support for KV cache quantization; currently only 8-bit per-token quantization is supported, with fp32 as the dequantized dtype. This can be enabled in the `export_llama` script using the `--quantize_kv_cache` option.
- Added support for quantized versions of Llama 3.2 1B/3B
Kernel Libraries
- Implemented several portable operators: `pixel_unshuffle`, `gather`, `topk`, `convolution_backward`, `narrow_copy`, `masked_select`, `max.unary_out`, `min.unary_out`, `scatter.src_out`, `scatter.value_out`, `repeat_interleave.Tensor_out`
- Implemented the `tile_crop` custom operator
- Implemented the scalar `trunc` primitive operator
- Implemented BFloat16 support, focusing on LLM operator coverage (`op_to_copy`, `op_mul`, `op_mm`, `op_copy`, `op_slice_scatter`, `op_scalar_tensor`, `op_where`, `op_add`, CPUBLAS gemm)
- Fixed handling of rank 0 tensors in the optimized `add`/`sub`/`div`/`mul` operators
- Fixed `_native_batch_norm_legit_no_stats_out`
First Time Contributors
Thanks to the following contributors for making their first commit for this release!
@navsud, @meyering, @tugsbayasgalan, @Abhishek8394, @RahulK4102, @RdoubleA, @varunchariArm, @laithsakka, @limintang, @veselinp, @MaggieMoss, @azad-meta, @anyj0527, @jainapurva, @suchir1, @ru-m8, @wdvr, @anijain2305, @tianxf99, @sxu, @f-meloni, @Vysarat, @georgehong, @lg-zhang, @h-friederich, @AIWintermuteAI, @itisgrisha, @ykhrustalev, @hietalajulius, @Nick-Wei, @Abhi-hpp, @KapJI, @YIWENX14, @clee2000, @Michiel-Olieslagers, @karthik-manju, @jakmro, @Aleksei-grovety
Full Changelog: v0.4.0...v0.5.0