
v0.5.0

@SS-JIA released this 30 Jan 17:10 · 396 commits to main since this release · 484a4ab

The 0.5 release of ExecuTorch accompanies the release of PyTorch 2.6, and includes various updates and improvements to ExecuTorch’s backend delegates, as well as slight improvements to the Python and C++ APIs. Most notably, dim order is now enabled by default in ExecuTorch export. For more details, please see this post.

On the Llama model support front, an eager runner has been added to the Llama example to allow running inference in eager mode; additionally, support for AttentionSink has been added for eager-mode execution.

API Changes

  • Introduced a C++ TensorAccessor class for ExecuTorch tensors based on PyTorch’s TensorAccessor class
  • Introduced a Python save(path: str) method on ExecutorchProgramManager to reduce the boilerplate required to serialize to a .pte file (see the sketch after this list)
  • Introduced the C++ PlatformMemoryAllocator class to allow kernel authors to provide their own memory allocation implementation
  • Introduced the num_instructions() function on the C++ Method class
  • Enabled direct serialization of uint16 types in ExecuTorch programs
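
The new save() call slots in at the end of a standard export flow. Below is a minimal sketch, assuming the usual torch.export -> to_edge -> to_executorch pipeline; the toy module, input shape, and output filename are illustrative only.

```python
import torch
from torch.export import export
from executorch.exir import to_edge


class AddOne(torch.nn.Module):
    """Toy module used only to illustrate the export flow."""

    def forward(self, x):
        return x + 1


# Standard pipeline: torch.export -> to_edge -> to_executorch.
exported = export(AddOne(), (torch.randn(4),))
program_manager = to_edge(exported).to_executorch()

# New in 0.5: serialize straight to a .pte file instead of manually
# writing program_manager.buffer to disk.
program_manager.save("add_one.pte")
```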

Build

  • ExecuTorch nightly binaries are now built only for Python 3.10, 3.11, and 3.12
  • Introduced nightly builds for Apple platforms, which can be found here
  • Added support for NumPy 2

Backends

Arm

  • Added support for the following operators:
    1D convolution, Tanh activation, select, 2D max pooling, upsample_nearest2d, cat/stack, rshift, concat, log_softmax, var, layer_norm
  • Improved support for reduction operators
  • Extended softmax to handle dim < 0
  • Added support for keep_dims == True for mean and var operators
  • Enabled reporting of Ethos-U PMU hardware counters in the Arm delegate executor
  • Added support for multiple TOSA specifications
  • Added model evaluation functionality to the AOT compiler

Cadence

  • Migrated most of the graph-level compiler from its internal Meta location to the OSS location
  • The Cadence OSS flow now uses ~50 graph-level optimization passes
  • Various improvements to the export workflow for Cadence chips
  • Expanded operator support to include 33 ATen operators and 11 quantized operators
  • Integrated multiple optimized kernels for HiFi and Fusion chips, resulting in large performance gains (double digit percent to orders of magnitude)
  • Enabled mobilenet_v2 and resnet50 as e2e tests

CoreML

  • Added the option to specify which CoreML compute unit to use in the Llama model export script
  • Fixed a compilation crash on iOS <16
  • Added support for dim order

Qualcomm

  • Enabled batch prefill for Llama with the weight sharing feature
  • Various improvements to Llama model support for both prefill and decode, including sha, static_llama (kv cache as io), graph break reduction, and more
  • Added example for the wav2letter model
  • Added support for the retinanet_fpn model
  • Added support for the SA8295 SoC
  • Added support for QAT
  • Added support for dim order
  • Added DrawGraph utility for graph visualization

MediaTek

  • Integrated the MediaTek backend in the Android Llama application
  • Added support for dim order

MPS

  • Added support for dim order

Vulkan

  • Improved support for Llama model architectures in the Vulkan backend:
    • Added an implementation of a fused SDPA + KV cache update operator
    • Added an implementation of rotary embeddings
  • Various improvements to compute shader latency and memory footprint, such as:
    • Introduced support for push constants in compute shaders, used to pass in tensor metadata (i.e. sizes)
    • Switched from VK_IMAGE_TILING_OPTIMAL to VK_IMAGE_TILING_LINEAR as the default texture tiling setting which greatly reduces memory footprint of image textures used to store tensors
    • Reduced register pressure in compute shaders by using lower precision integer types to store texture positions and tensor indices
  • Added an export pass that automatically inserts transition ops to switch to a different optimal/required storage type or memory layout between operators in the export graph

XNNPACK

  • Updated the XNNPACK version to commit hash 1ed874e65, which includes the newest KleidiAI blockwise kernels and gives around a 20% performance improvement on Llama prefill
  • Added support for delegating models quantized via torchao’s quantize_ API
  • Added a new XNNPACK partitioner with configurable settings that give users greater control over how ops are partitioned
  • Added support for to_edge_transform_and_lower; using this API with the partitioner provides more stable lowerings (see the sketch after this list)
  • Allowed addmm and mm to call dynamic fp32 kernels
  • Fixes to partitioning of unsupported operators
  • Updated the cpuinfo dependency to resolve intermittent faults on UNISOC-based phones
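
A minimal sketch of combining the new partitioner with to_edge_transform_and_lower, assuming the usual import paths for the XNNPACK partitioner and a toy fp32 module; names, shapes, and the output filename are illustrative only.

```python
import torch
from torch.export import export
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


class SmallLinear(torch.nn.Module):
    """Toy module used only to illustrate XNNPACK lowering."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return torch.relu(self.linear(x))


exported = export(SmallLinear(), (torch.randn(1, 8),))

# Partition XNNPACK-supported ops and lower them in a single step; the
# partitioner exposes configuration options that control which ops are
# delegated.
lowered = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner()],
)
lowered.to_executorch().save("small_linear_xnnpack.pte")
```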

Devtools

  • Added a public benchmark dashboard, offering insights into ExecuTorch model performance trends, commit-to-commit comparisons, and anomaly detection. Onboarded Llama 3.2 1B to track performance with SpinQuant, QLoRA, and CoreML ANE.
  • Added support for uint16 in the devtools inspector

Llama Model Support

  • Replaced TorchTune attention with a custom export-friendly ExecuTorch attention
  • Added llama3_2_vision text decoder as a TorchTune exportable model
  • Added a React Native LLaMA app for iOS devices
  • Added support for the bfloat16 dtype in the LLM runner binary and the export_llama script
  • Added support for AttentionSink in the Llama example
  • Added TorchAO MPS low bit operators to the Llama runner
  • Added support for KV cache quantization; currently only 8-bit per-token quantization is supported, with FP32 as the dequantized dtype. This can be enabled in the export_llama script using the --quantize_kv_cache option.
  • Added support for quantized versions of Llama 3.2 1B/3B

Kernel Libraries

  • Implemented several portable operators: pixel_unshuffle, gather, topk, convolution_backward, narrow_copy, masked_select, max.unary_out, min.unary_out, scatter.src_out, scatter.value_out, repeat_interleave.Tensor_out
  • Implemented tile_crop custom operator
  • Implemented scalar trunc primitive operator
  • Implemented BFloat16 support, focusing on LLM operator coverage (op_to_copy, op_mul, op_mm, op_copy, op_slice_scatter, op_scalar_tensor, op_where, op_add, CPUBLAS gemm).
  • Fixed handling of rank 0 tensors in optimized add/sub/div/mul
  • Fixed _native_batch_norm_legit_no_stats_out

First Time Contributors

Thanks to the following contributors for making their first commit for this release!

@navsud, @meyering, @tugsbayasgalan, @Abhishek8394, @RahulK4102, @RdoubleA, @varunchariArm, @laithsakka, @limintang, @veselinp, @MaggieMoss, @azad-meta, @anyj0527, @jainapurva, @suchir1, @ru-m8, @wdvr, @anijain2305, @tianxf99, @sxu, @f-meloni, @Vysarat, @georgehong, @lg-zhang, @h-friederich, @AIWintermuteAI, @itisgrisha, @ykhrustalev, @hietalajulius, @Nick-Wei, @Abhi-hpp, @KapJI, @YIWENX14, @clee2000, @Michiel-Olieslagers, @karthik-manju, @jakmro, @Aleksei-grovety,

Full Changelog: v0.4.0...v0.5.0