
Releases: tracel-ai/burn

v0.20.0-pre.6 (Pre-release)

18 Dec 21:27 · 91dd62c

What's Changed

v0.20.0-pre.5 (Pre-release)

08 Dec 14:53 · 42edc63

What's Changed

v0.20.0-pre.4 (Pre-release)

01 Dec 19:15

What's Changed

v0.20.0-pre.3 (Pre-release)

24 Nov 17:37 · 88d662d

What's Changed

v0.20.0-pre.2 (Pre-release)

17 Nov 15:26 · cc0f22a

What's Changed

v0.20.0-pre.1 (Pre-release)

11 Nov 15:35

Summary

This release includes significant performance improvements, bug fixes, and architectural refactoring.
Key Improvements:

  • TMA autotuning and MMA matmul tuning enabled for better performance
  • ONNX-IR refactored to an op/node-centric architecture
  • IR refactored to define outputs as a function of the operation

Bug Fixes:

  • Fixed autodiff graph cleanup issues (multiple fixes for deferred/consumed nodes)
  • Fixed Linear layer panic when output size is one
  • Fixed PyTorch pickle reader regression with integer dict keys
  • Fixed RoPE sum_dim calculation
  • Fixed tensor *_like dtype preservation
  • Fixed squeeze check for D2 > 0
  • Fixed QLinear implementation
  • Fixed async barrier & TMA checks

New Features:

  • Added matvec operation
  • Added support for custom learning strategies
  • Added Candle device seeding
  • Added Shape::ravel_index for row-major raveling (see the sketch after this list)
  • Generalized linalg::outer semantics with new linalg::outer_dim
  • Implemented error handling for DataError
  • Added square() optimization where appropriate
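
Shape::ravel_index refers to row-major raveling, i.e. collapsing a multi-dimensional index into a single flat offset. As a concept-only sketch in plain Rust (not Burn's actual signature):

// Concept sketch of row-major raveling; not Burn's API.
fn ravel_index(indices: &[usize], shape: &[usize]) -> usize {
    indices
        .iter()
        .zip(shape)
        .fold(0, |flat, (&i, &dim)| flat * dim + i)
}

// For shape [3, 4], index [1, 2] maps to 1 * 4 + 2 = 6.
assert_eq!(ravel_index(&[1, 2], &[3, 4]), 6);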

v0.19.1

06 Nov 16:18

Bug Fixes & Improvements

v0.19.0

28 Oct 17:00

Summary

This release brings major improvements to enable efficient distributed training, quantization, and CPU support in Burn.

To achieve true multi-GPU parallelism, we had to rethink several core systems: we implemented multi-stream execution to keep all GPUs busy, optimized device transfers to avoid unnecessary synchronization, and redesigned our locking strategies to eliminate bottlenecks in autotuning, fusion, and autodiff. We also introduced burn-collective for gradient synchronization and refactored our training loop to support different distributed training strategies.

Additionally, we added comprehensive quantization support, allowing models to use significantly less memory while maintaining performance through fused dequantization and optimized quantized operations.

Finally, we introduced a new CPU backend powered by MLIR and LLVM, bringing the same JIT compilation, autotuning, and fusion capabilities from our GPU backends to CPU execution.

As with previous releases, this version includes various bug fixes, further optimizations and enhanced documentation. Support for ONNX models has also been expanded, with additional operators and bug fixes for better operator coverage.

For more details, check out the release post on our website.

Changelog

Breaking

We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.

Learning Strategy

We refactored the Learner to support better distributed training strategies. Instead of registering a list of device(s), you now specify a training strategy.

  let learner = LearnerBuilder::new(artifact_dir)
      .metric_train_numeric(AccuracyMetric::new())
      .metric_valid_numeric(AccuracyMetric::new())
      .metric_train_numeric(LossMetric::new())
      .metric_valid_numeric(LossMetric::new())
      .with_file_checkpointer(CompactRecorder::new())
-     .devices(vec![device.clone()])
+     .learning_strategy(LearningStrategy::SingleDevice(device.clone()))
      .num_epochs(config.num_epochs)
      .summary()
      .build(
          config.model.init::<B>(&device),
          config.optimizer.init(),
          config.learning_rate,
      );

Learner Training Result

The Learner previously lacked an evaluation loop. We extended its return type to include all training states in a TrainingResult, which includes the trained model and a metrics renderer.

- let model_trained = learner.fit(dataloader_train, dataloader_valid);
+ let result = learner.fit(dataloader_train, dataloader_valid);

- model_trained
+ result
+    .model
     .save_file(format!("{artifact_dir}/model"), &CompactRecorder::new())
     .expect("Trained model should be saved successfully");

This enables the renderer to be reused by the new evaluator so that training and evaluation metrics appear together in the TUI dashboard:

let mut renderer = result.renderer;
let evaluator = EvaluatorBuilder::new(artifact_dir)
    .renderer(renderer)
    .metrics((AccuracyMetric::new(), LossMetric::new()))
    .build(result.model.clone());

evaluator.eval(name, dataloader_test);

Interface Changes

Config

The Config trait now requires Debug:

- #[derive(Config)]
+ #[derive(Config, Debug)]
  pub struct TrainingConfig {
      // ...
  }

BatchNorm

BatchNorm no longer requires the spatial dimension generic:

  #[derive(Module, Debug)]
  pub struct ConvBlock<B: Backend> {
      conv: nn::conv::Conv2d<B>,
-     norm: BatchNorm<B, 2>,
+     norm: BatchNorm<B>,
      pool: Option<MaxPool2d>,
      activation: nn::Relu,
  }

Backend::seed

Seeding is now device-specific:

- B::seed(seed);
+ B::seed(&device, seed);
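
For example, a minimal sketch with the NdArray backend (the concrete backend and its Default device are illustrative choices):

use burn::backend::NdArray;
use burn::tensor::backend::Backend;

// Seed the RNG of one specific device rather than globally.
let device = Default::default();
<NdArray as Backend>::seed(&device, 42);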

Tensor

For consistency with other methods like unsqueeze() / unsqueeze_dim(dim), squeeze(dim) was renamed:

- tensor.squeeze(dim)
+ tensor.squeeze_dim(dim)

We've also added a tensor.squeeze() method which squeezes all singleton dimensions.
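
A minimal sketch of both methods (the output-rank turbofish is an assumption about the const-generic API, and the shapes are for illustration only):

let x = Tensor::<B, 4>::ones([1, 3, 1, 5], &device);
let y = x.clone().squeeze_dim::<3>(0); // removes dim 0 -> shape [3, 1, 5]
let z = x.squeeze::<2>();              // removes all singleton dims -> shape [3, 5]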

Finally, we removed tensor ^ T syntax, which was clunky.

- use burn::tensor::T;
- tensor ^ T
+ tensor.t()

tensor.t() is also a simple alias for tensor.transpose().

Module & Tensor

Datasets & Training

Backends

Bug Fixes


v0.18.0

18 Jul 16:27 · f5d889d

Summary

This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring robust performance across a growing number of supported platforms.

Matrix Multiplication Improvements

Optimized matrix multiplication kernels with specialized implementations for:

  • Matrix-vector (mat@vec)
  • Vector-matrix (vec@mat)
  • Inner product
  • Outer product

We also enhanced the flexibility of the matrix multiplication kernel generation engine, surpassing traditional GEMM (General Matrix Multiply) approaches.
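
All of these paths are selected through the regular matmul call; here is a minimal sketch with shapes chosen to exercise the matrix-vector case (the Wgpu backend and default device are illustrative):

use burn::backend::Wgpu;
use burn::tensor::Tensor;

let device = Default::default();
let a = Tensor::<Wgpu, 2>::ones([128, 64], &device); // matrix
let b = Tensor::<Wgpu, 2>::ones([64, 1], &device);   // column vector
let c = a.matmul(b); // shapes like these can hit the specialized mat@vec path
assert_eq!(c.dims(), [128, 1]);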

For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post.

Fusion Enhancements

  • Improved reliability and performance of Burn Fusion through advanced optimizations.
  • Added support for basic dead code elimination.
  • Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.

Multi-Threading and Memory Management

  • Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
  • Burn Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management. To address this:
    • Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
    • Fixed bugs related to premature memory deallocation, enhancing memory management stability.

CubeCL Config

By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.

A typical cubecl.toml file might look like this:

[profiling]
logger = { level = "basic", stdout = true }

[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }

[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }

Each section configures a different aspect of CubeCL:

  • profiling: Controls performance profiling and logging.
  • autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
  • compilation: Manages kernel compilation logging and cache.

For more info, check out the CubeCL book.

As with previous releases, this version includes various bug fixes, many internal optimizations, and backend upgrades that reinforce the framework's performance and flexibility across platforms.

Changelog

Breaking: the default strides for pooling modules now match the kernel size instead of defaulting to a stride of 1. This will change output shapes wherever strides were not set explicitly (see the output-length sketch after the snippets below).

MaxPool2dConfig

  let pool = MaxPool2dConfig::new(kernel_size)
+     .with_strides([1, 1])
      .with_padding(PaddingConfig2d::Same)
      .init();

MaxPool1dConfig

  let pool = MaxPool1dConfig::new(kernel_size)
+     .with_stride(1)
      .with_padding(PaddingConfig1d::Same)
      .init();

AvgPool2dConfig

  let pool = AvgPool2dConfig::new(kernel_size)
+     .with_strides([1, 1])
      .with_padding(PaddingConfig2d::Same)
      .init();

AvgPool1dConfig

  let pool = AvgPool1dConfig::new(kernel_size)
+     .with_stride(1)
      .with_padding(PaddingConfig1d::Same)
      .init();
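
To see why this changes output shapes, here is a quick sketch of the output-length arithmetic (assuming no padding and no dilation):

// Output length of a 1D pooling window over `input` elements.
fn pooled_len(input: usize, kernel: usize, stride: usize) -> usize {
    (input - kernel) / stride + 1
}

assert_eq!(pooled_len(8, 2, 1), 7); // previous default: stride 1
assert_eq!(pooled_len(8, 2, 2), 4); // new default: stride = kernel size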

Module & Tensor

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous

  • Fix conv2d test tolerance & disable crates cache on stable linux-std runner (#3114) @laggui
  • Replace run-checks scripts with command alias (#3118) @laggui
  • Relax tolerance transformer autoregressive test (ndarray failure) (#3143) @crutcher
  • Add cubecl.toml config (#3150) @nathanielsimard
  • Use CUBECL_DEBUG_OPTION=profile macos ci (#...

v0.17.1

03 Jun 13:07

Bug Fixes & Improvements