Releases: tracel-ai/burn
v0.20.0-pre.6
What's Changed
- doc warning fix by @crutcher in #4130
- Fix tch bf16 into_data by @laggui in #4142
- Update raspberry-pi-pico example to use the Pico 2, and burnpack by @BjornTheProgrammer in #4132
- Unify all_reduce LocalCollectiveClient operation handling. by @crutcher in #4125
- Add direct tensor snapshot retrieval API to ModuleStore by @antimora in #4131
- Fix outer-scope variable references in ONNX subgraphs (If/Loop/Scan) by @antimora in #4119
- Add removed docs for tensor equal_elem by @laggui in #4145
- Add ceil_mode support to pooling operations (MaxPool, AvgPool) by @antimora in #4112
- chore: Update cubecl by @wingertge in #4134
- Implement Slice iterator and utility methods. by @crutcher in #4042
- Bump peter-evans/create-pull-request from 7 to 8 by @dependabot[bot] in #4148
- Add slice_dyn, slice_assign_dyn, and slice_fill_dyn variants. by @crutcher in #4127
- Add Reshape scalar optimization and Gather scalar input support by @antimora in #4146
- Shape FromStr/ToString by @crutcher in #4143
- Add contiguous reindexing for non-contiguous layer indices by @antimora in #4150
- Add warmup epochs to MetricEarlyStoppingStrategy. (#3970) by @crutcher in #4041
- fix(onnx): Use activation function for GELU codegen instead of non-existent tensor method by @antimora in #4161
- Refactor more basic ops by @laggui in #4156
- Refactor LocalCollectiveServer for improved clarity and error handling by @crutcher in #4126
- Fix typo in comment for logger_task function by @crutcher in #4159
- Refactor configurable backend tests (no more testgen macros) by @laggui in #4129
- Zero-copy loading for embedded burnpack weights by @antimora in #4154
- Fix candle cuda imports by @laggui in #4171
- Backends no longer depend on burn-tensor, but strictly burn-backend by @laggui in #4169
- Chore/update cubek cubecl by @nathanielsimard in #4172
- Add ONNX CumSum operator support by @antimora in #4162
- Add backend supports_dtype by @laggui in #4155
- Fix attention shapes and out rank by @laggui in #4192
- Fix matmul & reduce execute fuse no autotune by @laggui in #4193
- Fix output dtype for argmin / argmax by @laggui in #4195
- Add flatten_dims method to Shape and refactor tensor flattening API by @crutcher in #4189
- Return slice for each dimension in shape by @laggui in #4152
- Make xtask validate run no-std checks first. by @crutcher in #4198
- Fix: CubeCL Reduce by @nathanielsimard in #4197
- Reorganize and tracing::instrument collective operations. by @crutcher in #4157
- Log running values by @Charles23R in #4199
- Remove global ONNX opset version restriction, recommend opset 16 by @antimora in #4168
- Fix dtype preservation when loading tensors in burn-store by @antimora in #4194
- Fix TchTensor::from_data bf16 by @laggui in #4203
- Perf/reduce cpu + Fix OOB by @nathanielsimard in #4204
- feat: Implicit GEMM weight gradients for convolution by @wingertge in #4182
- Fix checkpoint and summary log level by @J-F-Liu in #4201
- fix: handle 1D slope when importing prelu from onnx by @mertalev in #4205
- Zero-copy tensor loading for NdArray backend by @antimora in #4178
- Fix quantized tensor storage data length calculation by @antimora in #4180
- Fix handling scalar scan outputs in ONNX loop nodes by @antimora in #4210
- Perf/improve reduce autotuning + plane non uniform control flow check by @nathanielsimard in #4208
- Add ONNX external data support for models >2GB by @antimora in #4158
- Update/cubek by @louisfd in #4214
- Refactor: Replace canonicalize_dim with expect_dim by @crutcher in #4196
- fix: handle negative indices in onnx gather op by @mertalev in #4207
- Refactor/cube dim by @nathanielsimard in #4217
- Refactor: Consolidate shape and slice error handling into ExpressionError by @crutcher in #4218
- Update: CubeK by @louisfd in #4222
- feat: Accelerated convolution data gradient by @wingertge in #4220
- Fix repeat 0 times by @laggui in #4216
- Burn train api refactor by @Charles23R in #4223
- Chore/pre release 6 by @nathanielsimard in #4224
v0.20.0-pre.5
What's Changed
- Bump version by @nathanielsimard in #4102
- Handle empty tensors in cat and slice_assign ops by @antimora in #4095
- Add network utilities to burn-std by @laggui in #4104
- Remove RefCell from onnx-ir Arguments by @antimora in #4094
- Fix raspberry pi pico example not compiling by @BjornTheProgrammer in #4034
- Flash Attention module by @louisfd in #4089
- [Breaking] Add IndexingUpdateOp to scatter and select_assign by @laggui in #4070
- Feat/improve errors by @nathanielsimard in #4110
- Add 256-byte tensor alignment to burnpack format for mmap zero-copy support by @antimora in #4100
- Add CrossAttention module to burn-nn by @huy209vn in #4101
- Add reflect and edge padding modes to tensor.pad by @antimora in #4105
- Add LSTM operator support with configurable activations by @antimora in #4106
- Add memory-mapped ONNX loading with lazy tensor data by @antimora in #4097
- Refactor RemoteDevice to use a thread-safe global address registry. by @crutcher in #4113
- Partial cleanup of RemoteSender api. by @crutcher in #4108
- Move backend traits and types to burn-backend by @laggui in #4111
- Fix remote sync error by @laggui in #4117
- Small LSTM clean up of unused variable by @antimora in #4116
- Fix/autotune checks by @nathanielsimard in #4114
- Include katex header as symlink by @laggui in #4118
- chore: Update cubecl by @wingertge in #4120
- Fix GLU and quiet softmax activations by @laggui in #4121
- Migrate ONNX import to burnpack format (removing Record type) by @antimora in #4122
- Combined PRs by @github-actions[bot] in #4140
- Chore/pre release 5 by @nathanielsimard in #4141
v0.20.0-pre.4
What's Changed
- Make TransformerEncoderLayer fields public by @Mnwa in #4053
- Feature muon by @NewBornRustacean in #3925
- Implement FromStr for Slice with parsing and error handling by @crutcher in #3983
- chore: Update to cubecl scalar refactor by @wingertge in #4062
- refactor: cubecl Runtime trait by @wingertge in #4065
- Fix scatter values backward by @khoek in #4064
- Refactor/autotuner by @nathanielsimard in #4068
- Fix MPS "Placeholder storage has not been allocated" error for embedding operations by @antimora in #4073
- Remove burn-import abstraction layer and use onnx-ir types directly by @antimora in #4033
- More correctness fixes in autodiff ops by @khoek in #4069
- Fix transaction read by @laggui in #4074
- Feat/error handling cubecl by @nathanielsimard in #4076
- Move types from burn-tensor by @laggui in #4050
- burn-store enhancements for troubleshooting and new enum skip flag by @antimora in #4051
- Re-enabled no-std support for safetensors store by @antimora in #4071
- Fix tch bf16 kind by @laggui in #4088
- Feat/runtime error by @nathanielsimard in #4079
- Fix ConstantOfShape output size determination by @antimora in #4085
- Fix reduce codegen to use turbofish for squeeze_dims by @antimora in #4086
- Fix Expand operation to use ONNX max-semantics by @antimora in #4082
- Add ONNX GridSample op support and tests by @antimora in #4084
- Fix Slice operation to handle empty ranges by @antimora in #4083
- Add RF-DETR model check for burn-import by @antimora in #4087
- Fix cubecl by @BjornTheProgrammer in #4092
v0.20.0-pre.3
What's Changed
- Node to Enum-based design for type-safe IR by @antimora in #4019
- Ignore number_prefix advisory from tokenizers by @laggui in #4037
- BUG: Fixed burn version by @Marc-AnthonyG in #4035
- Refactor/dtype cubecl by @nathanielsimard in #4032
- Fix parallel spelling error. by @crutcher in #4046
- Refactor MetricEntry by @Charles23R in #4031
- Bump actions/checkout from 5 to 6 by @dependabot[bot] in #4047
- Refactor of burn fusion and burn cubecl fusion by @nathanielsimard in #4044
- update cubecl by @louisfd in #4045
- Cleanup autodiff unused roots by @laggui in #4039
- Fix autotuner by @nathanielsimard in #4049
- Combined PRs by @github-actions[bot] in #4059
- Fix floating point norm test tolerance by @laggui in #4061
- Add support for yolo12x model variant check by @antimora in #4048
- Chore: Prepare pre-release 3 by @nathanielsimard in #4060
v0.20.0-pre.2
What's Changed
- Add ONNX control flow operators: If, Loop, and Scan by @antimora in #3936
- Fix fusion reduce local already registered as output by @laggui in #4014
- Silero VAD ONNX model verification by @antimora in #3999
- Feat/pinned memory staging by @nathanielsimard in #4016
- Refactor metric logger : epoch summary and multiple entries at once by @Charles23R in #4017
- Fix cuda mem error by @nathanielsimard in #4020
- Add GaussianNoise layer by @kul-sudo in #4022
- Fix remainder int by @laggui in #4015
- Feat/optim/distributed by @nathanielsimard in #4018
- Cleanup quantization strategy (CPU ref, ndarray only) by @laggui in #4023
- chore: remove repetitive words in comment by @black5box in #4029
- feat: Enable tuning specialized matmul by @wingertge in #4026
v0.20.0-pre.1
Summary
This release includes significant performance improvements, bug fixes, and architectural refactoring.
Key Improvements:
- TMA autotuning and MMA matmul tuning enabled for better performance
- ONNX-IR refactored to an op/node-centric architecture
- IR refactored to define outputs as a function of the operation
Bug Fixes:
- Fixed autodiff graph cleanup issues (multiple fixes for deferred/consumed nodes)
- Fixed Linear layer panic when output size is one
- Fixed PyTorch pickle reader regression with integer dict keys
- Fixed RoPE sum_dim calculation
- Fixed tensor *_like dtype preservation
- Fixed squeeze check for D2 > 0
- Fixed QLinear implementation
- Fixed async barrier & TMA checks
New Features:
- Added matvec operation
- Added support for custom learning strategies
- Added Candle device seeding
- Added Shape::ravel_index for row-major raveling (see the sketch after this list)
- Generalized linalg::outer semantics with new linalg::outer_dim
- Implemented error handling for DataError
- Added square() optimization where appropriate
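Row-major raveling maps a multi-dimensional index to a flat offset. Below is a minimal standalone sketch of the idea; the helper name and signature are hypothetical and may not match Burn's actual Shape::ravel_index API:

```rust
/// Hypothetical illustration of row-major raveling (not Burn's actual API):
/// offset = ((i0 * d1 + i1) * d2 + i2) * ... for dims [d0, d1, d2, ...].
fn ravel_index_row_major(index: &[usize], dims: &[usize]) -> usize {
    assert_eq!(index.len(), dims.len(), "index rank must match shape rank");
    index.iter().zip(dims).fold(0, |offset, (&i, &d)| {
        debug_assert!(i < d, "index out of bounds for dimension");
        offset * d + i
    })
}

// For dims [2, 3, 4], index [1, 2, 3] ravels to (1 * 3 + 2) * 4 + 3 = 23.
```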
v0.19.1
Bug Fixes & Improvements
- Autodiff: fixed RAM memory leak with correct graph cleanup (#3957 #3982) @laggui
- Better memory reuse: improved sliced memory pool implementation (#3941) @nathanielsimard
- Cuda: update cudarc, auto-detect CUDA version and fix some 12.8 features (CubeCL #1008) @wingertge
- Quantized Linear: fixed fusion configuration to fuse more precisions (#3941) @nathanielsimard
- PyTorch import: fixed pickle reader regression with integer dictionary keys (#3978) @laggui
- Docs: switched to doc_cfg to fix docs.rs builds (#3979) @laggui
- Tensor API fixes:
v0.19.0
Summary
This release brings major improvements to enable efficient distributed training, quantization, and CPU support in Burn.
To achieve true multi-GPU parallelism, we had to rethink several core systems: we implemented multi-stream execution to keep all GPUs busy, optimized device transfers to avoid unnecessary synchronization, and redesigned our locking strategies to eliminate bottlenecks in autotuning, fusion, and autodiff. We also introduced burn-collective for gradient synchronization and refactored our training loop to support different distributed training strategies.
Additionally, we added comprehensive quantization support, allowing models to use significantly less memory while maintaining performance through fused dequantization and optimized quantized operations.
Finally, we introduced a new CPU backend powered by MLIR and LLVM, bringing the same JIT compilation, autotuning, and fusion capabilities from our GPU backends to CPU execution.
As with previous releases, this version includes various bug fixes, further optimizations and enhanced documentation. Support for ONNX models has also been expanded, with additional operators and bug fixes for better operator coverage.
For more details, check out the release post on our website.
Changelog
Breaking
We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.
Learning Strategy
We refactored the Learner to support better distributed training strategies. Instead of registering a list of device(s), you now specify a training strategy.
let learner = LearnerBuilder::new(artifact_dir)
.metric_train_numeric(AccuracyMetric::new())
.metric_valid_numeric(AccuracyMetric::new())
.metric_train_numeric(LossMetric::new())
.metric_valid_numeric(LossMetric::new())
.with_file_checkpointer(CompactRecorder::new())
- .devices(vec![device.clone()])
+ .learning_strategy(LearningStrategy::SingleDevice(device.clone()))
.num_epochs(config.num_epochs)
.summary()
.build(
config.model.init::<B>(&device),
config.optimizer.init(),
config.learning_rate,
);
Learner Training Result
The Learner previously lacked an evaluation loop. We extended its return type to a TrainingResult that captures the training state, including the trained model and a metrics renderer.
- let model_trained = learner.fit(dataloader_train, dataloader_valid);
+ let result = learner.fit(dataloader_train, dataloader_valid);
- model_trained
+ result
+ .model
.save_file(format!("{artifact_dir}/model"), &CompactRecorder::new())
.expect("Trained model should be saved successfully");This enables the renderer to be reused by the new evaluator so that training and evaluation metrics appear together in the TUI dashboard:
let mut renderer = result.renderer;
let evaluator = EvaluatorBuilder::new(artifact_dir)
.renderer(renderer)
.metrics((AccuracyMetric::new(), LossMetric::new()))
.build(result.model.clone());
evaluator.eval(name, dataloader_test);
Interface Changes
Config
The Config trait now requires Debug:
- #[derive(Config)]
+ #[derive(Config, Debug)]
pub struct TrainingConfig {
// ...
}
BatchNorm
BatchNorm no longer requires the spatial dimension generic:
#[derive(Module, Debug)]
pub struct ConvBlock<B: Backend> {
conv: nn::conv::Conv2d<B>,
- norm: BatchNorm<B, 2>,
+ norm: BatchNorm<B>,
pool: Option<MaxPool2d>,
activation: nn::Relu,
}
Backend::seed
Seeding is now device-specific:
- B::seed(seed);
+ B::seed(&device, seed);
Tensor
For consistency with other methods like unsqueeze() / unsqueeze_dim(dim), squeeze(dim) was renamed:
- tensor.squeeze(dim)
+ tensor.squeeze_dim(dim)
We've also added a tensor.squeeze() method which squeezes all singleton dimensions.
Finally, we removed tensor ^ T syntax, which was clunky.
- use burn::tensor::T;
- tensor ^ T
+ tensor.t()
tensor.t() is also a simple alias for tensor.transpose().
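A quick sketch of the renamed methods in use, assuming a generic backend B is in scope and that the output ranks are supplied through the type annotations (exact signatures may differ slightly):

```rust
use burn::tensor::{backend::Backend, Tensor};

// Illustrative only: `squeeze_demo` is a hypothetical helper, not part of Burn.
fn squeeze_demo<B: Backend>(device: &B::Device) {
    let x = Tensor::<B, 3>::ones([2, 1, 3], device);

    // Squeeze a specific dimension (formerly `squeeze(dim)`).
    let _a: Tensor<B, 2> = x.clone().squeeze_dim(1); // shape [2, 3]

    // Squeeze all singleton dimensions at once.
    let _b: Tensor<B, 2> = x.squeeze(); // shape [2, 3]

    // `t()` replaces the removed `tensor ^ T` syntax; it aliases `transpose()`.
    let m = Tensor::<B, 2>::ones([4, 5], device);
    let _mt = m.t(); // shape [5, 4]
}
```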
Module & Tensor
- Fix unsqueeze rank check (#3429) @laggui
- Feat/quant block (#3442) @laggui
- Kill tensor^T magic transpose marker in favor of tensor.t(). (#3452) @crutcher
- ADD GLU activation function (#3444) @bn-c
- Add quantization params precision (#3453) @laggui
- Improve select_assign check (#3483) @laggui
- Add grid_sample function (#3495 #3523 #3522) @Cielbird
- save_tensor_as_image utility (#3520) @Cielbird
- Add affine_grid_2d (#3526) @Cielbird
- ADD missing Debug derive for embedding (#3547) @bn-c
- Dot Product Op (#3537) @kikefdezl
- Lift .full()/.full_like() into base Tensor - support Tensor<B, D, Bool>::full()/full_like(). (#3562) @crutcher
- Make Distribution::Default the Default::default(). (#3582) @crutcher
- Implement int matmul (#3575) @wingertge
- Feat/quant formats (#3613) @laggui
- Switch Tensor::swap_dims/permute to AsIndex dim support. (#3619) @crutcher
- Tensor::flatten() => AsIndex dims support. (#3620) @crutcher
- Remove D param from BatchNorm<B, D>. (#3625) @crutcher
- nn.activation; Activation (#3603 #3693) @crutcher
- Add q4 q2 quantization (#3617) @laggui
- Introduce NormLayer abstraction for unified normalization layers. (#3630) @crutcher
- Add dtype to trait creation ops (#3670) @laggui
- Make Config require Debug (#3689) @crutcher
- Add NormalizationConfig::with_num_features() and related (#3688) @crutcher
- Module quantization w/ tests (#3637) @nathanielsimard
- Add NumPy-like take operation with multi-dimensional index support (#3681) @antimora
- Added trace and diag with batch support for linalg crate (#3703) @niklund
- Add step support to tensor slice operations (#3748) @antimora
- Tensor::unfold(dim, size, step) (#3751 #3782 #3783) @crutcher
- Slice assign with steps (#3776) @antimora
- Add bool_xor operation for boolean tensors (#3785) @crutcher
- [Breaking] Make squeeze/squeeze_dim consistent with other APIs (#3790) @laggui
- Add cross product (#3743) @SinanGncgl
- Enable stepped slicing for slice_fill and complete slice API cleanup (#3784) @antimora
- Tensor::rank() (#3797) @crutcher
- AsIndex dim handling for Numeric ops (#3795) @crutcher
- Add outer and outer_batch ops in linalg (#3786) @huy209vn
- Tensor::_dims() (#3811) @crutcher
- Add tensor.cumsum(dim) first implementation (#3806) @antimora
- slice_fill() should pick a compatible dtype (#3826) @crutcher
- Default LU decomposition implementation (#3816) @DimitriTimoz
- Add tensor.square and fast-path int-power exponents. (#3847) @crutcher
- Add cumulative operations: cumprod, cummin, and cummax (#3819) @antimora
- Add Tensor::sum_dims_squeeze(dims) (#3817) @crutcher
- Allow linear to use quantized matmul (#3913) @wingertge
Datasets & Training
- Pre-Shuffle Multithread DataLoaders on Shuffle (#3390) @crutcher
- PixelDepth + Copy (#3419) @crutcher
- Add Dice-Sorenson Coefficient Metric (#3407) @MathijsdeBoer
- Add SelectionDataset, refactor ShuffledDataset, and add transform tests. (#3406) @crutcher
- Evenly distribute complete chunks/batches across partial dataset splits (#3476) @laggui
- Distributed Data Parallel (#3456) @Cielbird
- Use tensor ops for clip_by_norm (#3485) @laggui
- SamplerDataset distribution fix; constructors and builder. (#3490) @crutcher
- Unify transform usage of RngOptions. (#3577) @crutcher
- Fix bugs with ddp learning (#3581) @Cielbird
- Add support for CIFAR-10 and CIFAR-100 datasets (#3579) @buttfa
- Add with_interrupter for LearnerBuilder (#3611) @amfaber
- Improved Burn Train (#3614 #3935) @nathanielsimard @laggui
- Add 'TextFolderDataset' struct and AgNewsDataset (#3698) @buttfa
- Add PerplexityMetric for language model evaluation (#3707) @TheDarkchip
- Adding CER/WER metrics (#3418) @yazanmashal03
- Fix/autodiff/multi threads (#3793) @nathanielsimard
- Add cautious_weight_decay to AdamW optimizer. (#3869) @crutcher
- Fix evaluator dataloader device (#3893) @laggui
Backends
- Migrate to new cubecl multi tensor handle changes (#3136) @wingertge
- More memory control with scoped static memory management (#3410) @nathanielsimard
- Feat/fusion quant (#3454) @nathanielsimard
- Expose client utilities (#3559) @allenqm
- New CPU backend based on MLIR (#3411) @marcantoinem
- feat: ndarray dynamic tensor types and int tensor cast (#3647) @wingertge
- Implement optimized bool_select for primary backends (#3710) @TheDarkchip
- Add backend level is_nan / is_inf implementations (#3809) @laggui
- Feat/persistent memory (#3842) @nathanielsimard
- feat: add backend implementations for Trunc op (#3860) @mooori
Bug Fixes
- Fix ndarray interpolate coord precision at boundaries (#3481) @laggui
- Fix ndarray conv2d groups channels (#3415) @laggui
- Fix candle mask broadcasting (#3489) @laggui
- Update cubecl: fix wgpu vec to scalar cast (#3496) @Cielbird
- Fix/conv2d groups backward (#3521) @laggui
- Fix/conv3d backward groups (#3533) @laggui
- [Fix] Add some missing handling for flex32 (#3551) @wingertge
- Fix backward scatter dim (#3555) @laggui
- fix: Use correct datatype when filling boolean tensors (#3593) @wingertge
- fix: Ensure output layout is the same for non-inplace SIMD ops in ndarray (#3604) @wingertge
- Fix scalar binop not contiguous (#3636) @laggui
- Fix dtype dispatch in cubecl module ops (#3658) @laggui
- Fix wgpu bool and/or (#3664) @laggui
- Fix tch bool ones and rand int (#3684) @laggui
- fix: Select assign + bool cast (#3730) @wingertge
- Fix register_float_tensor to use the correct dtype (#3774) @A2va
- Fix: autotune errors with fu...
v0.18.0
Summary
This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've expanded our CI testing suite to address multi-threading, lazy evaluation, and async execution issues, ensuring robust performance across an increasing number of supported platforms.
Matrix Multiplication Improvements
Optimized matrix multiplication kernels with specialized implementations for:
- Matrix-vector (mat@vec)
- Vector-matrix (vec@mat)
- Inner product
- Outer product
We also enhanced the flexibility of the matrix multiplication kernel generation engine, surpassing traditional GEMM (General Matrix Multiply) approaches.
For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post.
Fusion Enhancements
- Improved reliability and performance of Burn Fusion through advanced optimizations.
- Added support for basic dead code elimination.
- Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.
Multi-Threading and Memory Management
- Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
- Burn Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management. To address this:
- Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
- Fixed bugs related to premature memory deallocation, enhancing memory management stability.
CubeCL Config
By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.
A typical cubecl.toml file might look like this:
[profiling]
logger = { level = "basic", stdout = true }
[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }
[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }
Each section configures a different aspect of CubeCL:
- profiling: Controls performance profiling and logging.
- autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
- compilation: Manages kernel compilation logging and cache.
For more info, check out the CubeCL book.
As with previous releases, this version includes various bug fixes, many internal optimizations, and backend upgrades that reinforce the framework's performance and flexibility across platforms.
Changelog
Breaking: the default stride(s) for pooling modules now match the kernel size instead of defaulting to strides of 1. This will affect output shapes if strides were not explicitly set.
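As a rough worked example (assuming the usual pooling output-size formula output = floor((input + 2 * padding - kernel_size) / stride) + 1 with no padding): a 1D input of length 8 with kernel_size = 2 yields floor((8 - 2) / 1) + 1 = 7 outputs under the old default stride of 1, but floor((8 - 2) / 2) + 1 = 4 outputs under the new default stride of 2. The diffs below show how to set the strides explicitly to keep the previous behavior.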
MaxPool2dConfig
let pool = MaxPool2dConfig::new(kernel_size)
+ .with_strides([1, 1])
.with_padding(PaddingConfig2d::Same)
.init();
MaxPool1dConfig
let pool = MaxPool1dConfig::new(kernel_size)
+ .with_stride(1)
.with_padding(PaddingConfig1d::Same)
.init();
AvgPool2dConfig
let pool = AvgPool2dConfig::new(kernel_size)
+ .with_strides([1, 1])
.with_padding(PaddingConfig2d::Same)
.init();
AvgPool1dConfig
let pool = AvgPool1dConfig::new(kernel_size)
+ .with_stride(1)
.with_padding(PaddingConfig1d::Same)
.init();
Module & Tensor
- Add tensor grid::meshgrid (#3107 #3191) @crutcher
- Add scalar tensor operations (#3127) @ArthurBrussee
- Orthogonal initialization (#3109) @dymat
- Support importing safetensors format (#2721) @wandbrandon @antimora
- Add burn::linalg norms (#3131) @crutcher
- Extract Linear.forward to nn::functional::linear (#3147) @crutcher
- Base impl of matmul for Int tensor (#3201) @crutcher
- (perf) generate_mask functions optimizations (#3203) @tafia
- Add CosineEmbeddingLoss module and cosine_similarity function (#3207) @antimora
- Tensor::slice_fill() (#3221 #3223) @crutcher
- Base impl of tensor.slice_dim(dim, range) (#3235) @crutcher
- Support shifting pre-computed RoPE values (#3275) @laggui
- Improve RoPE partial shift case (#3290) @laggui
- Add tensor.roll() and improve AsIndex (renamed IndexConversion) (#3281) @crutcher
- [Breaking] Update pooling default strides to match kernel size (#3338) @lucianyao
- Add is_finite tensor element wise op and fix is_close/all_close inf (#3341) @jonboh
Backends
- [Perf] Interpolate optimizations (#3077) @wingertge
- [Perf] Slice assign (#3069) @wingertge
- Add multi stage conv (#3105) @wingertge
- [Perf] Convolution migration to NHWC (#3090) @wingertge
- Merge different convolution dimensional kernels (#3115) @wingertge
- Support reduce mixed precision accumulation w/ fusion (#3132) @nathanielsimard
- Update remote backend (#3175) @Cielbird
- Feat/autotune optional (#3188) @nathanielsimard
- cubecl unit matmul (#3214) @louisfd
- Update CubeCL for client based profiling (#3222) @ArthurBrussee
- Update cubecl unit matmul double buffered (#3233) @louisfd
- Burn-remote to_device function (#3189) @Cielbird
- Add Drop operation for fusion (#3263) @nathanielsimard
- Lazy tensor downloading in burn-remote (#3276) @Cielbird
- Improve specialized matmul (#3304) @louisfd
- Add autotune priority (#3347 #3378) @nathanielsimard
- Fix local tuner deadlock (#3384) @nathanielsimard
- Fix fusion wasm unsafe input (#3385 #3386) @nathanielsimard
Bug Fixes
- Fix WASM deadlock by really properly not capturing locks (#3123) @ArthurBrussee
- Fix burn-cubecl with autotune disabled (#3141) @wingertge
- Fix fusion multiple reshapes (#3220) @nathanielsimard
- Fix/fusion multiple streams (#3297) @nathanielsimard
- Fix gather broadcasted indices in kernel impl and fusion (#3337) @laggui
- Fix rand interval (#3321) @laggui
- Restrict binary op lhs/rhs alias (#3349) @laggui
- Fix sum fallback when atomic add is not supported (#3369) @laggui
Documentation & Examples
- Update pytorch-model.md with a new troubleshooting help (#3081) @antimora
- Contributor example instructions (#3153) @AshAnand34
- Update README.md with DeepWiki badge (#3192) @antimora
- Add recursion_limit macro to getting started examples code (#3238) @Marc-AnthonyG
- KaTeX for Mathematical expressions in docstrings (#3278) @BhavyeMathur
- Add Metal backend support to custom-image-dataset (#3335 #3354) @TsaoLun
- Add link to license in README badge (#3356) @Olexandr88
Fixes
- Fix typo in Burn Book (#3113) @danny-burrows
- fix typos (#3186) @omahs
- Fix Typos in Documentation Comments (#3280) @leopardracer
- Fix typo in code documentation for BurnGraph codegen (#3286) @kilavvy
- Fix error messages from tensor checks for flatten (#3319) @NoVegetable
- Fix broken link to burn-tch (#3365) @dbdr
- Update documentation description for nonzero and nonzero_async (#3368) @catch-twenty-two
ONNX Support
- ONNX Import: switch to rank inferencing, rename shape to static_shape, decouple tensor shape info (#3037) @antimora
- Restrict ONNX opset to 16 and up (#3051) @antimora
- Allow Shape input type for Slice operation (#3092) @antimora
- Support onnx and, or & xor nodes (#3173) @tye-singwa
- Add support ONNX instance norm (#3177) @tye-singwa
- Onnx ceil & round (#3225) @tye-singwa
- Add support onnx group norm (#3245) @tye-singwa
- Add onnx SpaceToDepth / DepthToSpace (#3277) @tye-singwa
- Fix onnx topological sort check (#3284) @tye-singwa
- Add onnx ArgMin node (#3285) @tye-singwa
- Add support onnx size (#3301) @tye-singwa
- Support flexible backend selection for import tests (#3372 #3380) @lucianyao
- Fix ONNX node name sanitization and allow ai.onnx.ml domain (#3371) @antimora
Enhancements
- Replace some powf->powi (#3152) @ArthurBrussee
- Improve fusion compilation speed (#3155) @nathanielsimard
- Perf/remove repeat dim (#3183) @nathanielsimard
- Perf: Fusion search for composed optimization (#3258) @nathanielsimard
- Improve matmul selector (#3307 #3343 #3350 #3376) @nathanielsimard
Refactoring
- Refactor CubeCL slices (#3104) @nathanielsimard
- CubeCL init refactor (#3128) @nathanielsimard
- Refactor narrow, chunk and split (#3137) @laggui
- Refactor quantization scheme (#3042) @maxtremblay
- Migrated prng (random) to CubeCL (#3165 #3170) @Cielbird
- Break down test_onnx.rs into test subdirectories (#3144) @antimora
- Refactor: Move op_configuration.rs from burn-import to onnx-ir (#3126) @antimora
- Fix relative cmp + debug tools (#3197) @nathanielsimard
- Refactor cubecl line size matmul (#3219) @louisfd
- Absolute tolerance is too tight for strict/balanced/permissive (#3242) @laggui
- Fix clippy rust 1.88 and cargo run checks usage (#3325 #3320) @laggui
- Remove hip os cfg flags (#3336) @laggui
- Update cubecl matmul refactor / docs (#3366) @louisfd
Miscellaneous
- Fix conv2d test tolerance & disable crates cache on stable linux-std runner (#3114) @laggui
- Replace run-checks scripts with command alias (#3118) @laggui
- Relax tolerance transformer autoregressive test (ndarray failure) (#3143) @crutcher
- Add cubecl.toml config (#3150) @nathanielsimard
- Use CUBECL_DEBUG_OPTION=profile macos ci (#...
v0.17.1
Bug Fixes & Improvements
- Downgrade to zip 2.4.2 (fixes #3224) @laggui
- Fix non contiguous bug with comparison op (#3241) @nathanielsimard
- Fix/reduce fusion (#3172) @nathanielsimard
- Fix: fusion multi-block scalar index sharing (#3167) @nathanielsimard
- Fix to NdArray int_max_dim bug (#3140) @crutcher
- Make is_contiguous check common (#3083) @laggui
- Fix clamp min/max line size > 1 (#3078) @laggui
- Fix vectorization problem with fusion on reshaped not contiguous tensors (#3075) @nathanielsimard