
Releases: tracel-ai/burn

v0.20.0-pre.6 (Pre-release)

18 Dec 21:27 · 91dd62c

What's Changed

v0.20.0-pre.5 (Pre-release)

08 Dec 14:53 · 42edc63

What's Changed

v0.20.0-pre.4 (Pre-release)

01 Dec 19:15

What's Changed

v0.20.0-pre.3 (Pre-release)

24 Nov 17:37 · 88d662d

What's Changed

v0.20.0-pre.2 (Pre-release)

17 Nov 15:26 · cc0f22a

What's Changed

v0.20.0-pre.1 (Pre-release)

11 Nov 15:35

Summary

This release includes significant performance improvements, bug fixes, and architectural refactoring.
Key Improvements:

  • TMA autotuning and MMA matmul tuning enabled for better performance
  • ONNX-IR refactored to an op/node-centric architecture
  • IR refactored to define outputs as a function of the operation

Bug Fixes:

  • Fixed autodiff graph cleanup issues (multiple fixes for deferred/consumed nodes)
  • Fixed Linear layer panic when output size is one
  • Fixed PyTorch pickle reader regression with integer dict keys
  • Fixed RoPE sum_dim calculation
  • Fixed tensor *_like dtype preservation
  • Fixed squeeze check for D2 > 0
  • Fixed QLinear implementation
  • Fixed async barrier & TMA checks

New Features:

  • Added matvec operation
  • Added support for custom learning strategies
  • Added Candle device seeding
  • Added Shape::ravel_index for row-major raveling (see the sketch after this list)
  • Generalized linalg::outer semantics with new linalg::outer_dim
  • Implemented error handling for DataError
  • Added square() optimization where appropriate
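
Shape::ravel_index refers to row-major raveling, i.e. collapsing a multi-dimensional index into a single flat offset. As a concept-only sketch in plain Rust (not Burn's actual signature):

// Concept sketch of row-major raveling; not Burn's API.
fn ravel_index(indices: &[usize], shape: &[usize]) -> usize {
    indices
        .iter()
        .zip(shape)
        .fold(0, |flat, (&i, &dim)| flat * dim + i)
}

// For shape [3, 4], index [1, 2] maps to 1 * 4 + 2 = 6.
assert_eq!(ravel_index(&[1, 2], &[3, 4]), 6);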

v0.19.1

06 Nov 16:18

Bug Fixes & Improvements

v0.19.0

28 Oct 17:00

Summary

This release brings major improvements to enable efficient distributed training, quantization, and CPU support in Burn.

To achieve true multi-GPU parallelism, we had to rethink several core systems: we implemented multi-stream execution to keep all GPUs busy, optimized device transfers to avoid unnecessary synchronization, and redesigned our locking strategies to eliminate bottlenecks in autotuning, fusion, and autodiff. We also introduced burn-collective for gradient synchronization and refactored our training loop to support different distributed training strategies.

Additionally, we added comprehensive quantization support, allowing models to use significantly less memory while maintaining performance through fused dequantization and optimized quantized operations.

Finally, we introduced a new CPU backend powered by MLIR and LLVM, bringing the same JIT compilation, autotuning, and fusion capabilities from our GPU backends to CPU execution.

As with previous releases, this version includes various bug fixes, further optimizations and enhanced documentation. Support for ONNX models has also been expanded, with additional operators and bug fixes for better operator coverage.

For more details, check out the release post on our website.

Changelog

Breaking

We've introduced a couple of breaking API changes with this release. The affected interfaces are detailed in the sections below.

Learning Strategy

We refactored the Learner to support better distributed training strategies. Instead of registering a list of device(s), you now specify a training strategy.

  let learner = LearnerBuilder::new(artifact_dir)
      .metric_train_numeric(AccuracyMetric::new())
      .metric_valid_numeric(AccuracyMetric::new())
      .metric_train_numeric(LossMetric::new())
      .metric_valid_numeric(LossMetric::new())
      .with_file_checkpointer(CompactRecorder::new())
-     .devices(vec![device.clone()])
+     .learning_strategy(LearningStrategy::SingleDevice(device.clone()))
      .num_epochs(config.num_epochs)
      .summary()
      .build(
          config.model.init::<B>(&device),
          config.optimizer.init(),
          config.learning_rate,
      );

Learner Training Result

The Learner previously lacked an evaluation loop. We extended its return type to include all training states in a TrainingResult, which includes the trained model and a metrics renderer.

- let model_trained = learner.fit(dataloader_train, dataloader_valid);
+ let result = learner.fit(dataloader_train, dataloader_valid);

- model_trained
+ result
+    .model
     .save_file(format!("{artifact_dir}/model"), &CompactRecorder::new())
     .expect("Trained model should be saved successfully");

This enables the renderer to be reused by the new evaluator so that training and evaluation metrics appear together in the TUI dashboard:

let mut renderer = result.renderer;
let evaluator = EvaluatorBuilder::new(artifact_dir)
    .renderer(renderer)
    .metrics((AccuracyMetric::new(), LossMetric::new()))
    .build(result.model.clone());

evaluator.eval(name, dataloader_test);

Interface Changes

Config

The Config trait now requires Debug:

- #[derive(Config)]
+ #[derive(Config, Debug)]
  pub struct TrainingConfig {
      // ...
  }

BatchNorm

BatchNorm no longer requires the spatial dimension generic:

  #[derive(Module, Debug)]
  pub struct ConvBlock<B: Backend> {
      conv: nn::conv::Conv2d<B>,
-     norm: BatchNorm<B, 2>,
+     norm: BatchNorm<B>,
      pool: Option<MaxPool2d>,
      activation: nn::Relu,
  }

Backend::seed

Seeding is now device-specific:

- B::seed(seed);
+ B::seed(&device, seed);
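
For example, a minimal sketch with the NdArray backend (the concrete backend and its Default device are illustrative choices):

use burn::backend::NdArray;
use burn::tensor::backend::Backend;

// Seed the RNG of one specific device rather than globally.
let device = Default::default();
<NdArray as Backend>::seed(&device, 42);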

Tensor

For consistency with other methods like unsqueeze() / unsqueeze_dim(dim), squeeze(dim) was renamed:

- tensor.squeeze(dim)
+ tensor.squeeze_dim(dim)

We've also added a tensor.squeeze() method which squeezes all singleton dimensions.
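
A minimal sketch of both methods (the output-rank turbofish is an assumption about the const-generic API, and the shapes are for illustration only):

let x = Tensor::<B, 4>::ones([1, 3, 1, 5], &device);
let y = x.clone().squeeze_dim::<3>(0); // removes dim 0 -> shape [3, 1, 5]
let z = x.squeeze::<2>();              // removes all singleton dims -> shape [3, 5]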

Finally, we removed tensor ^ T syntax, which was clunky.

- use burn::tensor::T;
- tensor ^ T
+ tensor.t()

tensor.t() is also a simple alias for tensor.transpose().

Module & Tensor

Datasets & Training

Backends

Bug Fixes


v0.18.0

18 Jul 16:27 · f5d889d

Summary

This release marks a significant step forward in performance, reliability, and optimization, delivering a more robust and efficient system for our users. We've expanded our CI testing suite to cover multi-threading, lazy evaluation, and async execution issues, ensuring robust performance across a growing number of supported platforms.

Matrix Multiplication Improvements

Optimized matrix multiplication kernels with specialized implementations for:

  • Matrix-vector (mat@vec)
  • Vector-matrix (vec@mat)
  • Inner product
  • Outer product

We also enhanced the flexibility of the matrix multiplication kernel generation engine, surpassing traditional GEMM (General Matrix Multiply) approaches.
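
All of these paths are selected through the regular matmul call; here is a minimal sketch with shapes chosen to exercise the matrix-vector case (the Wgpu backend and default device are illustrative):

use burn::backend::Wgpu;
use burn::tensor::Tensor;

let device = Default::default();
let a = Tensor::<Wgpu, 2>::ones([128, 64], &device); // matrix
let b = Tensor::<Wgpu, 2>::ones([64, 1], &device);   // column vector
let c = a.matmul(b); // shapes like these can hit the specialized mat@vec path
assert_eq!(c.dims(), [128, 1]);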

For more details, including performance benchmarks, check out our state-of-the-art multiplatform matrix multiplication post.

Fusion Enhancements

  • Improved reliability and performance of Burn Fusion through advanced optimizations.
  • Added support for basic dead code elimination.
  • Introduced a new search engine that optimally reorders operations to maximize optimization opportunities, improving resilience to tensor operation ordering.

Multi-Threading and Memory Management

  • Resolved critical multi-threading issues by adopting a new approach to support multiple concurrent streams.
  • Burn Fusion's lazy evaluation of registered operations across concurrent streams now places greater demands on memory management. To address this:
    • Implemented a robust memory leak test in our CI pipeline to verify the runtime's internal state, ensuring all handles and concurrent streams are properly cleaned up in all test cases.
    • Fixed bugs related to premature memory deallocation, enhancing memory management stability.

CubeCL Config

By default, CubeCL loads its configuration from a TOML file (cubecl.toml or CubeCL.toml) located in your current directory or any parent directory. If no configuration file is found, CubeCL falls back to sensible defaults.

A typical cubecl.toml file might look like this:

[profiling]
logger = { level = "basic", stdout = true }

[autotune]
level = "balanced"
logger = { level = "minimal", stdout = true }

[compilation]
logger = { level = "basic", file = "cubecl.log", append = true }

Each section configures a different aspect of CubeCL:

  • profiling: Controls performance profiling and logging.
  • autotune: Configures the autotuning system, which benchmarks and selects optimal kernel parameters.
  • compilation: Manages kernel compilation logging and cache.

For more info, check out the CubeCL book.

As with previous releases, this version includes various bug fixes, many internal optimizations, and backend upgrades that reinforce the framework's performance and flexibility across platforms.

Changelog

Breaking: the default strides for pooling modules now match the kernel size instead of defaulting to a stride of 1. This will change output shapes wherever strides were not set explicitly (see the output-length sketch after the snippets below).

MaxPool2dConfig

  let pool = MaxPool2dConfig::new(kernel_size)
+     .with_strides([1, 1])
      .with_padding(PaddingConfig2d::Same)
      .init();

MaxPool1dConfig

  let pool = MaxPool1dConfig::new(kernel_size)
+     .with_stride(1)
      .with_padding(PaddingConfig1d::Same)
      .init();

AvgPool2dConfig

  let pool = AvgPool2dConfig::new(kernel_size)
+     .with_strides([1, 1])
      .with_padding(PaddingConfig2d::Same)
      .init();

AvgPool1dConfig

  let pool = AvgPool1dConfig::new(kernel_size)
+     .with_stride(1)
      .with_padding(PaddingConfig1d::Same)
      .init();
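
To see why this changes output shapes, here is a quick sketch of the output-length arithmetic (assuming no padding and no dilation):

// Output length of a 1D pooling window over `input` elements.
fn pooled_len(input: usize, kernel: usize, stride: usize) -> usize {
    (input - kernel) / stride + 1
}

assert_eq!(pooled_len(8, 2, 1), 7); // previous default: stride 1
assert_eq!(pooled_len(8, 2, 2), 4); // new default: stride = kernel size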

Module & Tensor

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous

  • Fix conv2d test tolerance & disable crates cache on stable linux-std runner (#3114) @laggui
  • Replace run-checks scripts with command alias (#3118) @laggui
  • Relax tolerance transformer autoregressive test (ndarray failure) (#3143) @crutcher
  • Add cubecl.toml config (#3150) @nathanielsimard
  • Use CUBECL_DEBUG_OPTION=profile macos ci (#...

v0.17.1

03 Jun 13:07

Bug Fixes & Improvements