What's Changed
- remove is_padded check by @louisfd in #988
- Fix plane matmul selection & reduce workgroup invocations by @laggui in #989
- bump 0.7.1 by @louisfd in #992
- Fix/misc/release 07 by @louisfd in #994
- Bump cubecl to version 0.9.0 by @laggui in #996
- perf: Separate batch layout from main global layout to allow prefetching batch offset by @wingertge in #991
- opt: Add automatic unrolling of unit loops by @wingertge in #986
- opt: Make GVN side-effect free and assume loops are executed at least once by @wingertge in #985
- Attention: some test refactoring by @louisfd in #999
- fix: Fix `ConcreteOutputFactory` implementation in convolution by @wingertge in #998
- Flash Attention: Unit Attention by @louisfd in #1002
- ci: check version and use tracel action and xtask to publish by @syl20bnr in #995
- Fix/memory usage by @nathanielsimard in #1001
- feat: Allow `mma` matmul to be selected by @wingertge in #1003
- refactor: TMA checks by @wingertge in #1006
- feat: Update tune key to enable safely tuning TMA algorithms by @wingertge in #1007
- feat: Auto-detect CUDA version and fix some 12.8 features by @wingertge in #1008
- feat: Granular math mode by @wingertge in #1000
- Define numeric types by @nathanielsimard in #1009
- refactor: Read Strategy by @wingertge in #1010
- Fix cudarc feature flags for no default-features by @laggui in #1011
- Flash attention: unit & accelerated working attentions + fix partitions by @louisfd in #1012
- Flash attention: batch and num heads by @louisfd in #1014
- Flash Attention: bench by @louisfd in #1016
- feat: Specialized matmul using barriers by @wingertge in #1015
- Refactor/dtype by @nathanielsimard in #1017
- Disable/mma/amd by @nathanielsimard in #1020
- Add TMA checks before launch by @laggui in #1021
- Set pre-release version by @nathanielsimard in #1023
- Ci/disable version check by @nathanielsimard in #1024
- Flash Attention: Transpose key later by @louisfd in #1022
- Define many by @nathanielsimard in #1025
- Flash Attention: refactor dtypes by @louisfd in #1026
- Flash Attention: fix logical mask bug when kv partition > 1 by @louisfd in #1027
- Matmul: readers & jobs generic only on global and stage types by @louisfd in #1028
- Fix conv tests by @louisfd in #1029
- Fix remainder int by @laggui in #1033
- feat: Implement `ldmatrix` and refactor manual mma args by @wingertge in #1018
- Feat/pinned mem by @nathanielsimard in #1030
- Fix no std file by @nathanielsimard in #1037
- fix: Fix issue with ldmatrix address conversion by @wingertge in #1038
- feat: Swizzled shared memory by @wingertge in #1035
- feat: add trigonometric functions by @relativityhd in #861
- Fix SPIR-V signed int remainder semantics by @laggui in #1036
- fix: Fix MMA on HIP by @wingertge in #1039
- Fix deadlock when copy by @nathanielsimard in #1041
- Fix: invalid tile size by @nathanielsimard in #1043
- fix: Feature gate fast tanh by @wingertge in #1045
- Bump version by @nathanielsimard in #1054
- perf: MMA line size by @wingertge in #1044
- Fix `fma` so it's callable from `#[cube]` functions by @amfaber in #1049
- Treat Operation::Copy as an implicit cast as well on the CPU backend by @amfaber in #1050
- Ensure we propagate the return index when a return block is folded into an existing block by @amfaber in #1051
- Improve compilation time for Burn by @nathanielsimard in #1055
- Matmul: Major config refactor by @louisfd in #1042
- Kernels: some cleanup by @louisfd in #1058
- Fix: assertion for line_size in naive.rs (#1046) by @PulsarUnderscore in #1047
- fix: Fix writer stage size that was broken during the config migration by @wingertge in #1061
- Fix/autotuner by @nathanielsimard in #1062
- Chore: Prepare pre-version 0.9.0-pre.3 by @nathanielsimard in #1066
- Fix workgroup_id typo in book by @BenFradet in #1060
- fix: Fix composite merge pass with mutable values by @wingertge in #1063
- feat: Implement stmatrix and stage casting to support it by @wingertge in #1056
- refactor: Scalars by @wingertge in #1064
- Feat/event bus by @nathanielsimard in #950
- Flash Attention: use loader from matmul + fix sync bug by @louisfd in #1067
- refactor: Move `Runtime` to `cubecl-runtime` by @wingertge in #1068
- Flash Attention: vectorized query + fix metal wmma load from global memory + fix main compilation by @louisfd in #1069
- Flash Attention: fix all-masked rows by @louisfd in #1070
- Enable tuner name by @nathanielsimard in #1071
- Flash attention: lines for mask and value by @louisfd in #1072
- Flash Attention: all lines by @louisfd in #1073
- Flash Attention: test and fix f16 by @louisfd in #1074
- Feat/execution error by @nathanielsimard in #1075
- Flash Attention: strengthen test suite by @louisfd in #1077
- Feat/runtime error by @nathanielsimard in #1078
- Flash Attention: a bit of selector and enable unit attention for burn by @louisfd in #1079
- Fix missing parenthesis on .into call by @BjornTheProgrammer in #1080
- feat: Rewrite async loaders to make them actually useful by @wingertge in #1076
- Set pre-release by @nathanielsimard in #1082
- Feat/improve errors by @nathanielsimard in #1084
- refactor: Convolution by @wingertge in #1083
- Fix/stuff by @nathanielsimard in #1085
- fix: Fix line size selection for convolution by @wingertge in #1086
- Feat/validation error by @nathanielsimard in #1088
- .gitignore editors/ides (cloned from burn) by @crutcher in #1089
- feat: Shared values by @wingertge in #1090
- Flash Attention: Blueprint by @louisfd in #1087
- Migrate kernels part 1 by @nathanielsimard in #1095
- Bump pre-release by @nathanielsimard in #1096
- Remove code by @nathanielsimard in #1097
- Feat/cpu scheduler by @nathanielsimard in #1098
- Fix/no std runtime by @nathanielsimard in #1100
- Fix unused pattern by @crutcher in #1091
- refactor: Tensor map by @wingertge in #1099
- feat: Add zero-copy Bytes support via bytes::Bytes allocator by @antimora in #1093
- add epsilon for type by @louisfd in #1101
- Fix try_into_vec for SharedBytesAllocationController by @antimora in #1102
- Various fix by @nathanielsimard in #1104
- fix: Fix CUDA < 12.8 branch for im2colWide by @wingertge in #1103
- Support const in match patterns for #[cube] macro by @sepcnt in #1105
- Add unaligned_line_read and unaligned_line_write as cpu-only cubecl extensions by @amfaber in #1052
- CPU Runtime: Fix wrong limits by @nathanielsimard in #1107
- Add `tracing` instrumentation and dependencies across crates by @crutcher in #1106
- Add support for hypot and reciprocal hypot. by @vaijira in #1048
- Plane non uniform control flow feature by @nathanielsimard in #1108
- Refactor/cube dim by @nathanielsimard in #1111
- fix: Fixes various issues with `--all-features` builds by @wingertge in #1110
- Chore/pre release 6 by @nathanielsimard in #1114
- fix: Fix invalid phi nodes from partially destructured array values by @wingertge in #1118
- Deep-plumb `tracing` feature. by @crutcher in #1116
- 2D into contiguous by @nathanielsimard in #1113
- fix: Fix free handling for CPU backend by @wingertge in #1123
- Update readme link to matmul crate by @milesfrain in #1122
- Add inverse trigonometric and hyperbolic trait impls for `Line` by @ravituringworks in #1125
- refactor: Constants by @wingertge in #1121
- Gfx12 support by @marion-santiago in #1126
- fix: async mem not available in vgpu on some cuda versions by @Na1w in #1128
- Fix atan2 line by @laggui in #1130
- Feat/comptime device props by @nathanielsimard in #1129
- feat: `usize` indexing by @wingertge in #1127
- Enable `#[test_log::test]` support; add PERFORMANCE.md doc. by @crutcher in #1132
- refactor: Close gaps between CubeCL and standard Rust by @wingertge in #1131
- Lift/Test `valid_strides` into a layout validation lib. by @crutcher in #1133
- chore: update xtask to 4.9.0 by @syl20bnr in #1135
- Docs/Debug-Checks/Cleaner flow for write_to_{gpu,cpu}. by @crutcher in #1137
- fix: Store packed dimension on packed quant stores so it can be correctly handled for swapped tensors by @wingertge in #1134
- fix: Fix broken PR #1133 by @wingertge in #1140
- Fix plane count check with `max_units_per_cube` by @laggui in #1142
- Pre version by @nathanielsimard in #1144