What's Changed
- remove is_padded check by @louisfd in #988
- Fix plane matmul selection & reduce workgroup invocations by @laggui in #989
- bump 0.7.1 by @louisfd in #992
- Fix/misc/release 07 by @louisfd in #994
- Bump cubecl to version 0.9.0 by @laggui in #996
- perf: Separate batch layout from main global layout to allow prefetching batch offset by @wingertge in #991
- opt: Add automatic unrolling of unit loops by @wingertge in #986
- opt: Make GVN side-effect free and assume loops are executed at least once by @wingertge in #985
- Attention: some test refactoring by @louisfd in #999
- fix: Fix `ConcreteOutputFactory` implementation in convolution by @wingertge in #998
- Flash Attention: Unit Attention by @louisfd in #1002
- ci: check version and use tracel action and xtask to publish by @syl20bnr in #995
- Fix/memory usage by @nathanielsimard in #1001
- feat: Allow `mma` matmul to be selected by @wingertge in #1003
- refactor: TMA checks by @wingertge in #1006
- feat: Update tune key to enable safely tuning TMA algorithms by @wingertge in #1007
- feat: Auto-detect CUDA version and fix some 12.8 features by @wingertge in #1008
- feat: Granular math mode by @wingertge in #1000
- Define numeric types by @nathanielsimard in #1009
- refactor: Read Strategy by @wingertge in #1010
- Fix cudarc feature flags for no default-features by @laggui in #1011
- Flash attention: unit & accelerated working attentions + fix partitions by @louisfd in #1012
- Flash attention: batch and num heads by @louisfd in #1014
- Flash Attention: bench by @louisfd in #1016
- feat: Specialized matmul using barriers by @wingertge in #1015
- Refactor/dtype by @nathanielsimard in #1017
- Disable/mma/amd by @nathanielsimard in #1020
- Add TMA checks before launch by @laggui in #1021
- Set pre-release version by @nathanielsimard in #1023
- Ci/disable version check by @nathanielsimard in #1024
- Flash Attention: Transpose key later by @louisfd in #1022
- Define many by @nathanielsimard in #1025
- Flash Attention: refactor dtypes by @louisfd in #1026
- Flash Attention: fix logical mask bug when kv partition > 1 by @louisfd in #1027
- Matmul: readers & jobs generic only on global and stage types by @louisfd in #1028
- Fix conv tests by @louisfd in #1029
- Fix remainder int by @laggui in #1033
- feat: Implement `ldmatrix` and refactor manual mma args by @wingertge in #1018
- Feat/pinned mem by @nathanielsimard in #1030
- Fix no std file by @nathanielsimard in #1037
- fix: Fix issue with ldmatrix address conversion by @wingertge in #1038
- feat: Swizzled shared memory by @wingertge in #1035
- feat: add trigonometric functions by @relativityhd in #861
- Fix SPIR-V signed int remainder semantics by @laggui in #1036
- fix: Fix MMA on HIP by @wingertge in #1039
- Fix deadlock when copy by @nathanielsimard in #1041
- Fix: invalid tile size by @nathanielsimard in #1043
- fix: Feature gate fast tanh by @wingertge in #1045
- Bump version by @nathanielsimard in #1054
- perf: MMA line size by @wingertge in #1044
- Fix `fma` so it's callable from `#[cube]` functions by @amfaber in #1049
- Treat Operation::Copy as an implicit cast as well on the CPU backend by @amfaber in #1050
- Ensure we propagate the return index when a return block is folded into an existing block by @amfaber in #1051
- Improve compilation time for Burn by @nathanielsimard in #1055
- Matmul: Major config refactor by @louisfd in #1042
- Kernels: some cleanup by @louisfd in #1058
- Fix: assertion for line_size in naive.rs (#1046) by @PulsarUnderscore in #1047
- fix: Fix writer stage size that was broken during the config migration by @wingertge in #1061
- Fix/autotuner by @nathanielsimard in #1062
- Chore: Prepare pre-version 0.9.0-pre.3 by @nathanielsimard in #1066
- Fix workgroup_id typo in book by @BenFradet in #1060
- fix: Fix composite merge pass with mutable values by @wingertge in #1063
- feat: Implement stmatrix and stage casting to support it by @wingertge in #1056
- refactor: Scalars by @wingertge in #1064
- Feat/event bus by @nathanielsimard in #950
- Flash Attention: use loader from matmul + fix sync bug by @louisfd in #1067
- refactor: Move `Runtime` to `cubecl-runtime` by @wingertge in #1068
- Flash Attention: vectorized query + fix metal wmma load from global memory + fix main compilation by @louisfd in #1069
- Flash Attention: fix all-masked rows by @louisfd in #1070
- Enable tuner name by @nathanielsimard in #1071
- Flash attention: lines for mask and value by @louisfd in #1072
- Flash Attention: all lines by @louisfd in #1073
- Flash Attention: test and fix f16 by @louisfd in #1074
- Feat/execution error by @nathanielsimard in #1075
- Flash Attention: strengthen test suite by @louisfd in #1077
- Feat/runtime error by @nathanielsimard in #1078
- Flash Attention: a bit of selector and enable unit attention for burn by @louisfd in #1079
- Fix missing parenthesis on .into call by @BjornTheProgrammer in #1080
- feat: Rewrite async loaders to make them actually useful by @wingertge in #1076
- Set pre-release by @nathanielsimard in #1082
- Feat/improve errors by @nathanielsimard in #1084
- refactor: Convolution by @wingertge in #1083
- Fix/stuff by @nathanielsimard in #1085
- fix: Fix line size selection for convolution by @wingertge in #1086
- Feat/validation error by @nathanielsimard in #1088
- .gitignore editors/ides (cloned from burn) by @crutcher in #1089
- feat: Shared values by @wingertge in #1090
- Flash Attention: Blueprint by @louisfd in #1087
- Migrate kernels part 1 by @nathanielsimard in #1095
- Bump pre-release by @nathanielsimard in #1096
- Remove code by @nathanielsimard in #1097
- Feat/cpu scheduler by @nathanielsimard in #1098
- Fix/no std runtime by @nathanielsimard in #1100
- Fix unused pattern by @crutcher in #1091
- refactor: Tensor map by @wingertge in #1099
- feat: Add zero-copy Bytes support via bytes::Bytes allocator by @antimora in #1093
- add epsilon for type by @louisfd in #1101
- Fix try_into_vec for SharedBytesAllocationController by @antimora in #1102
- Various fix by @nathanielsimard in #1104
- fix: Fix CUDA < 12.8 branch for im2colWide by @wingertge in #1103
- Support const in match patterns for #[cube] macro by @sepcnt in #1105
- Add unaligned_line_read and unaligned_line_write as cpu-only cubecl extensions by @amfaber in #1052
- CPU Runtime: Fix wrong limits by @nathanielsimard in #1107
- Add `tracing` instrumentation and dependencies across crates by @crutcher in #1106
- Add support for hypot and reciprocal hypot. by @vaijira in #1048
- Plane non uniform control flow feature by @nathanielsimard in #1108
- Refactor/cube dim by @nathanielsimard in #1111
- fix: Fixes various issues with `--all-features` builds by @wingertge in #1110
- Chore/pre release 6 by @nathanielsimard in #1114
- fix: Fix invalid phi nodes from partially destructured array values by @wingertge in #1118
- Deep-plumb `tracing` feature. by @crutcher in #1116
- 2D into contiguous by @nathanielsimard in #1113
- fix: Fix free handling for CPU backend by @wingertge in #1123
- Update readme link to matmul crate by @milesfrain in #1122
- Add inverse trigonometric and hyperbolic trait impls for `Line` by @ravituringworks in #1125
- refactor: Constants by @wingertge in #1121
- Gfx12 support by @marion-santiago in #1126
- fix: async mem not available in vgpu on some cuda versions by @Na1w in #1128
- Fix atan2 line by @laggui in #1130
- Feat/comptime device props by @nathanielsimard in #1129
- feat: `usize` indexing by @wingertge in #1127
- Enable `#[test_log::test]` support; add PERFORMANCE.md doc. by @crutcher in #1132
- refactor: Close gaps between CubeCL and standard Rust by @wingertge in #1131
- Lift/Test `valid_strides` into a layout validation lib. by @crutcher in #1133
- chore: update xtask to 4.9.0 by @syl20bnr in #1135
- Docs/Debug-Checks/Cleaner flow for write_to_{gpu,cpu}. by @crutcher in #1137
- fix: Store packed dimension on packed quant stores so it can be correctly handled for swapped tensors by @wingertge in #1134
- fix: Fix broken PR #1133 by @wingertge in #1140
- Fix plane count check with `max_units_per_cube` by @laggui in #1142
- Pre version by @nathanielsimard in #1144