v0.2.0
What's Changed
- Verify compiled kernels in subprocess by @jansel in #914
- Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
- Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930
- Support warp specialization on B200 by @oulgen in #935
- Update README.md by @oulgen in #943
- Register tile symbol origin, to support tile + offset use case in blackwell attention by @yf225 in #939
- [CI] Print failed tests by @oulgen in #942
- Update examples to use run_example by @jansel in #941
- blackwell attn with triton attr set by @v0i0 in #918
- Set static_shapes=True by @oulgen in #937
- run.py env var to skip exception logging by @v0i0 in #946
- Fix bug with unit sized dims and block_sizes by @jansel in #932
- Update static_shapes docs by @jansel in #951
- Add tile.count by @oulgen in #955
- Auto detect low vram by @oulgen in #956
- [CI] Use official PyTorch 2.9 by @oulgen in #962
- Use interleaved_bench for run_example by @jansel in #945
- Generalize tile_with_offset pass by @jansel in #949
- Docstring updates by @jansel in #952
- Import updates by @jansel in #953
- Add missing environment variables to docs by @jansel in #957
- Print out errors vs timeouts in autotuning status by @jansel in #960
- Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
- Exit autotuning faster on KeyboardInterrupt by @jansel in #963
- Remove default settings by @jansel in #964
- Add missing settings environment variables by @jansel in #965
- Skip test_differential_evolution_search due to slowness by @jansel in #968
- [Benchmark CI] Give nightly job permissions by @oulgen in #970
- [Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
- [Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
- [blackwell attn example] qk scale as param by @v0i0 in #969
- [Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
- Add initial backwards compatibility tests by @oulgen in #958
- Remove unrolling + warp spec by @PaulZhang12 in #967
- [Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
- [Benchmark] Fix tritonbench auto-installation by @yf225 in #980
- [Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
- Make fork default precompilation strategy by @oulgen in #979
- [benchmarks] change tritonbench path by @xuzhao9 in #966
- Add skipIfA10G decorator by @yf225 in #982
- Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
- Layer Norm bwd kernel to support large B*M case used by internal by @yf225 in #973
- Fix timeouts in autotuning by @jansel in #985
- Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
- Remove extra debug log for timeouts by @jansel in #987
- Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
- Generalize test cases to support XPU by @EikanWang in #983
- Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
- Update README.md by @oulgen in #992
- Update README.md by @oulgen in #993
- Mamba2 Chunk Scan & State by @v0i0 in #950
- Remove unrolling with tma + pipelining by @PaulZhang12 in #994
- Add provenance annotations to output code by @jansel in #988
Full Changelog: v0.1.8...v0.2.0