Release v0.2.0 · pytorch/helion

What's Changed

Verify compiled kernels in subprocess by @jansel in #914
Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930
Support warp specialization on B200 by @oulgen in #935
Update README.md by @oulgen in #943
Register tile symbol origin, to support tile + offset use case in blackwell attention by @yf225 in #939
[CI] Print failed tests by @oulgen in #942
Update examples to use run_example by @jansel in #941
blackwell attn with triton attr set by @v0i0 in #918
Set static_shapes=True by @oulgen in #937
run.py env var to skip exception logging by @v0i0 in #946
Fix bug with unit sized dims and block_sizes by @jansel in #932
Update static_shapes docs by @jansel in #951
Add tile.count by @oulgen in #955
Auto detect low vram by @oulgen in #956
[CI] Use official PyTorch 2.9 by @oulgen in #962
Use interleaved_bench for run_example by @jansel in #945
Generalize tile_with_offset pass by @jansel in #949
Docstring updates by @jansel in #952
Import updates by @jansel in #953
Add missing environment variables to docs by @jansel in #957
Print out errors vs timeouts in autotuning status by @jansel in #960
Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
Exit autotuning faster on KeyboardInterrupt by @jansel in #963
Remove default settings by @jansel in #964
Add missing settings environment variables by @jansel in #965
Skip test_differential_evolution_search due to slowness by @jansel in #968
[Benchmark CI] Give nightly job permissions by @oulgen in #970
[Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
[Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
[blackwell attn example] qk scale as param by @v0i0 in #969
[Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
Add initial backwards compatibility tests by @oulgen in #958
Remove unrolling + warp spec by @PaulZhang12 in #967
[Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
[Benchmark] Fix tritonbench auto-installation by @yf225 in #980
[Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
Make fork default precompilation strategy by @oulgen in #979
[benchmarks] change tritonbench path by @xuzhao9 in #966
Add skipIfA10G decorator by @yf225 in #982
Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
Layer Norm bwd kernel to support large B*M case used by internal by @yf225 in #973
Fix timeouts in autotuning by @jansel in #985
Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
Remove extra debug log for timeouts by @jansel in #987
Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
Generalize test cases to support XPU by @EikanWang in #983
Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
Update README.md by @oulgen in #992
Update README.md by @oulgen in #993
Mamba2 Chunk Scan & State by @v0i0 in #950
Remove unrolling with tma + pipelining by @PaulZhang12 in #994
Add provenance annotations to output code by @jansel in #988
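
Several entries above mention environment variables (HELION_FORCE_AUTOTUNE, HELION_AUTOTUNE_IGNORE_ERRORS, HELION_AUTOTUNE_PRECOMPILE). As a rough sketch of how they might be combined when launching a script, the helper below builds an environment mapping without importing Helion. The helper name `autotune_env` is hypothetical, and using "1" as the enabling value for the two boolean flags is an assumption, not something these notes confirm. Only the `spawn` value for HELION_AUTOTUNE_PRECOMPILE appears in the entries above. Check the Helion settings documentation for the accepted values.

```python
import os


def autotune_env(base=None, *, force=False, ignore_errors=False, precompile=None):
    """Return a copy of an environment mapping with autotuning variables set.

    Hypothetical helper for illustration only. The variable names are taken
    from the release notes; the value "1" for the boolean flags is an assumption.
    """
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base is None else base)
    if force:
        # Per the notes, forcing autotuning also skips the cache (#930).
        env["HELION_FORCE_AUTOTUNE"] = "1"  # assumed value
    if ignore_errors:
        # Variable added in #961.
        env["HELION_AUTOTUNE_IGNORE_ERRORS"] = "1"  # assumed value
    if precompile is not None:
        # "spawn" is the value suggested in #984 when an IMA happens.
        env["HELION_AUTOTUNE_PRECOMPILE"] = precompile
    return env


if __name__ == "__main__":
    env = autotune_env({}, force=True, precompile="spawn")
    for name, value in sorted(env.items()):
        print(f"{name}={value}")
```

The resulting mapping could be passed as the `env` argument of `subprocess.run` when starting a script that uses Helion kernels, which keeps the settings scoped to that one process.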

Full Changelog: v0.1.8...v0.2.0
