Releases: pytorch/helion
v0.2.1
What's Changed
- No autotuning on block_ptr if tma is available by @PaulZhang12 in #997
- Add reps for benchmarking stability by @PaulZhang12 in #999
- Prioritize outermost loop for warp spec by @PaulZhang12 in #1000
- Add backward pass for softmax kernel by @karthickai in #744
- Fix linter in softmax by @oulgen in #1003
- Fix test_examples.expected by @oulgen in #1002
- Beef up caching tests by @oulgen in #1001
- Add HELION_ASSERT_CACHE_HIT to debug/explain cache miss by @oulgen in #1006
- Better error message for calling Helion kernel from another kernel by @yf225 in #1008
- Assert that we are cache hitting on the CI by @oulgen in #1007
- Always raise `FailedToUnpackTile` when `for tile_m, tile_d in hl.tile(m, d)` is used by @yf225 in #1009
- Adding demo for running softmax kernel on Google Colab by @choijon5 in #944
- int4 gemm accurate baselines by @PaulZhang12 in #1010
- Add sitemap xml by @sekyondaMeta in #1013
- [helion] backward support for swiglu by @shunting314 in #756
- Raise informative error when `hl.dot` with 3D inputs have batch dim mismatch by @yf225 in #1012
- [CI] Fix AMD journal check errors by @yf225 in #1016
- Support `breakpoint()` in device code when interpret mode is on by @yf225 in #1020
- Sort requirements file by @oulgen in #1021
- Better type checking for eviction policies by @oulgen in #1024
- Bump linter versions by @jansel in #1018
- Garbage collect expected results by @jansel in #1017
- Make indexing choice a list by @oulgen in #1025
- [Docs] Add list of indexing autotuning docs by @oulgen in #1027
- Make store indexing also individually tunable by @oulgen in #1028
New Contributors
- @shunting314 made their first contribution in #756
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- Verify compiled kernels in subprocess by @jansel in #914
- Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
- Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930
- Support warp specialization on B200 by @oulgen in #935
- Update README.md by @oulgen in #943
- Register tile symbol origin, to support `tile + offset` use case in blackwell attention by @yf225 in #939
- [CI] Print failed tests by @oulgen in #942
- Update examples to use run_example by @jansel in #941
- blackwell attn with triton attr set by @v0i0 in #918
- Set static_shapes=True by @oulgen in #937
- run.py env var to skip exception logging by @v0i0 in #946
- Fix bug with unit sized dims and block_sizes by @jansel in #932
- Update static_shapes docs by @jansel in #951
- Add tile.count by @oulgen in #955
- Auto detect low vram by @oulgen in #956
- [CI] Use official PyTorch 2.9 by @oulgen in #962
- Use interleaved_bench for run_example by @jansel in #945
- Generalize tile_with_offset pass by @jansel in #949
- Docstring updates by @jansel in #952
- Import updates by @jansel in #953
- Add missing environment variables to docs by @jansel in #957
- Print out errors vs timeouts in autotuning status by @jansel in #960
- Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
- Exit autotuning faster on KeyboardInterrupt by @jansel in #963
- Remove default settings by @jansel in #964
- Add missing settings environment variables by @jansel in #965
- Skip test_differential_evolution_search due to slowness by @jansel in #968
- [Benchmark CI] Give nightly job permissions by @oulgen in #970
- [Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
- [Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
- [blackwell attn example] qk scale as param by @v0i0 in #969
- [Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
- Add initial backwards compatibility tests by @oulgen in #958
- Remove unrolling + warp spec by @PaulZhang12 in #967
- [Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
- [Benchmark] Fix tritonbench auto-installation by @yf225 in #980
- [Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
- Make fork default precompilation strategy by @oulgen in #979
- [benchmarks] change tritonbench path by @xuzhao9 in #966
- Add skipIfA10G decorator by @yf225 in #982
- Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
- Layer Norm bwd kernel to support large B*M case used by internal by @yf225 in #973
- Fix timeouts in autotuning by @jansel in #985
- Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
- Remove extra debug log for timeouts by @jansel in #987
- Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
- Generalize test cases to support XPU by @EikanWang in #983
- Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
- Update README.md by @oulgen in #992
- Update README.md by @oulgen in #993
- Mamba2 Chunk Scan & State by @v0i0 in #950
- Remove unrolling with tma + pipelining by @PaulZhang12 in #994
- Add provenance annotations to output code by @jansel in #988
Full Changelog: v0.1.8...v0.2.0
v0.1.8
What's Changed
- fix rmsnorm fwd tritonbench by @v0i0 in #840
- Update input shapes for example kernels by @yf225 in #845
- Extend eviction policy tests to all indexing types by @oulgen in #833
- [Docs] Remove early development warning by @oulgen in #846
- [Docs] Add link to gpumode discord by @oulgen in #847
- [Docs] Add PTC promotional material by @oulgen in #848
- [Benchmark] Add low mem dropout example by @karthickai in #641
- Update lint.yml by @oulgen in #854
- Remove `hl.register_reduction_dim` API by @yf225 in #834
- Error message for boolean masking or torch.nonzero by @yf225 in #687
- Remove hardcoded `block_size=1` usage in attention kernel example by @yf225 in #843
- Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
- Decrease `num_stages` default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
- Allow user-defined specialization key by @jansel in #853
- [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
- Remove legacy `register_inductor_lowering` code by @yf225 in #864
- Set setstate/getstate methods to Config by @jansel in #868
- [doc] Add deployment/autotuning guide by @jansel in #869
- [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
- Fix sphinx warnings by @jansel in #871
- Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
- [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
- [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
- [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
- Print Triton code when error for easier debugging by @yf225 in #874
- Terminate autotuning faster if progress is minimal by @oulgen in #855
- Update README.md by @oulgen in #877
- [CI] pin b200 to pytorch2.9 by @oulgen in #878
- [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
- [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
- Install git for benchmarks by @oulgen in #882
- Pin AMD to 6.4.4 by @oulgen in #883
- Faster int4 gemm by @PaulZhang12 in #751
- Pin AMD to 6.4.4 by @oulgen in #881
- Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
- [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
- [Benchmark] Use bespoke setup-python action by @oulgen in #885
- [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
- Add dependabot by @oulgen in #888
- Update dependabot.yml by @oulgen in #891
- chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
- chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
- chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
- chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
- Upgrade ruff==0.14.0 by @jansel in #889
- [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
- chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
- [Benchmark] use logger.exception for process errors by @oulgen in #902
- [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
- Query minimum dot size for XPU by @EikanWang in #900
- Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
- [CI] Pin amd to rocm7.0 by @oulgen in #907
- [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
- [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
- [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
- Remove cache around set_triton_allocator by @oulgen in #912
- Add int4_gemm by @oulgen in #917
- chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
- Catch missing cudnn error by @jansel in #873
- Add progress bar for precompiling by @jansel in #919
- Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
- Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
- Avoid setting default `--input-sample-mode` to `equally-spaced-k` by @yf225 in #922
- Remove `triton_helpers.*` usage in lifted device function arguments by @yf225 in #849
- Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
- Suggest use of `@helion.kernel(index_dtype=torch.int64)` if index offset is out of bound for int32 by @yf225 in #850
- Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
- Support `hl.arange()` with non-power-of-2 input by @yf225 in #862
- Setting up RunLLm AI Chatbot by @sekyondaMeta in #925
- Generalize examples with the DEVICE variable by @adam-smnk in #915
- Fix lint error by @jansel in #926
- Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
- Support tile+offset and tensor descriptors by @jansel in #928
- Fix triton/torch.compile compatibility issue by @jansel in #927
- Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
- Update the Agent ID by @sekyondaMeta in #931
- [Benchmark CI] Use `--non-square` flag for gemm by @yf225 in #938
New Contributors
- @dependabot[bot] made their first contribution in #893
- @tianrengao made their first contribution in #748
Full Changelog: v0.1.7...v0.1.8
v0.1.7
What's Changed
- Generalize the cuda-bias test cases by replacing hardcoded "cuda" literal with the DEVICE variable by @EikanWang in #775
- Make progress bar prettier by @oulgen in #786
- Upgrade ruff==0.13.3 pyright==1.1.406 by @jansel in #790
- Add hl.split and hl.join by @jansel in #791
- Generalize test_print and test_tensor_descriptor to support different accelerators by @EikanWang in #801
- Limit rebench to 1000 iterations by @jansel in #789
- Turn down autotuner defaults by @jansel in #788
- Enable torch.xpu._XpuDeviceProperties in Helion kernel by @EikanWang in #798
- Better error message for augmented assignment (e.g. +=) on host tensor without subscript by @yf225 in #807
- Add Pattern Search autotuning algorithm to docs. by @choijon5 in #810
- Support 0dim tensor in output code printing by @oulgen in #806
- Set range_num_stages <= 1 if using tensor_descriptor, to avoid CUDA misaligned address error by @yf225 in #792
- Add hl.inline_triton API by @jansel in #811
- Add out_dtype arg to hl.dot by @jansel in #813
- Add autotune_config_overrides by @jansel in #814
- Reduce initial_population to 100 by @jansel in #800
- Disable range_num_stages for kernels with aliasing by @jansel in #812
- Adding new setting, autotune_max_generations, that allows user to set the maximum number of generations for autotuning by @choijon5 in #796
- Increase tolerance for test_matmul_reshape_m_2 by @jansel in #816
- Update docs by @jansel in #815
- Fix torch version check by @adam-smnk in #818
- [Benchmark] Keep going when a single benchmark fails by @oulgen in #820
- Faster Helion JSD by @PaulZhang12 in #733
- Faster KL Div by @PaulZhang12 in #822
- Normalize device name and decorate cuda-only test cases by @EikanWang in #819
- Improved log messages for autotuning by @choijon5 in #817
- Apply simplification to range indexing in order to reuse block size symbols by @yf225 in #809
- Fix hl.rand to use tile specific offsets instead of fixed offsets, ensure unique random num per tile by @karthickai in #685
- Match cuda versions for benchmark by @oulgen in #828
- Print nvidia-smi/rocminfo by @oulgen in #827
- Dump nvidia-smi/rocminfo on benchmarks by @oulgen in #829
- Add 3.14 support by @oulgen in #830
- Remove py312 vanilla test by @oulgen in #831
- Pad to next power of 2 for hl.specialize'ed shape value used in device tensor creation by @yf225 in #804
- Autotune eviction policy by @oulgen in #823
- [Docs] Consistent pre-commit/lint by @oulgen in #836
- [Docs] Recommend venv instead of conda by @oulgen in #837
- [Docs] Helion works on 3.10 through 3.14 by @oulgen in #838
- [Docs] Add eviction policy by @oulgen in #839
- Update to use the new attribute setting for tf32. by @choijon5 in #835
Full Changelog: v0.1.6...v0.1.7
v0.1.6
What's Changed
- ci: Always auth for benchmarking workflows by @seemethere in #719
- [Benchmark] jagged_sum kernel and test by @Sibylau in #676
- Skip default config printing if in ref eager mode by @yf225 in #721
- [Benchmark CI] Make benchmark runner respect custom CLI args by @yf225 in #723
- Upgrade rocm CI to 7.0 by @oulgen in #720
- Add eviction policy argument to tl.load by @oulgen in #714
- [CI] use complete rocm docker images by @oulgen in #724
- More inconsistent naming by @oulgen in #725
- [Benchmark] jagged_layer_norm kernel and test by @Sibylau in #704
- [Bug fix] Preserve masks on reduction inputs that depend on reduction outputs; fix layer_norm accuracy check failure by @yf225 in #722
- Support torch.matmul with 3D inputs by @yf225 in #715
- Slightly improve logs by @angelayi in #740
- Autotuning Progress Bar by @msaroufim in #739
- make tritonbench optional in run.py so install works again by @v0i0 in #746
- fix new factory when size comes from kwargs by @v0i0 in #750
- Add linting instructions to README by @msaroufim in #763
- Add backward kernel for exp by @aditvenk in #736
- fix roll reduction meta when for ops with none output (like wait), cl… by @v0i0 in #767
- Move upload benchmark results to a separate workflows by @huydhn in #758
- Add flash_attention to benchmarks by @oulgen in #769
- Fix jagged_layer_norm linter error by @yf225 in #770
- Add SIGINT handler for clean interrupt of autotuning background processes by @msaroufim in #766
- Enable tensor descriptor for XPU by @EikanWang in #765
- Fix the issue that the XPU kernels cannot be cached well by @EikanWang in #761
- Print Helion kernel source line in symbolic shape debugging by @yf225 in #771
- ci: Set fail-fast to false by @seemethere in #776
- Add XPU support for RNG operations by @EikanWang in #774
- Enable test_dot for XPU by @EikanWang in #773
- Handle XPU compilation error by @adam-smnk in #779
- Fix type prop for and/or by @oulgen in #781
- Make print output code more robust by @oulgen in #780
- Revert "Add SIGINT handler for clean interrupt of autotuning background processes" by @oulgen in #784
- Add torch compile unit test to helion by @oulgen in #782
New Contributors
- @seemethere made their first contribution in #719
- @angelayi made their first contribution in #740
- @msaroufim made their first contribution in #739
- @aditvenk made their first contribution in #736
- @EikanWang made their first contribution in #765
- @adam-smnk made their first contribution in #779
Full Changelog: v0.1.5...v0.1.6
v0.1.5
v0.1.4
What's Changed
- Update benchmark.yml by @oulgen in #570
- Update benchmark.yml by @oulgen in #571
- [Benchmark] Use custom kernel metric mappings list to accommodate for inconsistent namings by @oulgen in #567
- Add rms norm and cross entropy by @oulgen in #568
- Update benchmark_dispatch.yml by @oulgen in #573
- Update linters by @oulgen in #569
- Print config for PassManager::run triton errors by @jansel in #565
- Error when invalid loop reduction number config is generated by @oulgen in #572
- Add `skipIfLowVRAM` or `use_default_config=False` to specific unit tests to enable local testing by @yf225 in #574
- Fix bug with block_size smaller than minimum by @jansel in #575
- Better shape errors for mismatched tile sizes by @jansel in #566
- Print warning if block_size is specified in interpret mode. by @choijon5 in #576
- Run all shapes for benchmarks by @oulgen in #578
- [Benchmarks] Cooldown the GPU before recording results by @oulgen in #579
- [Benchmark] Fix layer_norm accuracy issue by @yf225 in #580
- [Benchmark] Remove hardcoded num_inputs for rms_norm kernel by @yf225 in #581
- Do not benchmark twice by @oulgen in #583
- Add missing functions to docs by @jansel in #586
- hl.atomic_add: support 1D tensor as index by @yf225 in #587
- Add atomic and/or/min/max/cas/xchg by @jansel in #589
- Add test shard with HELION_DEBUG_DTYPE_ASSERTS=1, only run one ref-eager shard by @jansel in #590
- Add link to github to docs by @jansel in #591
- Support layernorm without bias by @mengluy0125 in #585
- Allow passing tritonbench operator instance into kernel benchmark wrapper; Always return lambda for timing measurement by @yf225 in #596
- Add layer_norm backward kernels by @yf225 in #588
- Fix tf32 warning by @jansel in #592
- [Benchmark] geglu example and test by @Sibylau in #582
- Print default config when running with it by @oulgen in #599
- [Benchmark] swiglu example and test by @Sibylau in #584
- Login to Docker from the workflows by @huydhn in #601
- Add rms_norm backward kernels by @mengluy0125 in #597
- Revert "Login to Docker from the workflows" by @oulgen in #604
- Fix static shape typo by @oulgen in #609
- Add small dim size (<16) support to hl.dot and torch.addmm; Always prefer using `tl.dot(acc=...)` for addmm / baddbmm by @yf225 in #564
- Fix rms_norm and layer_norm by @mengluy0125 in #603
- [Benchmark] jsd kernel and test by @Sibylau in #611
- Refactor autotune error handling by @jansel in #595
- Possible fix for CI failures by @jansel in #617
- [Benchmark] Welford kernel and example by @karthickai in #614
- [Benchmark] kl_div kernel and test by @Sibylau in #615
- Ignore TServiceRouterException errors while autotuning by @jansel in #618
- [Example] int4_gemm kernel example and tritonbench integration by @yf225 in #613
- Set requires_grad=True for rms_norm backward inputs by @yf225 in #629
- Adjust tolerance for test_rms_norm_bwd_dx by @yf225 in #628
- Add more kernels to benchmarking by @oulgen in #632
- Reorder benchmarks by @oulgen in #633
- [Ref Mode] Fix hl.store for complex mask pattern by @yf225 in #621
- Support using block size var outside of hl.tile loop by @yf225 in #619
- [Benchmark CI] Print input shapes and surface problematic Helion config by @yf225 in #626
- Fix ValueError: numel (2097152) exceeds triton maximum tensor numel (1048576) by @mengluy0125 in #625
- Always clear inductor cache before benchmark by @yf225 in #608
- Make hl.specialize work on sequences by @jansel in #636
- Better error for passing Tile to hl.tile by @jansel in #640
- [Example] grouped_gemm kernel example and tritonbench integration by @yf225 in #620
- int4_gemm: remove use_default_config=True by @yf225 in #639
- [Easy][Benchmark CI] Exit job on any exception, for easier error catching by @yf225 in #643
- Avoid skipping CUDA errors that crashes the CUDA context by @yf225 in #645
- Add `HELION_AUTOTUNE_RANDOM_SEED` env var and `autotune_random_seed` setting by @yf225 in #644
- Bump linter by @oulgen in #647
- Skip test_autotune_random_seed_from_env_var on rocm by @oulgen in #648
- Fix lint related to welford and also local_cache by @yf225 in #646
- Skip test_autotune_random_seed_from_settings on rocm by @yf225 in #651
- PT Sphinx Theme Test by @sekyondaMeta in #600
- Print `static_shapes` settings value along with config for accurate repro by @yf225 in #649
- [Benchmark] gather_gemv kernel and test by @Sibylau in #635
- Add HELION_SKIP_CACHE env by @jansel in #653
- [lint] Remove UP038 reference by @jansel in #650
- Fix `register_block_size` codegen by @yf225 in #659
- Raise better error when `hl.atomic_*` is used on device tensor by @yf225 in #658
- [Autotune] Filter bad config with accuracy check by @yf225 in #655
- Add hl.rand op with seed arg lowering to tl.rand by @karthickai in #652
- Log autotune random seed for easier repro by @yf225 in #661
- Fix misaligned address error for matmul by @yf225 in #662
- skip gather_gemv code check for B200 and fb_code by @Sibylau in #666
- rms_norm: get weight from function args by @yf225 in #664
- skip full autotune if configs are provided by @xuanzhang816 in #670
- [example] fused_linear_jsd by @v0i0 in #494
- Fix CI by moving B200 to cuda13 and downgrade a100/h100 to cuda12.8 by @oulgen in #674
- No background image by @sekyondaMeta in #663
- Remove github link from index.md by @oulgen in #675
- [Autotune] Allow skipping Triton compilation error by @yf225 in #679
- [Benchmark CI] Run one kernel per gpu to maximize successful kernel reporting by @yf225 in #681
- Fix missing block size constexpr assignment in host code by @yf225 in #678
- [CI] Fix missing setuptools by @yf225 in #680
- faster rms norm backwards kernel by @v0i0 in #624
- [Benchmark CI] Use do_bench cudagraph to avoid profiler failure; select specific kernel impls to run by @yf225 in #682
- [Benchmark CI] use --op instead of --kernel for better tritonbench compat by @yf225 in #694
- Increase tolerance for _validate_against_baseline by @jansel in #691
- [Benchmark CI] Use equally-spaced K input shapes by @yf225 in #689
- Print bad default config if compute baseline fails by @yf225 in #688
- Support HELION_AUTOTUNE_ACCURACY_CHECK=0 by @jansel in #692
- rms norm: improve fwd perf by @v0i0 in #669
- Revert "Add hl.rand op with seed arg lowering to tl.rand (#652)" by @jansel in #698
- [Autotune] Skip Triton shared memory OOM by @yf225 in https://git...
v0.1.3
What's Changed
- Add torch compile to benchmark by @oulgen in #545
- Fix issues with wrong dtypes in generated code by @jansel in #542
- Limit concurrent precompile jobs while autotuning by @jansel in #543
- Create basic helion benchmark runner by @oulgen in #544
- Add multi selection radio buttons by @oulgen in #547
- Fix benchmark condition by @oulgen in #548
- Move to dispatcher model for benchmarking by @oulgen in #549
- Give permissions by @oulgen in #550
- Do not downgrade torch/triton by @oulgen in #551
- Use uv for pip freeze by @oulgen in #552
- Add jagged hstu attention example (i.e. ragged_attention) by @xuanzhang816 in #554
- Install quack/torchbench with no deps by @oulgen in #553
- Update test-reports dir by @oulgen in #556
- torch.rand_like and torch.randn_like support by @yf225 in #530
- [Benchmark] add addmm example and test by @Sibylau in #555
- Kick off benchmarks at midnight by @oulgen in #559
- Use profiler instead of inductor_benchmarker by @oulgen in #560
- Shard kernels by @oulgen in #561
- Add layer_norm and softmax by @oulgen in #562
- [Fix CI] Convert tiles to sizes for all torch.* functions by @yf225 in #563
Full Changelog: v0.1.2...v0.1.3
v0.1.2
What's Changed
- Support symbolic range with multiples of block-size as length by @yf225 in #509
- Handle new return type of triton's JITFunction.create_binder(). by @gueraf in #517
- Lower symbolic slices to hl.arange by @yf225 in #518
- Add call function to triton output for easier repros by @oulgen in #514
- Prevent naming conflicts in expr_from_string placeholder replacement by @yf225 in #519
- Display error message when too many arguments are passed by @oulgen in #526
- Fix missing min block size for hl.dot by @jansel in #522
- Swap from gcc13 to clang14 by @oulgen in #537
- Pass int arg instead of dummy tensor into example kernels by @yf225 in #538
- Fix local runs for test_triton_repro tests by @yf225 in #539
- Update expected test results by @oulgen in #541
- Prepare benchmark runner for running on CI by @oulgen in #534
- Error message for kernel without device loop by @jansel in #531
- Upgrade ruff/pyright by @jansel in #532
- Allow dynamic fill values in full by @jansel in #533
- Add torch.stack support by @yf225 in #524
- Fixes #447: throw an error when printing output code in eager mode by @Sibylau in #528
New Contributors
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- [Benchmark] Avoid using _run in TritonBench integration by @yf225 in #444
- Add H100 CI by @oulgen in #435
- Add B200 CI by @oulgen in #436
- Skip illegal memory access for autotuning by @oulgen in #453
- Re-enable associative_scan tests in ref eager mode by @yf225 in #443
- Fix tritonbench integration issue by @yf225 in #463
- [Benchmark] Allow passing kwargs; Set static_shape = True for better benchmark perf by @yf225 in #465
- [Example] One shot all reduce by @joydddd in #245
- Fix lint by @oulgen in #469
- Improve signal/wait doc by @joydddd in #478
- Cleanup ci by @oulgen in #449
- Run CI on mi325x by @oulgen in #441
- Improve Stacktensor Doc by @joydddd in #479
- Require tests to be faster than 30s by @oulgen in #471
- Improve error message when no good config is found by @oulgen in #455
- Add SequenceType Eq comparison by @oulgen in #482
- [Benchmark] Add try-catch for tritonbench import path by @yf225 in #487
- Add helion prefix to Triton kernel name by @yf225 in #486
- Support GraphModule inputs by @jansel in #488
- Improve stack trace for #457 by @jansel in #489
- [EZ] Replace `pytorch-labs` with `meta-pytorch` by @ZainRizvi in #490
- [generate_ast] providing AST args, and fall back to `api._codegen` when output is a tuple by @HanGuo97 in #481
- Support reshape with block_size expressions by @yf225 in #495
- [example] add jagged_softmax example by @pianpwk in #480
- Fix handling of fixed size reductions by @jansel in #499
- Improve error message for rank mismatch in control flow by @jansel in #502
- Fix reshape + sum case by @yf225 in #504
- Sort config keys alphabetically in `__str__` by @yf225 in #505
- Fix issue with fp64 constants by @jansel in #506
New Contributors
- @ZainRizvi made their first contribution in #490
- @HanGuo97 made their first contribution in #481
- @pianpwk made their first contribution in #480
Full Changelog: v0.1.0...v0.1.1