Releases: pytorch/helion

v0.2.1

26 Oct 23:16
c5dbbbe

What's Changed

New Contributors

Full Changelog: v0.2.0...v0.2.1

v0.2.0

20 Oct 20:54
3a0e975

What's Changed

  • Verify compiled kernels in subprocess by @jansel in #914
  • Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
  • Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930
  • Support warp specialization on B200 by @oulgen in #935
  • Update README.md by @oulgen in #943
  • Register tile symbol origin, to support tile + offset use case in blackwell attention by @yf225 in #939
  • [CI] Print failed tests by @oulgen in #942
  • Update examples to use run_example by @jansel in #941
  • blackwell attn with triton attr set by @v0i0 in #918
  • Set static_shapes=True by @oulgen in #937
  • run.py env var to skip exception logging by @v0i0 in #946
  • Fix bug with unit sized dims and block_sizes by @jansel in #932
  • Update static_shapes docs by @jansel in #951
  • Add tile.count by @oulgen in #955
  • Auto detect low vram by @oulgen in #956
  • [CI] Use official PyTorch 2.9 by @oulgen in #962
  • Use interleaved_bench for run_example by @jansel in #945
  • Generalize tile_with_offset pass by @jansel in #949
  • Docstring updates by @jansel in #952
  • Import updates by @jansel in #953
  • Add missing environment variables to docs by @jansel in #957
  • Print out errors vs timeouts in autotuning status by @jansel in #960
  • Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
  • Exit autotuning faster on KeyboardInterrupt by @jansel in #963
  • Remove default settings by @jansel in #964
  • Add missing settings environment variables by @jansel in #965
  • Skip test_differential_evolution_search due to slowness by @jansel in #968
  • [Benchmark CI] Give nightly job permissions by @oulgen in #970
  • [Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
  • [Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
  • [blackwell attn example] qk scale as param by @v0i0 in #969
  • [Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
  • Add initial backwards compatibility tests by @oulgen in #958
  • Remove unrolling + warp spec by @PaulZhang12 in #967
  • [Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
  • [Benchmark] Fix tritonbench auto-installation by @yf225 in #980
  • [Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
  • Make fork default precompilation strategy by @oulgen in #979
  • [benchmarks] change tritonbench path by @xuzhao9 in #966
  • Add skipIfA10G decorator by @yf225 in #982
  • Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
  • Layer Norm bwd kernel to support large B*M case used by internal by @yf225 in #973
  • Fix timeouts in autotuning by @jansel in #985
  • Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
  • Remove extra debug log for timeouts by @jansel in #987
  • Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
  • Generalize test cases to support XPU by @EikanWang in #983
  • Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
  • Update README.md by @oulgen in #992
  • Update README.md by @oulgen in #993
  • Mamba2 Chunk Scan & State by @v0i0 in #950
  • Remove unrolling with tma + pipelining by @PaulZhang12 in #994
  • Add provenance annotations to output code by @jansel in #988
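
Several entries above add or change autotuning environment variables (`HELION_FORCE_AUTOTUNE`, `HELION_AUTOTUNE_IGNORE_ERRORS`, `HELION_AUTOTUNE_PRECOMPILE`). As a minimal sketch of how such knobs are typically wired up — the variable names come from the changelog entries above, but their exact semantics and accepted values are assumptions, and they must be set before Helion reads them:

```python
import os

# Sketch only: names taken from the release notes above; the values and
# their precise meaning are assumptions, not documented behavior.
os.environ["HELION_AUTOTUNE_IGNORE_ERRORS"] = "1"  # tolerate failing candidate configs
os.environ["HELION_AUTOTUNE_PRECOMPILE"] = "fork"  # fork-based precompilation (the new default)
os.environ["HELION_FORCE_AUTOTUNE"] = "1"          # skip the autotune cache

# Environment variables only take effect if set before the library
# consults them, i.e. before importing Helion / compiling kernels.
print(sorted(k for k in os.environ if k.startswith("HELION_")))
```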

Full Changelog: v0.1.8...v0.2.0

v0.1.8

15 Oct 00:37
b77301f

What's Changed

  • fix rmsnorm fwd tritonbench by @v0i0 in #840
  • Update input shapes for example kernels by @yf225 in #845
  • Extend eviction policy tests to all indexing types by @oulgen in #833
  • [Docs] Remove early development warning by @oulgen in #846
  • [Docs] Add link to gpumode discord by @oulgen in #847
  • [Docs] Add PTC promotional material by @oulgen in #848
  • [Benchmark] Add low mem dropout example by @karthickai in #641
  • Update lint.yml by @oulgen in #854
  • Remove hl.register_reduction_dim API by @yf225 in #834
  • Error message for boolean masking or torch.nonzero by @yf225 in #687
  • Remove hardcoded block_size=1 usage in attention kernel example by @yf225 in #843
  • Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
  • Decrease num_stages default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
  • Allow user-defined specialization key by @jansel in #853
  • [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
  • Remove legacy register_inductor_lowering code by @yf225 in #864
  • Set setstate/getstate methods to Config by @jansel in #868
  • [doc] Add deployment/autotuning guide by @jansel in #869
  • [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
  • Fix sphinx warnings by @jansel in #871
  • Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
  • [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
  • [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
  • [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
  • Print Triton code when error for easier debugging by @yf225 in #874
  • Terminate autotuning faster if progress is minimal by @oulgen in #855
  • Update README.md by @oulgen in #877
  • [CI] pin b200 to pytorch2.9 by @oulgen in #878
  • [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
  • [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
  • Install git for benchmarks by @oulgen in #882
  • Pin AMD to 6.4.4 by @oulgen in #883
  • Faster int4 gemm by @PaulZhang12 in #751
  • Pin AMD to 6.4.4 by @oulgen in #881
  • Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
  • [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
  • [Benchmark] Use bespoke setup-python action by @oulgen in #885
  • [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
  • Add dependabot by @oulgen in #888
  • Update dependabot.yml by @oulgen in #891
  • chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
  • chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
  • chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
  • chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
  • Upgrade ruff==0.14.0 by @jansel in #889
  • [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
  • chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
  • [Benchmark] use logger.exception for process errors by @oulgen in #902
  • [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
  • Query minimum dot size for XPU by @EikanWang in #900
  • Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
  • [CI] Pin amd to rocm7.0 by @oulgen in #907
  • [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
  • [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
  • [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
  • Remove cache around set_triton_allocator by @oulgen in #912
  • Add int4_gemm by @oulgen in #917
  • chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
  • Catch missing cudnn error by @jansel in #873
  • Add progress bar for precompiling by @jansel in #919
  • Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
  • Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
  • Avoid setting default --input-sample-mode to equally-spaced-k by @yf225 in #922
  • Remove triton_helpers.* usage in lifted device function arguments by @yf225 in #849
  • Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
  • Suggest use of @helion.kernel(index_dtype=torch.int64) if index offset is out of bound for int32 by @yf225 in #850
  • Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
  • Support hl.arange() with non-power-of-2 input by @yf225 in #862
  • Setting up RunLLM AI Chatbot by @sekyondaMeta in #925
  • Generalize examples with the DEVICE variable by @adam-smnk in #915
  • Fix lint error by @jansel in #926
  • Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
  • Support tile+offset and tensor descriptors by @jansel in #928
  • Fix triton/torch.compile compatibility issue by @jansel in #927
  • Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
  • Update the Agent ID by @sekyondaMeta in #931
  • [Benchmark CI] Use --non-square flag for gemm by @yf225 in #938

New Contributors

Full Changelog: v0.1.7...v0.1.8

v0.1.7

08 Oct 19:16
269deb3

What's Changed

  • Generalize the cuda-bias test cases by replacing hardcoded "cuda" literal with the DEVICE variable by @EikanWang in #775
  • Make progress bar prettier by @oulgen in #786
  • Upgrade ruff==0.13.3 pyright==1.1.406 by @jansel in #790
  • Add hl.split and hl.join by @jansel in #791
  • Generalize test_print and test_tensor_descriptor to support different accelerators by @EikanWang in #801
  • Limit rebench to 1000 iterations by @jansel in #789
  • Turn down autotuner defaults by @jansel in #788
  • Enable torch.xpu._XpuDeviceProperties in Helion kernel by @EikanWang in #798
  • Better error message for augmented assignment (e.g. +=) on host tensor without subscript by @yf225 in #807
  • Add Pattern Search autotuning algorithm to docs. by @choijon5 in #810
  • Support 0dim tensor in output code printing by @oulgen in #806
  • Set range_num_stages <= 1 if using tensor_descriptor, to avoid CUDA misaligned address error by @yf225 in #792
  • Add hl.inline_triton API by @jansel in #811
  • Add out_dtype arg to hl.dot by @jansel in #813
  • Add autotune_config_overrides by @jansel in #814
  • Reduce initial_population to 100 by @jansel in #800
  • Disable range_num_stages for kernels with aliasing by @jansel in #812
  • Adding new setting, autotune_max_generations, that allows user to set the maximum number of generations for autotuning by @choijon5 in #796
  • Increase tolerance for test_matmul_reshape_m_2 by @jansel in #816
  • Update docs by @jansel in #815
  • Fix torch version check by @adam-smnk in #818
  • [Benchmark] Keep going when a single benchmark fails by @oulgen in #820
  • Faster Helion JSD by @PaulZhang12 in #733
  • Faster KL Div by @PaulZhang12 in #822
  • Normalize device name and decorate cuda-only test cases by @EikanWang in #819
  • Improved log messages for autotuning by @choijon5 in #817
  • Apply simplification to range indexing in order to reuse block size symbols by @yf225 in #809
  • Fix hl.rand to use tile specific offsets instead of fixed offsets, ensure unique random num per tile by @karthickai in #685
  • Match cuda versions for benchmark by @oulgen in #828
  • Print nvidia-smi/rocminfo by @oulgen in #827
  • Dump nvidia-smi/rocminfo on benchmarks by @oulgen in #829
  • Add 3.14 support by @oulgen in #830
  • Remove py312 vanilla test by @oulgen in #831
  • Pad to next power of 2 for hl.specialize'ed shape value used in device tensor creation by @yf225 in #804
  • Autotune eviction policy by @oulgen in #823
  • [Docs] Consistent pre-commit/lint by @oulgen in #836
  • [Docs] Recommend venv instead of conda by @oulgen in #837
  • [Docs] Helion works on 3.10 through 3.14 by @oulgen in #838
  • [Docs] Add eviction policy by @oulgen in #839
  • Update to use the new attribute setting for tf32. by @choijon5 in #835

Full Changelog: v0.1.6...v0.1.7

v0.1.6

02 Oct 22:32
3322ca9

What's Changed

  • ci: Always auth for benchmarking workflows by @seemethere in #719
  • [Benchmark] jagged_sum kernel and test by @Sibylau in #676
  • Skip default config printing if in ref eager mode by @yf225 in #721
  • [Benchmark CI] Make benchmark runner respect custom CLI args by @yf225 in #723
  • Upgrade rocm CI to 7.0 by @oulgen in #720
  • Add eviction policy argument to tl.load by @oulgen in #714
  • [CI] use complete rocm docker images by @oulgen in #724
  • More inconsistent naming by @oulgen in #725
  • [Benchmark] jagged_layer_norm kernel and test by @Sibylau in #704
  • [Bug fix] Preserve masks on reduction inputs that depend on reduction outputs; fix layer_norm accuracy check failure by @yf225 in #722
  • Support torch.matmul with 3D inputs by @yf225 in #715
  • Slightly improve logs by @angelayi in #740
  • Autotuning Progress Bar by @msaroufim in #739
  • make tritonbench optional in run.py so install works again by @v0i0 in #746
  • fix new factory when size comes from kwargs by @v0i0 in #750
  • Add linting instructions to README by @msaroufim in #763
  • Add backward kernel for exp by @aditvenk in #736
  • fix roll reduction meta for ops with none output (like wait), cl… by @v0i0 in #767
  • Move upload benchmark results to a separate workflows by @huydhn in #758
  • Add flash_attention to benchmarks by @oulgen in #769
  • Fix jagged_layer_norm linter error by @yf225 in #770
  • Add SIGINT handler for clean interrupt of autotuning background processes by @msaroufim in #766
  • Enable tensor descriptor for XPU by @EikanWang in #765
  • Fix the issue that the XPU kernels cannot be cached well by @EikanWang in #761
  • Print Helion kernel source line in symbolic shape debugging by @yf225 in #771
  • ci: Set fail-fast to false by @seemethere in #776
  • Add XPU support for RNG operations by @EikanWang in #774
  • Enable test_dot for XPU by @EikanWang in #773
  • Handle XPU compilation error by @adam-smnk in #779
  • Fix type prop for and/or by @oulgen in #781
  • Make print output code more robust by @oulgen in #780
  • Revert "Add SIGINT handler for clean interrupt of autotuning background processes" by @oulgen in #784
  • Add torch compile unit test to helion by @oulgen in #782

New Contributors

Full Changelog: v0.1.5...v0.1.6

v0.1.5

29 Sep 18:29
994aaf9

What's Changed

  • [Benchmark CI] Print config that causes tritonbench accuracy check failure by @yf225 in #716
  • Add AMD to benchmarks by @oulgen in #717
  • [Docs] Move docs requirements to docs/requirements.txt to make compatible with pypi by @oulgen in #718

Full Changelog: v0.1.4...v0.1.5

v0.1.4

29 Sep 16:17
0428d5d

What's Changed

  • Update benchmark.yml by @oulgen in #570
  • Update benchmark.yml by @oulgen in #571
  • [Benchmark] Use custom kernel metric mappings list to accommodate inconsistent naming by @oulgen in #567
  • Add rms norm and cross entropy by @oulgen in #568
  • Update benchmark_dispatch.yml by @oulgen in #573
  • Update linters by @oulgen in #569
  • Print config for PassManager::run triton errors by @jansel in #565
  • Error when invalid loop reduction number config is generated by @oulgen in #572
  • Add skipIfLowVRAM or use_default_config=False to specific unit tests to enable local testing by @yf225 in #574
  • Fix bug with block_size smaller than minimum by @jansel in #575
  • Better shape errors for mismatched tile sizes by @jansel in #566
  • Print warning if block_size is specified in interpret mode. by @choijon5 in #576
  • Run all shapes for benchmarks by @oulgen in #578
  • [Benchmarks] Cooldown the GPU before recording results by @oulgen in #579
  • [Benchmark] Fix layer_norm accuracy issue by @yf225 in #580
  • [Benchmark] Remove hardcoded num_inputs for rms_norm kernel by @yf225 in #581
  • Do not benchmark twice by @oulgen in #583
  • Add missing functions to docs by @jansel in #586
  • hl.atomic_add: support 1D tensor as index by @yf225 in #587
  • Add atomic and/or/min/max/cas/xchg by @jansel in #589
  • Add test shard with HELION_DEBUG_DTYPE_ASSERTS=1, only run one ref-eager shard by @jansel in #590
  • Add link to github to docs by @jansel in #591
  • Support layernorm without bias by @mengluy0125 in #585
  • Allow passing tritonbench operator instance into kernel benchmark wrapper; Always return lambda for timing measurement by @yf225 in #596
  • Add layer_norm backward kernels by @yf225 in #588
  • Fix tf32 warning by @jansel in #592
  • [Benchmark] geglu example and test by @Sibylau in #582
  • Print default config when running with it by @oulgen in #599
  • [Benchmark] swiglu example and test by @Sibylau in #584
  • Login to Docker from the workflows by @huydhn in #601
  • Add rms_norm backward kernels by @mengluy0125 in #597
  • Revert "Login to Docker from the workflows" by @oulgen in #604
  • Fix static shape typo by @oulgen in #609
  • Add small dim size (<16) support to hl.dot and torch.addmm; Always prefer using tl.dot(acc=...) for addmm / baddbmm by @yf225 in #564
  • Fix rms_norm and layer_norm by @mengluy0125 in #603
  • [Benchmark] jsd kernel and test by @Sibylau in #611
  • Refactor autotune error handling by @jansel in #595
  • Possible fix for CI failures by @jansel in #617
  • [Benchmark] Welford kernel and example by @karthickai in #614
  • [Benchmark] kl_div kernel and test by @Sibylau in #615
  • Ignore TServiceRouterException errors while autotuning by @jansel in #618
  • [Example] int4_gemm kernel example and tritonbench integration by @yf225 in #613
  • Set requires_grad=True for rms_norm backward inputs by @yf225 in #629
  • Adjust tolerance for test_rms_norm_bwd_dx by @yf225 in #628
  • Add more kernels to benchmarking by @oulgen in #632
  • Reorder benchmarks by @oulgen in #633
  • [Ref Mode] Fix hl.store for complex mask pattern by @yf225 in #621
  • Support using block size var outside of hl.tile loop by @yf225 in #619
  • [Benchmark CI] Print input shapes and surface problematic Helion config by @yf225 in #626
  • Fix ValueError: numel (2097152) exceeds triton maximum tensor numel (1048576) by @mengluy0125 in #625
  • Always clear inductor cache before benchmark by @yf225 in #608
  • Make hl.specialize work on sequences by @jansel in #636
  • Better error for passing Tile to hl.tile by @jansel in #640
  • [Example] grouped_gemm kernel example and tritonbench integration by @yf225 in #620
  • int4_gemm: remove use_default_config=True by @yf225 in #639
  • [Easy][Benchmark CI] Exit job on any exception, for easier error catching by @yf225 in #643
  • Avoid skipping CUDA errors that crash the CUDA context by @yf225 in #645
  • Add HELION_AUTOTUNE_RANDOM_SEED env var and autotune_random_seed setting by @yf225 in #644
  • Bump linter by @oulgen in #647
  • Skip test_autotune_random_seed_from_env_var on rocm by @oulgen in #648
  • Fix lint related to welford and also local_cache by @yf225 in #646
  • Skip test_autotune_random_seed_from_settings on rocm by @yf225 in #651
  • PT Sphinx Theme Test by @sekyondaMeta in #600
  • Print static_shapes settings value along with config for accurate repro by @yf225 in #649
  • [Benchmark] gather_gemv kernel and test by @Sibylau in #635
  • Add HELION_SKIP_CACHE env by @jansel in #653
  • [lint] Remove UP038 reference by @jansel in #650
  • Fix register_block_size codegen by @yf225 in #659
  • Raise better error when hl.atomic_* is used on device tensor by @yf225 in #658
  • [Autotune] Filter bad config with accuracy check by @yf225 in #655
  • Add hl.rand op with seed arg lowering to tl.rand by @karthickai in #652
  • Log autotune random seed for easier repro by @yf225 in #661
  • Fix misaligned address error for matmul by @yf225 in #662
  • skip gather_gemv code check for B200 and fb_code by @Sibylau in #666
  • rms_norm: get weight from function args by @yf225 in #664
  • skip full autotune if configs are provided by @xuanzhang816 in #670
  • [example] fused_linear_jsd by @v0i0 in #494
  • Fix CI by moving B200 to cuda13 and downgrade a100/h100 to cuda12.8 by @oulgen in #674
  • No background image by @sekyondaMeta in #663
  • Remove github link from index.md by @oulgen in #675
  • [Autotune] Allow skipping Triton compilation error by @yf225 in #679
  • [Benchmark CI] Run one kernel per gpu to maximize successful kernel reporting by @yf225 in #681
  • Fix missing block size constexpr assignment in host code by @yf225 in #678
  • [CI] Fix missing setuptools by @yf225 in #680
  • faster rms norm backwards kernel by @v0i0 in #624
  • [Benchmark CI] Use do_bench cudagraph to avoid profiler failure; select specific kernel impls to run by @yf225 in #682
  • [Benchmark CI] use --op instead of --kernel for better tritonbench compat by @yf225 in #694
  • Increase tolerance for _validate_against_baseline by @jansel in #691
  • [Benchmark CI] Use equally-spaced K input shapes by @yf225 in #689
  • Print bad default config if compute baseline fails by @yf225 in #688
  • Support HELION_AUTOTUNE_ACCURACY_CHECK=0 by @jansel in #692
  • rms norm: improve fwd perf by @v0i0 in #669
  • Revert "Add hl.rand op with seed arg lowering to tl.rand (#652)" by @jansel in #698
  • [Autotune] Skip Triton shared memory OOM by @yf225 in https://git...

v0.1.3

05 Sep 00:49
a61bd17

What's Changed

Full Changelog: v0.1.2...v0.1.3

v0.1.2

02 Sep 18:35
d2bf84d

What's Changed

  • Support symbolic range with multiples of block-size as length by @yf225 in #509
  • Handle new return type of triton's JITFunction.create_binder(). by @gueraf in #517
  • Lower symbolic slices to hl.arange by @yf225 in #518
  • Add call function to triton output for easier repros by @oulgen in #514
  • Prevent naming conflicts in expr_from_string placeholder replacement by @yf225 in #519
  • Display error message when too many arguments are passed by @oulgen in #526
  • Fix missing min block size for hl.dot by @jansel in #522
  • Swap from gcc13 to clang14 by @oulgen in #537
  • Pass int arg instead of dummy tensor into example kernels by @yf225 in #538
  • Fix local runs for test_triton_repro tests by @yf225 in #539
  • Update expected test results by @oulgen in #541
  • Prepare benchmark runner for running on CI by @oulgen in #534
  • Error message for kernel without device loop by @jansel in #531
  • Upgrade ruff/pyright by @jansel in #532
  • Allow dynamic fill values in full by @jansel in #533
  • Add torch.stack support by @yf225 in #524
  • Fixes #447: throw an error when printing output code in eager mode by @Sibylau in #528

New Contributors

Full Changelog: v0.1.1...v0.1.2

v0.1.1

21 Aug 20:40
2e1ea33

What's Changed

  • [Benchmark] Avoid using _run in TritonBench integration by @yf225 in #444
  • Add H100 CI by @oulgen in #435
  • Add B200 CI by @oulgen in #436
  • Skip illegal memory access for autotuning by @oulgen in #453
  • Re-enable associative_scan tests in ref eager mode by @yf225 in #443
  • Fix tritonbench integration issue by @yf225 in #463
  • [Benchmark] Allow passing kwargs; Set static_shape = True for better benchmark perf by @yf225 in #465
  • [Example] One shot all reduce by @joydddd in #245
  • Fix lint by @oulgen in #469
  • Improve signal/wait doc by @joydddd in #478
  • Cleanup ci by @oulgen in #449
  • Run CI on mi325x by @oulgen in #441
  • Improve Stacktensor Doc by @joydddd in #479
  • Require tests to be faster than 30s by @oulgen in #471
  • Improve error message when no good config is found by @oulgen in #455
  • Add SequenceType Eq comparison by @oulgen in #482
  • [Benchmark] Add try-catch for tritonbench import path by @yf225 in #487
  • Add helion prefix to Triton kernel name by @yf225 in #486
  • Support GraphModule inputs by @jansel in #488
  • Improve stack trace for #457 by @jansel in #489
  • [EZ] Replace pytorch-labs with meta-pytorch by @ZainRizvi in #490
  • [generate_ast] providing AST args, and fall back to api._codegen when output is a tuple by @HanGuo97 in #481
  • Support reshape with block_size expressions by @yf225 in #495
  • [example] add jagged_softmax example by @pianpwk in #480
  • Fix handling of fixed size reductions by @jansel in #499
  • Improve error message for rank mismatch in control flow by @jansel in #502
  • Fix reshape + sum case by @yf225 in #504
  • Sort config keys alphabetically in __str__ by @yf225 in #505
  • Fix issue with fp64 constants by @jansel in #506

New Contributors

Full Changelog: v0.1.0...v0.1.1