Hidet v0.6.0

@yaoyaoding released this 26 May 04:49 · 34 commits to main since this release · commit 20070aa

What's Changed

  • [Dependency] Move pygraphviz to a dev dependency
  • [Fix] Wrap up the release
  • [CI] Use default runner instead of large-runners
  • [Docs] Move release guide
  • [Apps] Phase out the app-level abstraction
  • [Security] Fix the security issue of using tempfile.mktemp() (see notes after this list)
  • [BUG] Simplify symbol variables
  • [CI] Add CI requirements installation
  • [Bugfix] Add signal handler to clean up NCCL sync file
  • [CI] Add permissions.contents: read to all Workflows
  • [Fix] Fix CI failure due to the interface change of mbarrier_try_wait
  • [PERF] Flatten batch dimensions for batched matmul
  • [CI] Fix Nightly Workflow
  • [CI] Use base Docker image for all functional tests
  • [Feature][CuTe] Add mbarrier operators in Hexcute
  • [FEATURE] Add flow graph visualization
  • [Release][Wrapup] Prepare for the release
  • [BUG][CI] Fix the CI failure caused by PyTorch version 2.7.0
  • [Fix] Fix matmul
  • [Bug] Release the reserved memory in hidet for kv cache
  • [Bugfix] Add rank information to flowgraph cache hash key
  • [BUG] Fix memory error triggered while compiling model with cuBLAS
  • [Tests] Add more tests for torch.compile and split op
  • [Bugfix] Change grid dimension to support large batch size
  • [PERF] Enable interval dispatch table by default
  • [FEATURE] fp8_scaled_mm
  • [Bugfix] Fixes and refactors to support Deepseek R1 compilation
  • [Fix] Fix the mma config name for int8 tensor core
  • [Graph Cache] Dump graph visual to cache when needed
  • [HOTFIX] Hot fix for current CI fails
  • [PERF] Speedup broadcast
  • [PERF] Improve Expr simplification
  • [Package] Refactor dependency configuration
  • [Dependencies] Upgrade black to 25.1.0
  • [Refactor] Refactor the property methods of data types
  • [FEATURE] fp8_mm
  • [Dependencies] Remove the restriction on jinja2 version in docs building
  • [Dependencies] Remove gpt2 example with tensorflow dependency
  • [PERF] Graph dispatch table optimization and support for nested shapes
  • [Hopper] Add wgmma instruction in Hexcute
  • [Feature] Support FlowGraph to CompiledGraph cache
  • [Fixbug] Fix a bug in the instantiate_symbols pass
  • [Options] Add options to control two nvcc compilation flags
  • [Enhancement] Add function to gather unsupported ops
  • [Pass] Optimization for addition chain
  • [Perf] Support fused_moe_awq_gptq
  • [Feature] Finalize warp specialization
  • [CI][Fix] Trap heartbeat logging to ensure it exits if build-docs fails
  • [Bugfix] Cast tensor shape to int64 when computing tensor nbytes
  • [Codegen][Runtime] Add try-catch to protect the public function
  • [PERF] Implement identity op
  • [FIX][BUG] Remove assign statement in CuTe code generation
  • [CI][Fix] Trap heartbeat logging to ensure it exits for any failed test
  • [IR][Runtime] Support pointer type symbolic variable
  • [BUG] Fix complex expressions showing up in shapes
  • [Hopper] Add a cost model
  • [CI][Fix] Use the same build-docs fix in all test workflows
  • [BUG] Fix cuBLAS error that occurred when serving the Llama-3.1-8B model
  • [BUG] Fix torch.nn.functional.group_norm implementation
  • [Utils] Add utility function to launch compute-sanitizer
  • [FEATURE] Various Operator Support + Bug Fixes
  • [Feature] Add wgmma fence operand
  • [Hidet Script] Add support for lambda and fix assignment issue
  • [BUG] Fix the flatten tensor index pass
  • [Wheel] Fix a bug when installing the package with pip install .
  • [Fix][Primitives] Fix cp_async_bulk_tensor_s2g
  • [BUG] Fix broken mbarrier CI test
  • [PERF] Add graph rewrite rule: Transpose(B) + Matmul -> MatmulNT (see notes after this list)
  • [FEATURE] Add support for warp specialization context managers
  • [Docs] Update the copyright year
  • [Fix] Fix reduce test failure on H100
  • [CI][PR Title] Allow multiple categories in PR title
  • [CI] Fix test workflow GPU params for push-event GitHub Actions
  • [CI] Use hidet api in benchmarking for Llama MLP layer
  • [CI] Run Tests workflow on L4 and H100 by default; only require L4 success
  • [CI] Add extra logging to build-docs; set lower priority for the make step
  • [PERF] Minimized version of dispatch table with options
  • [Operators] Add support for operator.floordiv (see notes after this list)
  • [Fixbug] Fix a bug in runtime that does not update workspace size
  • [FIX] Fix deploy doc workflows
  • [BUG] Avoid computing a view when calling Tensor.torch for fp8 tensors
  • [BUG] Fix choosing stream in hidet.Event.record()
  • [PROJECT] Create pyproject.toml
  • [FEATURE] Automatically deploy docs to website
  • [FEATURE] Add fp8 wgmma, mma support
  • [HIP] Add HipGraph class and related HIP graph functionalities
  • [BUGFIX] Fix vLLM backend parallel build failure
  • [FEATURE] Add fp8 (e4m3,e5m2) support
  • [AMD] mma for float16 and float32 on AMD GPU (gfx90a/MI200s)
  • [FEATURE] Increase the accuracy of benchmarking of small kernels
  • [HIP] Switch to using HIP Python for HIP runtime wrappers
  • [Bug] Fix the way to lint the Python source code
  • [AMD] Support batch matmul with matrix core
  • [FEATURE] Tensor view operator
  • [CI] Use torch 2.6.0 for Perf Tests
  • [AMD] mfma instructions for gfx90a
  • [FEATURE] In compilation server clean memory after every compilation
  • [FEATURE] Use permanent processes to handle fixed commits in the compilation server
  • [HIP] Support f32 matmul and llama end to end example
  • [Format] Show a progress bar for the formatting process
  • [HIP] Resnet end to end example
  • [COMPTIME] Parallel task build + parallel tuning
  • [CI] Make Linear without bias for Regression
  • [Workflow] Fix bug in PR title checking workflow to allow xxx
  • [CI] Fix attention mask shapes in regression
  • [CI] Update github action version
  • [CI] Split the tests of operators into two folders to speed up CI
  • [PERF] Move parallel_k to the search space of the Hexcute matmul kernel
  • [BUG] Fix for torch==2.6.0. Attempt 2
  • Add parallel_k to the tuning space of the matmul kernel
  • [Enhancement] Support cuBLAS for matmul_nt
  • [BUG] Fix for torch==2.6.0
  • [FEATURE] Postpone import torch
  • [COMPTIME] Prepare for nested parallelization
  • [HIP] Autoscheduler
  • [HIP] Support HIP for Hidet script
  • [FIX] Fix typo: 'compilaion' -> 'compilation'
  • [Enhancement] Support important dynamic patterns in LLMs
  • [HIP] Event and stream support for HIP runtime
  • [Test] Check hidet import time
  • [BUG] Change the broken test cases for scatter_ operators
  • [HIP] Hip runtime - memory and device
  • [CI] Add Linear to Regression
  • [Enhance] Extend the search space of Hexcute matmul and turn it on by default
  • [Workflow] Add a workflow for PR title checking
  • [CI] Modify generative Regression tests
  • [FEATURE] Disable garbage collector during benchmarking (see notes after this list)
  • [OPS] Several changes resulting from debugging torch.compile(Sampler.forward())
  • [Fix] Add scale argument to the sdpa function
  • [CI] Skip test_matmul_bf16_sm90 on non-Hopper GPUs
  • [PERF] Change GPU clock frequency for benchmarking inside hidet
  • Kaihang/matmul bf16 wgmma swizzle
  • [DLPack] Remove workaround for bool type in dlpack
  • [Utils] Enable faulthandler in hidet to print a traceback on segmentation fault
  • [Tests/CI] Update tests and add a temporary AMD CI
  • [RUNTIME] Add missing try-catch guards
  • [vLLM] Use example_inputs to determine shapes
  • Add support for torch.empty_like
  • [Bug] Fix an error in cudnn runtime calls
  • [BUG] Fix a bug caused by parallel compilation with on-demand WGMMA instruction registration
  • [DataType] Add float8_e4m3 data type
  • [Stream] Change the implementation of get_current_stream
  • [REFACT] Refactor parallel compilation/tuning
  • [PERF] Reduce the execution time of import hidet
  • Kaihang/matmul f16 wgmma with swizzle layout
  • Change fast div transform log level to debug
  • [COMPSERVER] Use the same compilation server port for both server and client
  • [CI] Remove batch_matmul tests from Regression
  • [COMPSERVER] Speed up compilation server
  • [CI] Update Docker image for Regression to torch v2.5.1
  • [OPs] Add hardtanh in-place variant
  • [Enhancement] Add option and functionality to set torch stream as the current stream
  • [BUG] Several changes prompted by the release
  • Add a flag for Hexcute kernels
  • Update requirements.txt
  • [CI] Normalize CI runner naming
  • [BUG] Hotfix for compilation server requirements.txt
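
Notes on selected changes

On the tempfile.mktemp() security fix: mktemp() returns a path without creating the file, leaving a window in which another process can create or symlink that path before it is opened. The sketch below shows the standard-library replacement pattern; it illustrates the class of fix, not necessarily the exact change in the PR.

```python
import os
import tempfile

# Unsafe: mktemp() only picks a name; the file is not created, so another
# process can race to create (or symlink) the same path first.
# path = tempfile.mktemp()

# Safe: mkstemp() atomically creates and opens the file (mode 0o600).
fd, path = tempfile.mkstemp(suffix=".bin")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"scratch data")
finally:
    os.remove(path)
```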
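
On the Transpose(B) + Matmul -> MatmulNT rewrite: the rule relies on the identity C = A @ B.T being computable by an "NT" matmul that reads B in its original row-major layout, so the transpose never has to be materialized. A minimal NumPy check of that identity (illustrative only; the actual rewrite operates on hidet's FlowGraph):

```python
import numpy as np

A = np.random.rand(64, 128).astype(np.float32)
B = np.random.rand(32, 128).astype(np.float32)

# Pattern before the rewrite: materialize B.T, then a plain matmul.
C_ref = A @ B.T

# What an NT matmul computes while reading B untransposed:
# C[i, j] = sum_k A[i, k] * B[j, k]
C_nt = np.einsum("ik,jk->ij", A, B)

assert np.allclose(C_ref, C_nt, atol=1e-5)
```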
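
On operator.floordiv support: `//` applied to traced tensors shows up as operator.floordiv in the FX graph, which the backend can now lower. A hedged usage sketch, assuming a recent hidet that registers the "hidet" torch.compile backend (as in its documentation) and an available CUDA device:

```python
import torch
import hidet  # assumption: registers the "hidet" dynamo backend on import

def f(x: torch.Tensor) -> torch.Tensor:
    # Tensor `//` is traced as operator.floordiv in the FX graph.
    return (x * 10.0).long() // 3

compiled = torch.compile(f, backend="hidet")
y = compiled(torch.randn(8, device="cuda"))
```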
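
On disabling the garbage collector during benchmarking: a GC pass landing inside the timed region adds noise that can dwarf the runtime of small kernels. A minimal sketch of the general pattern; `bench` here is a hypothetical helper, not hidet's actual harness:

```python
import gc
import time

def bench(fn, warmup=10, repeat=100):
    """Average the runtime of fn() with the cyclic GC paused."""
    for _ in range(warmup):
        fn()
    was_enabled = gc.isenabled()
    gc.disable()  # keep collector pauses out of the timed region
    try:
        start = time.perf_counter()
        for _ in range(repeat):
            fn()
        return (time.perf_counter() - start) / repeat
    finally:
        if was_enabled:
            gc.enable()
```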

Full Changelog: v0.5.0...v0.6.0