Hidet v0.6.0
What's Changed
- [Dependency] Move pygraphviz to dev dependency
- [Fix] Wrapup the release
- [CI] Use default runner instead of large-runners
- [Docs] Move release guide
- [Apps] Phase out the app-level abstraction
- [Security] Fix the security issue of using tempfile.mktemp()
- [BUG] Simplify symbol variables
- [CI] Add CI requirements installation
- [Bugfix] Add signal handler to clean up NCCL sync file
- [CI] Add `permissions.contents: read` to all workflows
- [Fix] fix CI failure due to interface change of `mbarrier_try_wait`
- [PERF] Flattening batch dimensions for batched matmul
- [CI] Fix Nightly Workflow
- [CI] Using base Docker image for all functional tests
- [Feature][CuTe] add `mbarrier` operators in Hexcute
- [FEATURE] Add flow graph visualization
- [Release][Wrapup] Prepare for the release
- [BUG][CI] Fix the CI failure caused by PyTorch version 2.7.0
- [Fix] fix matmul
- [Bug] Release the reserved memory in hidet for kv cache
- [Bugfix] Add rank information to flowgraph cache hash key
- [BUG] Fix memory error triggered while compiling model with cuBLAS
- [TESTs] Add more tests for `torch.compile` and `split` op
- [Bugfix] Change grid dimension to support large batch size
- [PERF] Enabled interval dispatch table by default
- [FEATURE] fp8_scaled_mm
- [Bugfix] Fixes and refactors to support Deepseek R1 compilation
- [Fix] Fix the mma config name for int8 tensor core
- [Graph Cache] Dump graph visual to cache when needed
- [HOTFIX] Hot fix for current CI fails
- [PERF] Speedup broadcast
- [PERF] Improving `Expr` simplification
- [Package] Refactor dependency configuration
- [Dependents] Upgrade black to 25.1.0
- [Refactor] Refactor the property methods of data types
- [FEATURE] fp8_mm
- [Dependents] Remove the restriction on jinja2 version in docs building
- [Dependents] Remove gpt2 example with tensorflow dependency
- [PERF] Graph dispatch table optimization and support for nested shapes
- [Hopper] add wgmma inst in hexcute
- [Feature] Support FlowGraph to CompiledGraph cache
- [Fixbug] Fix a bug in the `instantiate_symbols` pass
- [Options] Add options to control two nvcc compilation flags
- [Enhancement] Add function to gather unsupported ops
- [Pass] Optimization for addition chain
- [Perf] Support fused_moe_awq_gptq
- [Feature] finalize warp specialization
- [CI][Fix] trap heartbeat logging to ensure it exits if build-docs fails
- [Bugfix] Cast tensor shape to int64 when computing tensor nbytes
- [Codegen][Runtime] Add try-catch to protect the public function
- [PERF] Implement identity op
- [FIX][BUG] remove assign stmt in code generation in cute
- [CI][Fix] trap heartbeat logging to ensure it exits for any failed tests
- [IR][Runtime] Support pointer type symbolic variable
- [BUG] Fix the complex expr that shows up in shape
- [Hopper] add a cost model
- [CI][Fix] use same fix for build-docs in all test workflows
- [BUG] Fix cuBLAS error that occurred when serving `Llama-3.1-8B` model
- [BUG] Fix `torch.nn.functional.group_norm` implementation
- [Utils] Add utility function to launch compute-sanitizer
- [FEATURE] Various Operator Support + Bug Fixes
- [Feature] add wgmma fence operand
- [Hidet Script] Add support for lambda and fix assignment issue
- [BUG] fix flatten tensor index pass
- [Wheel] Fix a bug when we install the package with `pip install .`
- [Fix][Primitives] fix `cp_async_bulk_tensor_s2g`
- [BUG] Fix broken mbarrier CI test
- [PERF] Add graph rewrite rule: `Transpose(B) + Matmul -> MatmulNT`
- [FEATURE] Add support for warp specialization context managers
- [Docs] Update the copyright year
- [Fix] fix reduce test failure on H100.
- [CI][PR Title] Allow multiple categories in PR title
- [CI] fix test workflow gpu params for push event github action
- [CI] Use hidet api in benchmarking for Llama MLP layer
- [CI] run Tests workflow on l4, h100 by default, only require l4 success
- [CI] add extra logging to build-docs, set lower priority for make step
- [PERF] Minimized version of dispatch table with options
- [Operators] Add support for `operator.floordiv`
- [Fixbug] Fix a bug in runtime that does not update workspace size
- [FIX] Fix deploy doc workflows
- [BUG] Avoid comp of `view` during call `Tensor.torch` for fp8 tensor
- [BUG] Fix choosing stream in `hidet.Event.record()`
- [PROJECT] Create `pyproject.toml`
- [FEATURE] Automatically deploy docs to website
- [FEATURE] Add fp8 wgmma, mma support
- [HIP] Add HipGraph class and related HIP graph functionalities
- [BUGFIX] Fix vLLM backend parallel build failure
- [FEATURE] Add fp8 (e4m3,e5m2) support
- [AMD] mma for float16 and float32 on AMD GPU (gfx90a/MI200s)
- [FEATURE] Increase the accuracy of benchmarking of small kernels
- [HIP] Switch to using HIP Python for HIP runtime wrappers
- [Bug] Fix the way to lint the python source code
- [AMD] Support batch matmul with matrix core
- [FEATURE] Tensor `view` operator
- [CI] Use torch 2.6.0 for Perf Tests
- [AMD] mfma instructions for gfx90a
- [FEATURE] In compilation server clean memory after every compilation
- [FEATURE] Use permanent processes to handle fixed commits in the compilation server
- [HIP] Support f32 matmul and llama end to end example
- [Format] Show progress bar for formatting process
- [HIP] Resnet end to end example
- [COMPTIME] Parallel task build + parallel tuning
- [CI] Make `Linear` without `bias` for Regression
- [Workflow] Fix bug in PR title checking workflow to allow `xxx`
- [CI] Fix attention mask shapes in regression
- [CI] Update github action version
- [CI] Split the tests of operators into two folders to speed up CI
- [PERF] Move parallel_k to the search space of hexcute matmul kernel
- [BUG] Fix for torch==2.6.0. Attempt 2
- Add parallel_k to the tuning space of matmul kernel
- [Enhancement] Support cuBLAS for matmul_nt
- [BUG] Fix for torch==2.6.0
- [FEATURE] Postpone import torch
- [COMPTIME] Preparation to nested parallelization
- [HIP] Autoscheduler
- [HIP] Support HIP for Hidet script
- [FIX] Fix typo: 'compilaion' -> 'compilation'
- [Enhancement] support important dynamic patterns in LLM
- [HIP] Event and stream support for HIP runtime
- [Test] Check hidet import time
- [BUG] Change the broken test cases for scatter_ operators
- [HIP] Hip runtime - memory and device
- [CI] Add `Linear` to Regression
- [Enhance] Extend search space of hexcute matmul & turn it on by default
- [Workflow] Add a workflow for PR title checking
- [CI] Modification of generative Regression tests
- [FEATURE] Disable garbage collector during benchmarking
- [OPS] Several changes resulting from debugging `torch.compile(Sampler.forward())`
- [Fix] add scale argument to sdpa function
- [CI] Skip `test_matmul_bf16_sm90` on non-Hopper GPUs
- [PERF] Change GPU clock frequency for benchmarking inside hidet
- Kaihang/matmul bf16 wgmma swizzle
- [DLPack] Remove workaround for bool type in dlpack
- [Utils] Enable faulthandler in hidet to print traceback on segmentation fault
- [Tests/CI] Update tests and add a temporary AMD CI
- [RUNTIME] Add missing try-catch guards
- [vllm] Use `example_inputs` to determine shapes
- Add support for `torch.empty_like`
- [Bug] Fix an error in cudnn runtime calls
- [BUG] Fixing a bug caused by parallel compilation with on-demand WGMMA instruction registration
- [DataType] Add `float8_e4m3` data type
- [Stream] Change the impl of `get_current_stream`
- [REFACT] Refactoring parallel compilation/tuning
- [PERF] Reduce the execution time of `import hidet`
- Kaihang/matmul f16 wgmma with swizzle layout
- Change fast div transform log level to debug
- [COMPSERVER] Make the same port of compilation server for server and client
- [CI] Remove `batch_matmul` tests from Regression
- [COMPSERVER] Speed up compilation server
- [CI] Update docker image for Regression -> update to torch v2.5.1
- [OPs] hardtanh inplace variant
- [Enhancement] Add option and functionality to set torch stream as the current stream
- [BUG] Several changes inspired by Release
- add a flag for hexcute kernels
- Update requirements.txt
- [CI] chore: normalize ci runner naming
- [BUG] Hot fix of comp server requirements.txt
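As background on the `tempfile.mktemp()` security fix above: `mktemp()` only returns a candidate path without creating the file, leaving a race window in which another process can create that path first. A minimal sketch of the standard stdlib replacement pattern (illustrative only, not the actual Hidet patch; the `.sync` suffix and file contents are made up):

```python
import os
import tempfile

# Insecure: mktemp() returns a name but does not create the file,
# so another process can race to create it first (TOCTOU vulnerability).
# path = tempfile.mktemp()

# Secure: mkstemp() atomically creates and opens the file with mode 0o600.
fd, path = tempfile.mkstemp(suffix=".sync")
try:
    with os.fdopen(fd, "w") as f:  # take ownership of the open descriptor
        f.write("sync-data")
    with open(path) as f:
        assert f.read() == "sync-data"
finally:
    os.unlink(path)  # caller is responsible for cleanup
```

Unlike `NamedTemporaryFile`, `mkstemp()` leaves deletion to the caller, which is why the NCCL sync-file fix above also adds a signal handler to clean the file up on abnormal exit.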
Full Changelog: v0.5.0...v0.6.0