What's Changed
- [Bugfix][Build] Update CMake configuration to remove project root injection for sys.path by @LeiWang1999 in #1385
- [BugFix] Fix split kernel layout bug of GQA decode by @tzj-fxz in #1386
- [Feat] Add better repr print for Layout and Fragment by @kurisu6912 in #1392
- [Doc] Logging docs for Tilelang/TVM by @SiriusNEO in #1395
- [Enhancement] Refactor inflight computing to support dynamic pipeline extents by @LeiWang1999 in #1399
- [AMD] Fix 3 bugs when build docker on amd mi3x gpu by @danielhua23 in #1401
- [Typo] Fix tilelang link in README.md by @senlyu163 in #1402
- [Dependency] Update apache-tvm-ffi version to >=0.1.2 by @LeiWang1999 in #1400
- [AMD] Enable FA2 fwd on AMD MI300X by @danielhua23 in #1406
- [Typo] fix typo for SM120 by @Cunxiao2002 in #1408
- [Doc] Minor documentation update by @LeiWang1999 in #1410
- [Dependency] Add torch-c-dlpack-ext to project requirements by @LeiWang1999 in #1403
- [Bugfix] Alloc `T.make_tensor` not on the top of prim_func by @LeiWang1999 in #1412
- [Enhancement] Introduce `T.__ldg` by @LeiWang1999 in #1414
- [Enhancement] Improve vectorization invariant check by @LJC00118 in #1398
- [Lint] Phaseout Yapf format and embrace ruff format by @LeiWang1999 in #1417
- [Atomic] Use ptr for atomicAdd dst instead of reference by @LeiWang1999 in #1425
- [CUDA] Add read-only parameter annotation for CUDA codegen by @LeiWang1999 in #1416
- [Refactor] Phase out the primitives folder since its design has been merged into tileop by @LeiWang1999 in #1429
- [CI]: Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #1431
- [CI]: Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #1432
- [Bugfix] Convey `compile_flags` to ffi compilation path with pass_configs by @LeiWang1999 in #1434
- [Enhancement] Improve buffer usage tracking in MakePackedAPI by @LeiWang1999 in #1435
- [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice by @SiriusNEO in #1405
- [Enhancement] Include PrimFunc name in memory cache logs for better debugging by @LeiWang1999 in #1437
- [CI] Update lint dependencies and fix lint on trunk by @XuehaiPan in #1433
- [Enhancement] Refactor vectorization checks in loop_vectorize by @LeiWang1999 in #1440
- [Enhancement] Implement vectorized FP8 to FP32 cast by @LJC00118 in #1438
- [Feature] Support region as input of T.cumsum by @Dayuxiaoshui in #1426
- [Fix] Fix analyzer bind conflicting bug in #1442 by @kurisu6912 in #1446
- [Refactor] Reduce direct dependency on PyTorch due to its limited type support by @LeiWang1999 in #1444
- [Refactor] Use `pytest.mark.parameterize` to speedup parallel testing by @kurisu6912 in #1447
- [Docs] Improve installation instructions for developers by @SiriusNEO in #1450
- [Feat] Integrate Z3 in TVM Arith Analyzer by @kurisu6912 in #1367
- [Bugfix] Improve autotune from elementwise_add function in examples by @senlyu163 in #1445
- [Language] Introduce `T.annotate_restrict_buffers` by @LeiWang1999 in #1428
- [Analyzer] Require loop extent > 0 when entering loop (#1012) by @kurisu6912 in #1451
- [BugFix] Update CI to ROCm-7.1 by @Gongen-Ali in #1449
- [Enhancement] Update examples and tests for improved type handling functionality by @LeiWang1999 in #1448
- [Issue Template] Enable blank issues in GitHub issue template by @LeiWang1999 in #1453
- [CI] Moved the clang-tidy step to after pip install by @LeiWang1999 in #1456
- [Bug] Fix tvm build script when patchelf is not found by @kurisu6912 in #1459
- [Analyzer] Fix floordiv & floormod bug in z3 prover by @kurisu6912 in #1458
- [Cache] Rename sparse compress cache directory by @LeiWang1999 in #1460
- [Language] Adds a random number generation capability through curand_kernel by @silentCoder-dev in #1461
- remove unused duplicated type check by @sgjzfzzf in #1462
- feat(cutedsl): add CuTeDSL backend by @lucifer1004 in #1421
- [Refactor] Rename test for curand & add triton baseline in `test_tilelang_language_rand.py` by @silentCoder-dev in #1464
- [ArgBinder] Enhance shape variable handling and assertions by @LeiWang1999 in #1467
- [Language] Make TL scripts friendly to Python syntax highlights by @SiriusNEO in #1466
- [Refactor] Remove triton dependence in testing & move triton baseline into examples by @silentCoder-dev in #1470
- [Language] Enhance T.dtype.as_torch conversion for compatibility by @LeiWang1999 in #1473
- [News] update with latest news by @LeiWang1999 in #1475
- [Enhancement] Use static Z3 context by @LeiWang1999 in #1482
- [Enhancement] Enhance let binding handling in layout inference and warp specialized pass by @LeiWang1999 in #1484
- [Refactor] Phaseout PassConfig `kDisableDynamicTailSplit` and `kDynamicAlignment` as they are legacy by @LeiWang1999 in #1486
- [Enhancement] Optimize the time cost of critical path for IntervalSetEvaluator by @LeiWang1999 in #1491
- [CI] Add performance regression test script by @xwhzz in #1489
- Pin nvidia-cutlass-dsl to 4.3.3 by @lucifer1004 in #1497
- [Language] Remove ConstIf Frame for Better Meta-Programming by @kurisu6912 in #1496
- [Bugfix][CI] Fix concurrency bug in regression test workflow by @xwhzz in #1500
- [Refactor] Phaseout legacy `alloc_local` statement in examples and introduce processing for floating fragment buffers by @LeiWang1999 in #1495
- [Enhancement] Optimize MHA varlen fwd and support autotune by @Rachmanino in #1499
- [Enhancement] Refactor CUDA vectorized cast generation and remove unsupported FP8 type by @LJC00118 in #1474
- [Dependency] Update apache-tvm-ffi to >=0.1.6 for memory safety when gc is not enabled by @LeiWang1999 in #1502
- Update cutedsl docs and version check by @lucifer1004 in #1503
- [Misc] configure pymarkdown by @lucifer1004 in #1505
- [Language] Fix gemm syntax highlight by @SiriusNEO in #1476
- [Fix] Fix TL_ENABLE_PTXAS_VERBOSE_OUTPUT has no effect in tvm-ffi by @kurisu6912 in #1511
- [Refactor] Phaseout execution_backend `ctypes` by @LeiWang1999 in #1510
- [Testing] Add Memory Leak Test by @kurisu6912 in #1516
- [Refactor] Support auto swizzling for tma store and phaseout related layout annotations by @LeiWang1999 in #1509
- [CuTeDSL][Fix] thread safety + context safety by @lucifer1004 in #1513
- [BugFix] Phaseout unused tests for gqa decode kernels and add the kernels to CI by @tzj-fxz in #1515
- [Cleanup] Remove unnecessary macros in tilelang examples by @Rachmanino in #1514
- Fix ramp_lanes calculation in CUDA codegen by @LJC00118 in #1518
- [Misc] add env for default target/backend/verbose by @lucifer1004 in #1512
- [Dtype] Improve host codegen handling for subtype by @LeiWang1999 in #1517
- [Bugfix] Fallback to a Linear Layout instead of raising errors by @LeiWang1999 in #1521
- Use `TargetIsCuda` for all cuda target by @oraluben in #1522
- Fix fp4 pointer arithmetic in CUDA codegen by @LJC00118 in #1524
- [Enhancement] Improve GitHub Actions permissions check and refine performance regression testing by @xwhzz in #1519
- [Release] Bump version into 0.1.7.post1 by @LeiWang1999 in #1506
- [Pipeline] Refactor buffer allocation in Inject Pipeline Pass by @LeiWang1999 in #1525
- [Dev] Fix when build local version with isolated build by @oraluben in #1487
- [Bugfix] Skip stride check for subtype by @LeiWang1999 in #1531
- [Lint] Enable whitespace and permission bit hooks by @XuehaiPan in #1439
- [Enhancement][Tool] Tree-style pretty ASTPrinter by @SiriusNEO in #1468
- [Fix] Add support for non-var complement arithmetic computation (#1374) by @kurisu6912 in #1533
- [BugFix] Complete vectorized loading for common dtypes by @SiriusNEO in #1536
- [Compat] Add CUDA version check for __nv_fp8_e8m0 type by @LeiWang1999 in #1537
- [BugFix] Fix bugs of varlen attention forward examples caused by `S_q != S_kv` by @hukongyi in #1530
- [Bug] Fix hanging from reduction on sm120 by @PannenetsF in #1540
- [example] use T.dynamic instead of tvm.te.var by @botbw in #1538
- [Enhancement] Refactor KernelCache to use inheritance-based design by @sgjzfzzf in #1483
- [Bugfix] Avoid considering `local.var` buffer as `local` by @LeiWang1999 in #1541
- [Bugfix] Fix of `T.Fill` for local.var by @LeiWang1999 in #1543
- [Z3] Change z3 timeout to rlimit for deterministic prove behavior by @kurisu6912 in #1542
- [Feat] Adapt gemm v2 for cutedsl backend by @lucifer1004 in #1544
- [Enhancement] Support larger `H` in deepseek sparse mla backward via split-H by @Rachmanino in #1548
- [Bugfix] Fix regression test to use installed package instead of source directory by @xwhzz in #1550
- [Refactor] Introduce layout annotations for `ParallelOPNode` and `CopyNode` by @LeiWang1999 in #1539
- [Script] Provide regression test script to help benchmark regression in local env by @LeiWang1999 in #1551
- [Typing] Update Kernel signature and add type hints for buffer operations by @clouds56 in #1545
- [CI]: Bump actions/upload-artifact from 4 to 6 by @dependabot[bot] in #1555
- [Refactor] Use cuda capability from torch to be more generic by @oraluben in #1557
- [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #1556
- [Host] Provide post process to customize host code and enhance nullable check by @LeiWang1999 in #1562
- [Release] Build tilelang against CUDA 13.1 in CI by @oraluben in #1532
- [LazyJIT] Move Type Annotations to Function Body by @kurisu6912 in #1480
- [bugfix] fix missing clear_accum logic for gemm_sp_v2 by @botbw in #1563
- [Misc] Remove unused `tl_pipeline_sync`. by @c8ef in #1566
- [Refactor] Improve scalarization handling in Pass VectorizeLoop by @LeiWang1999 in #1565
- [Refactor] Simplify do_bench calls by using default warmup and rep parameters by @LeiWang1999 in #1568
- [CI] Refactor PR regression test job conditions by @xwhzz in #1569
- [Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition by @LeiWang1999 in #1559
- [Refactor] Enhance deterministic ordering in shared memory allocation merge. by @LeiWang1999 in #1570
- [Enhancement] Improve equality checks in layout nodes and fragment validation by @LeiWang1999 in #1573
- [Feature] add kUseCooperativeLaunch tag for tvm_ffi by @silentCoder-dev in #1572
- [Refactor] Remove unnecessary logging configuration in Analyzer.py by @LeiWang1999 in #1574
- [Release] Bump version to 0.1.7.post2 by @LeiWang1999 in #1575
- [BugFix] Change default rounding mode for fp4 conversions by @LJC00118 in #1580
- [CI] Add CUDA-aware pytest scheduler + auto workers by @LeiWang1999 in #1584
- [Enhancement] Improve performance regression output with timing and streaming by @xwhzz in #1585
- [Bugfix] Add kernel_global_source property to TVMFFIKernelAdapter by @haok1402 in #1589
- [BugFix] Add PrimExpr substitution support for AttrStmt nodes by @LJC00118 in #1583
- [BugFix] fix tcgen5mma example by @Rachmanino in #1577
- [Refactor] Use access_ptr instead of buffer and offsets for cp async params by @LeiWang1999 in #1590
- [Layout] Support annotating loop layout in frontend by @LeiWang1999 in #1579
- [Typo] Rename loop layout annotation test by @LeiWang1999 in #1596
- [Fix] Add register to read A ptr in `test_tilelang_language_cooperative.py` by @silentCoder-dev in #1593
- [Feat] PDL Support by @w169q169 in #1494
- [Enhancement][Subtype] Enhance symbolic shape/stride handling for subtype by @LeiWang1999 in #1599
- [Fix][CuteDSL] add support for tanh/tanhf (fixes #1595) by @lucifer1004 in #1597
- [Release] Fix race condition when publishing by @oraluben in #1578
- Add conversion from cutlass::float_e4m3/e5m2 to tl::float_e4m3/e5m2 by @LJC00118 in #1600
- [Enhancement][AMD] Add preshuffle fp8 gemm example on amd. by @Gongen-Ali in #1605
- [Bugfix] Mangle Single Precision Mathematical Functions of cuda math api by @silentCoder-dev in #1602
- [Bugfix] Open Rocm ci test and fix some bugs. by @Gongen-Ali in #1443
- [Feature] Add more curand operations & support vectorization by @silentCoder-dev in #1582
- [Enhancement] Allow `import tilelang` on CPU-only machines without CUDA libraries by @XuehaiPan in #1481
- [BugFix] Add pre-commit to requirements-dev.txt by @asaadkhaja99 in #1611
- [BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop by @SiriusNEO in #1607
- [Feat] Add strong checker to detect data racing in T.Parallel by @kurisu6912 in #1615
- [Feature] add `T.sync_warp` & `T.shfl_sync`; change extern pdl into intrin by @silentCoder-dev in #1614
- [RaceChecker] RaceChecker report warning rather than error for backward compatibility by @kurisu6912 in #1620
- [BugFix] Fix `ForwardRef` usage in v2 frontend (#1619) by @kurisu6912 in #1621
- [Refactor] Move `ConstrVisitor` to `src/transform/common/constr_visitor.h` for reuse by @silentCoder-dev in #1622
- [Feat] Improve `T.reduce_absmax` to use less abs call by @kurisu6912 in #1626
- [Bugfix] Do not consider local.var as local buffer during LowerTileOP by @LeiWang1999 in #1628
- [Feature] Add hoist_broadcast_values pass by @silentCoder-dev in #1606
- [Enhancement][CUDA] Support `nvidia-cuda-nvcc` as `nvcc` by @clouds56 in #1528
- [Bugfix] Fallback into full region when dynamic buffer read region cannot be proved by @LeiWang1999 in #1618
- [Feat] Allow print macro call stack in device assert by @kurisu6912 in #1616
- [BugFix] Correct index_map selection for transposed A matrix in MFMA Layout with `k_dim==4` and open rocm-ci for gemmsr by @benenzhu in #1627
- [Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3.2 by @hammersam in #1636
- [Bugfix] Introduce a flag to avoid unnecessary broadcast hoist and enable for let stmt by @LeiWang1999 in #1638
- [Refactor][CI] Reduce sparse related test time by @LeiWang1999 in #1637
- [Refactor] Unify @jit and @lazy_jit into a single @jit decorator by @LeiWang1999 in #1632
- [Bugfix] Fix pdl related intrin handling to avoid strict annotation codegen by @LeiWang1999 in #1650
- [Bugfix] reverted unexpected tvm changes by @LeiWang1999 in #1651
- [Bugfix] reverted unexpected tvm changes by @LeiWang1999 in #1652
- [Refactor] Move dtypes.py from eager to language and add bits/bytes properties by @LeiWang1999 in #1646
- [Feat] Allow dangling producer in wasp pipeline planning (#1263) by @kurisu6912 in #1647
- [bugfix] fix smem alloc for single warp reduce by @botbw in #1643
- [Example] Add attention sink varlen examples by @Rachmanino in #1645
- [ASTPrinter] Fix IfThenElse printing and some format problems by @SiriusNEO in #1640
- [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1610
- [Enhancement] Update LetStmtNode handling in loop vectorization to support variable binding overrides by @Rachmanino in #1649
- [Example] Remove redundant T.copy in `examples/deepseek_v32/sparse_mla_fwd.py` by @GoldenStain in #1634
- [CUDA] Introduce simulated load/store 256bits access for CUDA compatibility by @LeiWang1999 in #1656
- [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case by @LeiWang1999 in #1654
- [Bugfix] Fix missing annotations for default CallNode Visitor by @LeiWang1999 in #1659
- [Clean] Remove unnecessary debug print by @LeiWang1999 in #1661
- [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies by @LeiWang1999 in #1657
- [Refactor] Improve CallNode handling to include annotations in various operations by @LeiWang1999 in #1663
- [EagerJIT] Add Support for Parameter Only Kernel Compilation by @kurisu6912 in #1664
- [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program by @KEKE046 in #1639
- [Feature] Support `cp.reduce.async.bulk.tensor` by @Rachmanino in #1667
- chore: update CI cutedsl version to 4.3.5 by @lucifer1004 in #1665
- [CUDA] Enhance Broadcast Codegen for Symbolic Value by @LeiWang1999 in #1669
- [EagerJIT] Fix bug in handling of positional arguments by @kurisu6912 in #1675
- [Feature] Reimplement `Threadsync` with `ConstrVisitor` by @silentCoder-dev in #1631
- [Clean][Refactor] Phaseout Legacy Pass `ParallelLoopTransformer` by @LeiWang1999 in #1672
- [Feature] Atomic Reduction Operations and Vectorization Enhancement by @LeiWang1999 in #1676
- [Refactor] Move AtomicAdd Vectorization to VectorizeLoop Pass by @LeiWang1999 in #1677
- [Bugfix] Relax region analysis for complex expression by @LeiWang1999 in #1679
- [Example] Add example for mHC inference kernels. by @Elevator14B in #1684
- [Analyzer] Fix missing assume in tvm analyzer by @kurisu6912 in #1680
- Refactor: Use centralized do_bench from tilelang.profiler by @LeiWang1999 in #1670
- [Feature] Introduce DecoupleTypeCast pass for mixed-precision vectorization by @LeiWang1999 in #1644
- [Release] Bump Version into v0.1.7.post3 by @LeiWang1999 in #1685
- [Release] Fix release wheels by @oraluben in #1687
- [BUG] Fix dsa_sparse_finetune/sparse_mla_bwd.py bug by @xiuhu17 in #1588
- [Bugfix] Reorganize pass for `thread_sync` by @silentCoder-dev in #1682
- [BugFix] fix warning on deepseek_v32 topk_selector.py by @sgjzfzzf in #1681
- [tvm-ffi] Enable tvm-ffi for metal backend by @oraluben in #1289
- [Analyzer] Fix missing assume in tvm analyzer by @LJC00118 in #1695
- [Chore] Use python-side control flow keywords in examples for consistency by @Rachmanino in #1692
- [Bugfix][Refactor] Always disable light storage reuse by @LeiWang1999 in #1691
- [Enhancement] Log warnings for OOB accesses to non-global buffers by @SiriusNEO in #1693
- Enhance loop vectorization logic for CallNode handling by @LeiWang1999 in #1696
- [BugFix] Fix JITKernel export_library bug by @chengyupku in #1699
- [Enhancement] Handle vectorizable calls by @LeiWang1999 in #1700
- [BugFix] Fix unsafe visit else case under WarpSpecializationScope by @SiriusNEO in #1702
- [Enhancement] Use `cute::elect_one_sync()` for slightly better performance by @Rachmanino in #1703
- [Enhancement] Remove `RewriteUnsafeSelect` Pass by @LJC00118 in #1705
- [BugFix] Corrected when proving loop layout contains a fragment buffer layout by @LeiWang1999 in #1708
- [Bugfix] Improve robustness of ProveFragmentContains with fully replicated layout by @LeiWang1999 in #1709
- [BugFix] Add int64_t support for AtomicAdd by @LeiWang1999 in #1716
- [Refactor] Introduce GemmInst enumeration and update warp partitioning logic by @Rachmanino in #1707
- [Refactor] Phaseout unnecessary checks for pr #1707 by @LeiWang1999 in #1721
- [Refactor] re-implement vector subtype and its access method by @LeiWang1999 in #1722
- [EagerJIT] Lazy Evaluation of Kernel Body in Eager JIT (#1690) by @kurisu6912 in #1694
- [Enhancement] Legalize subtype access by @LeiWang1999 in #1724
- [EagerJIT] Enhance auto inference of lazyjit and eager jit by @kurisu6912 in #1704
- [Refactor] Enhance variable substitution in device function generation by @LeiWang1999 in #1723
- [Bugfix] Fix incorrect alignment of vectorized subtype by @LeiWang1999 in #1726
- [Enhancement] Add explicit global memory load/store intrinsics (ldg/stg 32/64/128) by @LeiWang1999 in #1717
- [Refactor] Remove external buffer conflict check in pipeline injection by @LeiWang1999 in #1727
- [Refactor] Relocate layout transformation of `ptx_stmatrix` by @LeiWang1999 in #1689
- [AMD] Add MI350/MI355 FP8 support by @hubertlu-tw in #1718
- [Bugfix] revert incorrect fast path for parallel layout inference by @LeiWang1999 in #1730
- [Example] Add KDA algorithm implementation in tilelang by @wfloveiu in #1660
- [Feature] Support E8M0 related type conversion and vectorized cast by @SiriusNEO in #1731
- [BugFix] Remove unnecessary binding in loop variable analysis and add test for issue 1728 by @kurisu6912 in #1735
- Add swizzle layout detection and automatic merging for layout conflicts by @LeiWang1999 in #1736
- [Bugfix] Handle offset handling for subtype ptr by @LeiWang1999 in #1738
- [EagerJIT] Allow dummy parameter in jit kernel by @kurisu6912 in #1737
- [Feature] Add build date to version metadata by @LeiWang1999 in #1742
- [BugFix] Fix FP4 related vectorized cast by @chaospointer in #1741
- [Refactor] Disable Predicated LDG PTX Lowering by default by @LeiWang1999 in #1739
- [Layout] Fix Layout Bugs in Parallel and Reduce by @kurisu6912 in #1713
- [fix]: fix deepseek_mla amd example and add aiter mla compare test by @ZiguanWang in #1740
- [Refactor] Enhance `T.alloc_barrier` with new features and deprecate legacy mbarrier related intrinsics by @Rachmanino in #1733
- [BugFix] Fix several bugs in CodeGen for CuTeDSL backend by @Rachmanino in #1746
- Update import for compare_tensors from test_utils_kda by @pmixer in #1748
- [Lint] Remove diff arguments in Ruff and sync some versions by @SiriusNEO in #1751
- [Refactor] Rename EagerJIT examples to avoid confusion by @SiriusNEO in #1750
- [AMD] Fix ROCm FP8 dtype selection and MFMA support on gfx942/gfx950 by @hubertlu-tw in #1743
- [Feature] Support message-only debug print by @Rachmanino in #1755
- [EagerJIT] Update README example to eager jit by @kurisu6912 in #1752
- [BugFix] Stride check and fix for tensors with zero-stride argument by @tzj-fxz in #1749
- [BugFix] Always build guard in loop partitioning to prevent out-of-bounds access by @LeiWang1999 in #1756
- [Tool] Add tool to print fragment in thread value view by @kurisu6912 in #1759
- [Enhancement] Add dynamic symbolic constraints support for Profiler benchmarking by @LeiWang1999 in #1753
- [ThreadSync] Use Z3 for constraint equivalence checking by @LeiWang1999 in #1760
- [Feature] Implement LoopUnswitching Pass by @chengyupku in #1747
- [Chore] Remove unnecessary log from z3 by @Rachmanino in #1763
- [Bugfix] Revert the initial value of Z3 SetRLimit by @LeiWang1999 in #1765
- [Feature] Enhance Loop Unswitching with Let Binding and Condition Handling by @LeiWang1999 in #1766
- [Bugfix] Add predicate to loads inside predicated stores in LowerLDGSTG pass by @LeiWang1999 in #1767
- [Feature] Add PassConfig for Controlling Let Statement Inlining in Simplify Pass by @LeiWang1999 in #1769
- [Fix] Change ue8m0 default round mode to cudaRoundPosInf by @SiriusNEO in #1770
- [Feature] Support tcgen5mma lowering for `.kind::i8` by @Rachmanino in #1764
- [Refactor] Unify the usage of cast-related operators by @SiriusNEO in #1757
- [Bugfix] Copy pass_configs dict to prevent mutation across multiple JIT compilations by @LeiWang1999 in #1776
- [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1775
- [Refactor] Improve type annotations and reduce some lint errors in frontend by @SiriusNEO in #1777
- Update TVM: fix select/if_then_else out-of-bounds access by @LeiWang1999 in #1783
- [Feature] Add fully replicated layout interface in annotation layout by @tzj-fxz in #1772
- [Example][BugFix] Fix arguments override in deepseek_v32 topk_selector by @ljwljwljwljw in #1784
- [BugFix] Fix reduce_sum with clear=False not accumulating correctly by @ShaobinChen-AH in #1778
- fix(intrinsics): add missing _legalize_to_buffer_region in SM70 emitter by @Coloured-glaze in #1786
- [Enhancement] Enhance register vectorize inference by @LeiWang1999 in #1785
- [Bugfix] Fix thread storage sync conflict detection for loop carry write-after-read by @LeiWang1999 in #1781
- [Fix] cython 3.0 generates incorrect code for python stable api by @oraluben in #1789
- [BugFix] Update buffer access in TensorCoreIntrinEmitter to handle variable dimensions correctly by @xwhzz in #1794
- [ThreadSync] Skip (tx1 != tx2) checking for loop carry analysis by @LeiWang1999 in #1795
- [Feature] Add option to disable out-of-bound access warnings in safe memory access legalization by @kurisu6912 in #1797
- [Docs] Add Python Compatibility document of TileLang by @LeiWang1999 in #1745
- [Refactor] Reorganize ParallelOp code structure and move ProveFragmentContains to layout utils by @LeiWang1999 in #1779
- [Feature] Support passing PrimExpr value in tile-level atomic operation by @SiriusNEO in #1796
- [Bugfix] Support loop-dependent conditions in IfThenElse within T.Pipelined by @ljwljwljwljw in #1799
- [BugFix] Missing Recursive Loop Var Checking in Loop Unswitching by @kurisu6912 in #1801
- Fix a 3.9 issue. add `_typing.py` to dist check by @oraluben in #1803
- [Docs][Puzzles] Add TileLang puzzles in README by @SiriusNEO in #1806
- [Docs] Hotfix wrong link by @SiriusNEO in #1807
- [Enhancement] Improve plot_layout visualization for Layouts by @LeiWang1999 in #1811
- [Feat] profiler support cudagraph backend by @cscyuge in #1658
- Handle staled autotune state with tvm-ffi adapter. by @haok1402 in #1812
- [BugFix] LoopUnswitching: gate non-trivial else behind PassConfig by @LeiWang1999 in #1816
- [Release] Update dependencies to resolve several issues by @oraluben in #1817
- [BugFix] Fix fp16 annotate_l2_hit_ratio host stub compilation (issue #1810) by @LeiWang1999 in #1818
- [Bugfix] Remove mistaken coalesced_width parameter in regression test of fusedmoe kernel by @xwhzz in #1820
- [Release] Add build for python 3.14t by @oraluben in #1805
- Fix: treat kParallel as serial when vectorizing by @LeiWang1999 in #1819
- [Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compatible wheels) by @LeiWang1999 in #1821
- [Analyzer] Add SideEffect Checking in ConstIntBound Analyzer by @kurisu6912 in #1824
- [Bugfix] Fix ast builder error for `value -= 1` by @LeiWang1999 in #1825
- [Release][Build] Merge libtilelang and libtilelang_modules by @oraluben in #1814
- [Bugfix] Fix threadIdx variable lookup by thread_tag instead of position in ThreadSync by @LeiWang1999 in #1829
- [Docs] Update nightly build installation instructions in README and Installation guide by @xwhzz in #1830
- [BugFix] Reset cur_expect_idx_ correctly for multi-kernel TMA barrier injection by @ColmaLiu in #1828
- [Refactor] Treat `local.var` as `local` buffers when deciding vectorization for stable actions by @LeiWang1999 in #1835
- Fix tilelang global load/store template by @LJC00118 in #1837
- [Refactor] Introduce `T.access_of` to combine `T.address_of` and `access_ptr` by @LeiWang1999 in #1827
- [CUDA][Feature] Add packed FP32x2 math intrinsics and auto vectorized support by @LeiWang1999 in #1839
- [Example][BugFix] 1SM GEMM example on Blackwell and fix handling of `mbar` by @Rachmanino in #1774
- [Feature] Hierarchical reduction and warp reduction intrinsics support by @tzj-fxz in #1762
- [Dist][Release] Use one wheel for different CUDA version by @oraluben in #1826
- [Enhancement] Optimize templates for half/bfloat16 by @LJC00118 in #1845
- ThreadSync: avoid barriers between atomic ops by @LeiWang1999 in #1852
- [BugFix] Fix eager mode where there is no tensor args by @Rachmanino in #1851
- [AMD] Fix bugs about AMD FA kernel by @danielhua23 in #1701
- Add an example: mHC residual projection backward by @Da1sypetals in #1758
- [Release] Bump version into v0.1.8 by @LeiWang1999 in #1853
New Contributors
- @danielhua23 made their first contribution in #1401
- @senlyu163 made their first contribution in #1402
- @Dayuxiaoshui made their first contribution in #1426
- @silentCoder-dev made their first contribution in #1461
- @sgjzfzzf made their first contribution in #1462
- @hukongyi made their first contribution in #1530
- @clouds56 made their first contribution in #1545
- @c8ef made their first contribution in #1566
- @haok1402 made their first contribution in #1589
- @w169q169 made their first contribution in #1494
- @asaadkhaja99 made their first contribution in #1611
- @hammersam made their first contribution in #1636
- @GoldenStain made their first contribution in #1634
- @KEKE046 made their first contribution in #1639
- @xiuhu17 made their first contribution in #1588
- @hubertlu-tw made their first contribution in #1718
- @wfloveiu made their first contribution in #1660
- @chaospointer made their first contribution in #1741
- @ZiguanWang made their first contribution in #1740
- @pmixer made their first contribution in #1748
- @ljwljwljwljw made their first contribution in #1784
- @ShaobinChen-AH made their first contribution in #1778
- @Coloured-glaze made their first contribution in #1786
- @cscyuge made their first contribution in #1658
- @ColmaLiu made their first contribution in #1828
- @Da1sypetals made their first contribution in #1758
Full Changelog: v0.1.7...v0.1.8