v0.1.8

Latest

@LeiWang1999 released this 16 Feb 14:05
· 40 commits to main since this release
41b2552

What's Changed

  • [Bugfix][Build] Update CMake configuration to remove project root injection for sys.path by @LeiWang1999 in #1385
  • [BugFix] Fix split kernel layout bug of GQA decode by @tzj-fxz in #1386
  • [Feat] Add better repr print for Layout and Fragment by @kurisu6912 in #1392
  • [Doc] Logging docs for Tilelang/TVM by @SiriusNEO in #1395
  • [Enhancement] Refactor inflight computing to support dynamic pipeline extents by @LeiWang1999 in #1399
  • [AMD] Fix 3 bugs when building Docker on AMD MI3x GPU by @danielhua23 in #1401
  • [Typo] Fix tilelang link in README.md by @senlyu163 in #1402
  • [Dependency] Update apache-tvm-ffi version to >=0.1.2 by @LeiWang1999 in #1400
  • [AMD] Enable FA2 fwd on AMD MI300X by @danielhua23 in #1406
  • [Typo] fix typo for SM120 by @Cunxiao2002 in #1408
  • [Doc] Minor documentation update by @LeiWang1999 in #1410
  • [Dependency] Add torch-c-dlpack-ext to project requirements by @LeiWang1999 in #1403
  • [Bugfix] Alloc T.make_tensor not on the top of prim_func by @LeiWang1999 in #1412
  • [Enhancement] Introduce T.__ldg by @LeiWang1999 in #1414
  • [Enhancement] Improve vectorization invariant check by @LJC00118 in #1398
  • [Lint] Phaseout Yapf format and embrace ruff format by @LeiWang1999 in #1417
  • [Atomic] Use ptr for atomicAdd dst instead of reference by @LeiWang1999 in #1425
  • [CUDA] Add read-only parameter annotation for CUDA codegen by @LeiWang1999 in #1416
  • [Refactor] Phase out the primitives folder since its design has been merged into tileop by @LeiWang1999 in #1429
  • [CI]: Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #1431
  • [CI]: Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #1432
  • [Bugfix] Convey compile_flags to ffi compilation path with pass_configs by @LeiWang1999 in #1434
  • [Enhancement] Improve buffer usage tracking in MakePackedAPI by @LeiWang1999 in #1435
  • [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice by @SiriusNEO in #1405
  • [Enhancement] Include PrimFunc name in memory cache logs for better debugging by @LeiWang1999 in #1437
  • [CI] Update lint dependencies and fix lint on trunk by @XuehaiPan in #1433
  • [Enhancement] Refactor vectorization checks in loop_vectorize by @LeiWang1999 in #1440
  • [Enhancement] Implement vectorized FP8 to FP32 cast by @LJC00118 in #1438
  • [Feature] Support region as input of T.cumsum by @Dayuxiaoshui in #1426
  • [Fix] Fix analyzer bind conflicting bug in #1442 by @kurisu6912 in #1446
  • [Refactor] Reduce direct dependency on PyTorch due to its limited type support by @LeiWang1999 in #1444
  • [Refactor] Use pytest.mark.parametrize to speedup parallel testing by @kurisu6912 in #1447
  • [Docs] Improve installation instructions for developers by @SiriusNEO in #1450
  • [Feat] Integrate Z3 in TVM Arith Analyzer by @kurisu6912 in #1367
  • [Bugfix] Improve autotune from elementwise_add function in examples by @senlyu163 in #1445
  • [Language] Introduce T.annotate_restrict_buffers by @LeiWang1999 in #1428
  • [Analyzer] Require loop extent > 0 when entering loop (#1012) by @kurisu6912 in #1451
  • [BugFix] Update CI to ROCm-7.1 by @Gongen-Ali in #1449
  • [Enhancement] Update examples and tests for improved type handling functionality by @LeiWang1999 in #1448
  • [Issue Template] Enable blank issues in GitHub issue template by @LeiWang1999 in #1453
  • [CI] Moved the clang-tidy step to after pip install by @LeiWang1999 in #1456
  • [Bug] Fix tvm build script when patchelf is not found by @kurisu6912 in #1459
  • [Analyzer] Fix floordiv & floormod bug in z3 prover by @kurisu6912 in #1458
  • [Cache] Rename sparse compress cache directory by @LeiWang1999 in #1460
  • [Language] Add a random number generation capability through curand_kernel by @silentCoder-dev in #1461
  • remove unused duplicated type check by @sgjzfzzf in #1462
  • feat(cutedsl): add CuTeDSL backend by @lucifer1004 in #1421
  • [Refactor] Rename test for curand & add triton baseline in test_tilelang_language_rand.py by @silentCoder-dev in #1464
  • [ArgBinder] Enhance shape variable handling and assertions by @LeiWang1999 in #1467
  • [Language] Make TL scripts friendly to Python syntax highlights by @SiriusNEO in #1466
  • [Refactor] Remove triton dependence in testing & move triton baseline into examples by @silentCoder-dev in #1470
  • [Language] Enhance T.dtype.as_torch conversion for compatibility by @LeiWang1999 in #1473
  • [News] update with latest news by @LeiWang1999 in #1475
  • [Enhancement] Use static Z3 context by @LeiWang1999 in #1482
  • [Enhancement] Enhance let binding handling in layout inference and warp specialized pass by @LeiWang1999 in #1484
  • [Refactor] Phaseout PassConfig kDisableDynamicTailSplit and kDynamicAlignment as they are legacy by @LeiWang1999 in #1486
  • [Enhancement] Optimize the time cost of critical path for IntervalSetEvaluator by @LeiWang1999 in #1491
  • [CI] Add performance regression test script by @xwhzz in #1489
  • Pin nvidia-cutlass-dsl to 4.3.3 by @lucifer1004 in #1497
  • [Language] Remove ConstIf Frame for Better Meta-Programming by @kurisu6912 in #1496
  • [Bugfix][CI] Fix concurrency bug in regression test workflow by @xwhzz in #1500
  • [Refactor] Phaseout legacy alloc_local statement in examples and introduce processing for floating fragment buffers by @LeiWang1999 in #1495
  • [Enhancement] Optimize MHA varlen fwd and support autotune by @Rachmanino in #1499
  • [Enhancement] Refactor CUDA vectorized cast generation and remove unsupported FP8 type by @LJC00118 in #1474
  • [Dependency] Update apache-tvm-ffi to >=0.1.6 for memory safety when gc is not enabled by @LeiWang1999 in #1502
  • Update cutedsl docs and version check by @lucifer1004 in #1503
  • [Misc] configure pymarkdown by @lucifer1004 in #1505
  • [Language] Fix gemm syntax highlight by @SiriusNEO in #1476
  • [Fix] Fix TL_ENABLE_PTXAS_VERBOSE_OUTPUT has no effect in tvm-ffi by @kurisu6912 in #1511
  • [Refactor] Phaseout execution_backend ctypes by @LeiWang1999 in #1510
  • [Testing] Add Memory Leak Test by @kurisu6912 in #1516
  • [Refactor] Support auto swizzling for tma store and phaseout related layout annotations by @LeiWang1999 in #1509
  • [CuTeDSL][Fix] thread safety + context safety by @lucifer1004 in #1513
  • [BugFix] Phaseout unused tests for gqa decode kernels and add the kernels to CI by @tzj-fxz in #1515
  • [Cleanup] Remove unnecessary macros in tilelang examples by @Rachmanino in #1514
  • Fix ramp_lanes calculation in CUDA codegen by @LJC00118 in #1518
  • [Misc] add env for default target/backend/verbose by @lucifer1004 in #1512
  • [Dtype] Improve host codegen handling for subtype by @LeiWang1999 in #1517
  • [Bugfix] Fallback to a Linear Layout instead of raising errors by @LeiWang1999 in #1521
  • Use TargetIsCuda for all cuda target by @oraluben in #1522
  • Fix fp4 pointer arithmetic in CUDA codegen by @LJC00118 in #1524
  • [Enhancement] Improve GitHub Actions permissions check and refine performance regression testing by @xwhzz in #1519
  • [Release] Bump version into 0.1.7.post1 by @LeiWang1999 in #1506
  • [Pipeline] Refactor buffer allocation in Inject Pipeline Pass by @LeiWang1999 in #1525
  • [Dev] Fix when build local version with isolated build by @oraluben in #1487
  • [Bugfix] Skip stride check for subtype by @LeiWang1999 in #1531
  • [Lint] Enable whitespace and permission bit hooks by @XuehaiPan in #1439
  • [Enhancement][Tool] Tree-style pretty ASTPrinter by @SiriusNEO in #1468
  • [Fix] Add support for non-var complement arithmetic computation (#1374) by @kurisu6912 in #1533
  • [BugFix] Complete vectorized loading for common dtypes by @SiriusNEO in #1536
  • [Compat] Add CUDA version check for __nv_fp8_e8m0 type by @LeiWang1999 in #1537
  • [BugFix] Fix bugs of varlen attention forward examples caused by S_q != S_kv by @hukongyi in #1530
  • [Bug] Fix hanging from reduction on sm120 by @PannenetsF in #1540
  • [example] use T.dynamic instead of tvm.te.var by @botbw in #1538
  • [Enhancement] Refactor KernelCache to use inheritance-based design by @sgjzfzzf in #1483
  • [Bugfix] Avoid considering local.var buffer as local by @LeiWang1999 in #1541
  • [Bugfix] Fix of T.Fill for local.var by @LeiWang1999 in #1543
  • [Z3] Change z3 timeout to rlimit for deterministic prove behavior by @kurisu6912 in #1542
  • [Feat] Adapt gemm v2 for cutedsl backend by @lucifer1004 in #1544
  • [Enhancement] Support larger H in deepseek sparse mla backward via split-H by @Rachmanino in #1548
  • [Bugfix] Fix regression test to use installed package instead of source directory by @xwhzz in #1550
  • [Refactor] Introduce layout annotations for ParallelOPNode and CopyNode by @LeiWang1999 in #1539
  • [Script] Provide regression test script to help benchmark regression in local env by @LeiWang1999 in #1551
  • [Typing] Update Kernel signature and add type hints for buffer operations by @clouds56 in #1545
  • [CI]: Bump actions/upload-artifact from 4 to 6 by @dependabot[bot] in #1555
  • [Refactor] Use cuda capability from torch to be more generic by @oraluben in #1557
  • [CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #1556
  • [Host] Provide post process to customize host code and enhance nullable check by @LeiWang1999 in #1562
  • [Release] Build tilelang against CUDA 13.1 in CI by @oraluben in #1532
  • [LazyJIT] Move Type Annotations to Function Body by @kurisu6912 in #1480
  • [bugfix] fix missing clear_accum logic for gemm_sp_v2 by @botbw in #1563
  • [Misc] Remove unused tl_pipeline_sync. by @c8ef in #1566
  • [Refactor] Improve scalarization handling in Pass VectorizeLoop by @LeiWang1999 in #1565
  • [Refactor] Simplify do_bench calls by using default warmup and rep parameters by @LeiWang1999 in #1568
  • [CI] Refactor PR regression test job conditions by @xwhzz in #1569
  • [Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition by @LeiWang1999 in #1559
  • [Refactor] Enhance deterministic ordering in shared memory allocation merge. by @LeiWang1999 in #1570
  • [Enhancement] Improve equality checks in layout nodes and fragment validation by @LeiWang1999 in #1573
  • [Feature] add kUseCooperativeLaunch tag for tvm_ffi by @silentCoder-dev in #1572
  • [Refactor] Remove unnecessary logging configuration in Analyzer.py by @LeiWang1999 in #1574
  • [Release] Bump version to 0.1.7.post2 by @LeiWang1999 in #1575
  • [BugFix] Change default rounding mode for fp4 conversions by @LJC00118 in #1580
  • [CI] Add CUDA-aware pytest scheduler + auto workers by @LeiWang1999 in #1584
  • [Enhancement] Improve performance regression output with timing and streaming by @xwhzz in #1585
  • [Bugfix] Add kernel_global_source property to TVMFFIKernelAdapter by @haok1402 in #1589
  • [BugFix] Add PrimExpr substitution support for AttrStmt nodes by @LJC00118 in #1583
  • [BugFix] fix tcgen5mma example by @Rachmanino in #1577
  • [Refactor] Use access_ptr instead of buffer and offsets for cp async params by @LeiWang1999 in #1590
  • [Layout] Support annotating loop layout in frontend by @LeiWang1999 in #1579
  • [Typo] Rename loop layout annotation test by @LeiWang1999 in #1596
  • [Fix] Add register to read A ptr in test_tilelang_language_cooperative.py by @silentCoder-dev in #1593
  • [Feat] PDL Support by @w169q169 in #1494
  • [Enhancement][Subtype] Enhance symbolic shape/stride handling for subtype by @LeiWang1999 in #1599
  • [Fix][CuteDSL] add support for tanh/tanhf (fixes #1595) by @lucifer1004 in #1597
  • [Release] Fix race condition when publishing by @oraluben in #1578
  • Add conversion from cutlass::float_e4m3/e5m2 to tl::float_e4m3/e5m2 by @LJC00118 in #1600
  • [Enhancement][AMD] Add preshuffle fp8 gemm example on amd. by @Gongen-Ali in #1605
  • [Bugfix] Mangle Single Precision Mathematical Functions of cuda math api by @silentCoder-dev in #1602
  • [Bugfix] Open Rocm ci test and fix some bugs. by @Gongen-Ali in #1443
  • [Feature] Add more curand operations & support vectorization by @silentCoder-dev in #1582
  • [Enhancement] Allow import tilelang on CPU-only machines without CUDA libraries by @XuehaiPan in #1481
  • [BugFix] Add pre-commit to requirements-dev.txt by @asaadkhaja99 in #1611
  • [BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop by @SiriusNEO in #1607
  • [Feat] Add strong checker to detect data racing in T.Parallel by @kurisu6912 in #1615
  • [Feature] add T.sync_warp & T.shfl_sync; change extern pdl into intrin by @silentCoder-dev in #1614
  • [RaceChecker] RaceChecker report warning rather than error for backward compatibility by @kurisu6912 in #1620
  • [BugFix] Fix ForwardRef usage in v2 frontend (#1619) by @kurisu6912 in #1621
  • [Refactor] Move ConstrVisitor to src/transform/common/constr_visitor.h for reuse by @silentCoder-dev in #1622
  • [Feat] Improve T.reduce_absmax to use less abs call by @kurisu6912 in #1626
  • [Bugfix] Do not consider local.var as local buffer during LowerTileOP by @LeiWang1999 in #1628
  • [Feature] Add hoist_broadcast_values pass by @silentCoder-dev in #1606
  • [Enhancement][CUDA] Support nvidia-cuda-nvcc as nvcc by @clouds56 in #1528
  • [Bugfix] Fallback into full region when dynamic buffer read region cannot be proved by @LeiWang1999 in #1618
  • [Feat] Allow print macro call stack in device assert by @kurisu6912 in #1616
  • [BugFix] Correct index_map selection for transposed A matrix in MFMA Layout with k_dim==4 and open rocm-ci for gemmsr by @benenzhu in #1627
  • [Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3.2 by @hammersam in #1636
  • [Bugfix] Introduce a flag to avoid unnecessary broadcast hoist and enable for let stmt by @LeiWang1999 in #1638
  • [Refactor][CI] Reduce sparse related test time by @LeiWang1999 in #1637
  • [Refactor] Unify @jit and @lazy_jit into a single @jit decorator by @LeiWang1999 in #1632
  • [Bugfix] Fix pdl related intrin handling to avoid strict annotation codegen by @LeiWang1999 in #1650
  • [Bugfix] reverted unexpected tvm changes by @LeiWang1999 in #1651
  • [Bugfix] reverted unexpected tvm changes by @LeiWang1999 in #1652
  • [Refactor] Move dtypes.py from eager to language and add bits/bytes properties by @LeiWang1999 in #1646
  • [Feat] Allow dangling producer in wasp pipeline planning (#1263) by @kurisu6912 in #1647
  • [bugfix] fix smem alloc for single warp reduce by @botbw in #1643
  • [Example] Add attention sink varlen examples by @Rachmanino in #1645
  • [ASTPrinter] Fix IfThenElse printing and some format problems by @SiriusNEO in #1640
  • [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1610
  • [Enhancement] Update LetStmtNode handling in loop vectorization to support variable binding overrides by @Rachmanino in #1649
  • [Example] Remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py by @GoldenStain in #1634
  • [CUDA] Introduce simulated load/store 256bits access for CUDA compatibility by @LeiWang1999 in #1656
  • [Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case by @LeiWang1999 in #1654
  • [Bugfix] Fix missing annotations for default CallNode Visitor by @LeiWang1999 in #1659
  • [Clean] Remove unnecessary debug print by @LeiWang1999 in #1661
  • [Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies by @LeiWang1999 in #1657
  • [Refactor] Improve CallNode handling to include annotations in various operations by @LeiWang1999 in #1663
  • [EagerJIT] Add Support for Parameter Only Kernel Compilation by @kurisu6912 in #1664
  • [AutoDD] Add Tilelang AutoDD to Reduce Buggy Program by @KEKE046 in #1639
  • [Feature] Support cp.reduce.async.bulk.tensor by @Rachmanino in #1667
  • chore: update CI cutedsl version to 4.3.5 by @lucifer1004 in #1665
  • [CUDA] Enhance Broadcast Codegen for Symbolic Value by @LeiWang1999 in #1669
  • [EagerJIT] Fix bug in handling of positional arguments by @kurisu6912 in #1675
  • [Feature] Reimplement Threadsync with ConstrVisitor by @silentCoder-dev in #1631
  • [Clean][Refactor] Phaseout Legacy Pass ParallelLoopTransformer by @LeiWang1999 in #1672
  • [Feature] Atomic Reduction Operations and Vectorization Enhancement by @LeiWang1999 in #1676
  • [Refactor] Move AtomicAdd Vectorization to VectorizeLoop Pass by @LeiWang1999 in #1677
  • [Bugfix] Relax region analysis for complex expression by @LeiWang1999 in #1679
  • [Example] Add example for mHC inference kernels. by @Elevator14B in #1684
  • [Analyzer] Fix missing assume in tvm analyzer by @kurisu6912 in #1680
  • Refactor: Use centralized do_bench from tilelang.profiler by @LeiWang1999 in #1670
  • [Feature] Introduce DecoupleTypeCast pass for mixed-precision vectorization by @LeiWang1999 in #1644
  • [Release] Bump Version into v0.1.7.post3 by @LeiWang1999 in #1685
  • [Release] Fix release wheels by @oraluben in #1687
  • [BUG] Fix dsa_sparse_finetune/sparse_mla_bwd.py bug by @xiuhu17 in #1588
  • [Bugfix] Reorganize pass for thread_sync by @silentCoder-dev in #1682
  • [BugFix] fix warning on deepseek_v32 topk_selector.py by @sgjzfzzf in #1681
  • [tvm-ffi] Enable tvm-ffi for metal backend by @oraluben in #1289
  • [Analyzer] Fix missing assume in tvm analyzer by @LJC00118 in #1695
  • [Chore] Use python-side control flow keywords in examples for consistency by @Rachmanino in #1692
  • [Bugfix][Refactor] Always disable light storage reuse by @LeiWang1999 in #1691
  • [Enhancement] Log warnings for OOB accesses to non-global buffers by @SiriusNEO in #1693
  • Enhance loop vectorization logic for CallNode handling by @LeiWang1999 in #1696
  • [BugFix] Fix JITKernel export_library bug by @chengyupku in #1699
  • [Enhancement] Handle vectorizable calls by @LeiWang1999 in #1700
  • [BugFix] Fix unsafe visit else case under WarpSpecializationScope by @SiriusNEO in #1702
  • [Enhancement] Use cute::elect_one_sync() for slightly better performance by @Rachmanino in #1703
  • [Enhancement] Remove RewriteUnsafeSelect Pass by @LJC00118 in #1705
  • [BugFix] Corrected when proving loop layout contains a fragment buffer layout by @LeiWang1999 in #1708
  • [Bugfix] Improve robustness of ProveFragmentContains with fully replicated layout by @LeiWang1999 in #1709
  • [BugFix] Add int64_t support for AtomicAdd by @LeiWang1999 in #1716
  • [Refactor] Introduce GemmInst enumeration and update warp partitioning logic by @Rachmanino in #1707
  • [Refactor] Phaseout unnecessary checks for pr #1707 by @LeiWang1999 in #1721
  • [Refactor] re-implement vector subtype and its access method by @LeiWang1999 in #1722
  • [EagerJIT] Lazy Evaluation of Kernel Body in Eager JIT (#1690) by @kurisu6912 in #1694
  • [Enhancement] Legalize subtype access by @LeiWang1999 in #1724
  • [EagerJIT] Enhance auto inference of lazyjit and eager jit by @kurisu6912 in #1704
  • [Refactor] Enhance variable substitution in device function generation by @LeiWang1999 in #1723
  • [Bugfix] Fix incorrect alignment of vectorized subtype by @LeiWang1999 in #1726
  • [Enhancement] Add explicit global memory load/store intrinsics (ldg/stg 32/64/128) by @LeiWang1999 in #1717
  • [Refactor] Remove external buffer conflict check in pipeline injection by @LeiWang1999 in #1727
  • [Refactor] Relocate layout transformation of ptx_stmatrix by @LeiWang1999 in #1689
  • [AMD] Add MI350/MI355 FP8 support by @hubertlu-tw in #1718
  • [Bugfix] revert incorrect fast path for parallel layout inference by @LeiWang1999 in #1730
  • [Example] Add KDA algorithm implementation in tilelang by @wfloveiu in #1660
  • [Feature] Support E8M0 related type conversion and vectorized cast by @SiriusNEO in #1731
  • [BugFix] Remove unnecessary binding in loop variable analysis and add test for issue 1728 by @kurisu6912 in #1735
  • Add swizzle layout detection and automatic merging for layout conflicts by @LeiWang1999 in #1736
  • [Bugfix] Handle offset handling for subtype ptr by @LeiWang1999 in #1738
  • [EagerJIT] Allow dummy parameter in jit kernel by @kurisu6912 in #1737
  • [Feature] Add build date to version metadata by @LeiWang1999 in #1742
  • [BugFix] Fix FP4 related vectorized cast by @chaospointer in #1741
  • [Refactor] Disable Predicated LDG PTX Lowering by default by @LeiWang1999 in #1739
  • [Layout] Fix Layout Bugs in Parallel and Reduce by @kurisu6912 in #1713
  • [fix]: fix deepseek_mla amd example and add aiter mla compare test by @ZiguanWang in #1740
  • [Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics by @Rachmanino in #1733
  • [BugFix] Fix several bugs in CodeGen for CuTeDSL backend by @Rachmanino in #1746
  • Update import for compare_tensors from test_utils_kda by @pmixer in #1748
  • [Lint] Remove diff arguments in Ruff and sync some versions by @SiriusNEO in #1751
  • [Refactor] Rename EagerJIT examples to avoid confusion by @SiriusNEO in #1750
  • [AMD] Fix ROCm FP8 dtype selection and MFMA support on gfx942/gfx950 by @hubertlu-tw in #1743
  • [Feature] Support message-only debug print by @Rachmanino in #1755
  • [EagerJIT] Update README example to eager jit by @kurisu6912 in #1752
  • [BugFix] Stride check and fix for tensors with zero-stride argument by @tzj-fxz in #1749
  • [BugFix] Always build guard in loop partitioning to prevent out-of-bounds access by @LeiWang1999 in #1756
  • [Tool] Add tool to print fragment in thread value view by @kurisu6912 in #1759
  • [Enhancement] Add dynamic symbolic constraints support for Profiler benchmarking by @LeiWang1999 in #1753
  • [ThreadSync] Use Z3 for constraint equivalence checking by @LeiWang1999 in #1760
  • [Feature] Implement LoopUnswitching Pass by @chengyupku in #1747
  • [Chore] Remove unnecessary log from z3 by @Rachmanino in #1763
  • [Bugfix] Revert the initial value of Z3 SetRLimit by @LeiWang1999 in #1765
  • [Feature] Enhance Loop Unswitching with Let Binding and Condition Handling by @LeiWang1999 in #1766
  • [Bugfix] Add predicate to loads inside predicated stores in LowerLDGSTG pass by @LeiWang1999 in #1767
  • [Feature] Add PassConfig for Controlling Let Statement Inlining in Simplify Pass by @LeiWang1999 in #1769
  • [Fix] Change ue8m0 default round mode to cudaRoundPosInf by @SiriusNEO in #1770
  • [Feature] Support tcgen5mma lowering for .kind::i8 by @Rachmanino in #1764
  • [Refactor] Unify the usage of cast-related operators by @SiriusNEO in #1757
  • [Bugfix] Copy pass_configs dict to prevent mutation across multiple JIT compilations by @LeiWang1999 in #1776
  • [CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1775
  • [Refactor] Improve type annotations and reduce some lint errors in frontend by @SiriusNEO in #1777
  • Update TVM: fix select/if_then_else out-of-bounds access by @LeiWang1999 in #1783
  • [Feature] Add fully replicated layout interface in annotation layout by @tzj-fxz in #1772
  • [Example][BugFix] Fix arguments override in deepseek_v32 topk_selector by @ljwljwljwljw in #1784
  • [BugFix] Fix reduce_sum with clear=False not accumulating correctly by @ShaobinChen-AH in #1778
  • fix(intrinsics): add missing _legalize_to_buffer_region in SM70 emitter by @Coloured-glaze in #1786
  • [Enhancement] Enhance register vectorize inference by @LeiWang1999 in #1785
  • [Bugfix] Fix thread storage sync conflict detection for loop carry write-after-read by @LeiWang1999 in #1781
  • [Fix] cython 3.0 generates incorrect code for python stable api by @oraluben in #1789
  • [BugFix] Update buffer access in TensorCoreIntrinEmitter to handle variable dimensions correctly by @xwhzz in #1794
  • [ThreadSync] Skip (tx1 != tx2) checking for loop carry analysis by @LeiWang1999 in #1795
  • [Feature] Add option to disable out-of-bound access warnings in safe memory access legalization by @kurisu6912 in #1797
  • [Docs] Add Python Compatibility document of TileLang by @LeiWang1999 in #1745
  • [Refactor] Reorganize ParallelOp code structure and move ProveFragmentContains to layout utils by @LeiWang1999 in #1779
  • [Feature] Support passing PrimExpr value in tile-level atomic operation by @SiriusNEO in #1796
  • [Bugfix] Support loop-dependent conditions in IfThenElse within T.Pipelined by @ljwljwljwljw in #1799
  • [BugFix] Missing Recursive Loop Var Checking in Loop Unswitching by @kurisu6912 in #1801
  • Fix a 3.9 issue. add _typing.py to dist check by @oraluben in #1803
  • [Docs][Puzzles] Add TileLang puzzles in README by @SiriusNEO in #1806
  • [Docs] Hotfix wrong link by @SiriusNEO in #1807
  • [Enhancement] Improve plot_layout visualization for Layouts by @LeiWang1999 in #1811
  • [Feat] profiler support cudagraph backend by @cscyuge in #1658
  • Handle staled autotune state with tvm-ffi adapter. by @haok1402 in #1812
  • [BugFix] LoopUnswitching: gate non-trivial else behind PassConfig by @LeiWang1999 in #1816
  • [Release] Update dependencies to resolve several issues by @oraluben in #1817
  • [BugFix] Fix fp16 annotate_l2_hit_ratio host stub compilation (issue #1810) by @LeiWang1999 in #1818
  • [Bugfix] Remove mistaken coalesced_width parameter in regression test of fusedmoe kernel by @xwhzz in #1820
  • [Release] Add build for python 3.14t by @oraluben in #1805
  • Fix: treat kParallel as serial when vectorizing by @LeiWang1999 in #1819
  • [Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compatible wheels) by @LeiWang1999 in #1821
  • [Analyzer] Add SideEffect Checking in ConstIntBound Analyzer by @kurisu6912 in #1824
  • [Bugfix] Fix ast builder error for value -= 1 by @LeiWang1999 in #1825
  • [Release][Build] Merge libtilelang and libtilelang_modules by @oraluben in #1814
  • [Bugfix] Fix threadIdx variable lookup by thread_tag instead of position in ThreadSync by @LeiWang1999 in #1829
  • [Docs] Update nightly build installation instructions in README and Installation guide by @xwhzz in #1830
  • [BugFix] Reset cur_expect_idx_ correctly for multi-kernel TMA barrier injection by @ColmaLiu in #1828
  • [Refactor] Treat local.var as local buffers when deciding vectorization for stable actions by @LeiWang1999 in #1835
  • Fix tilelang global load/store template by @LJC00118 in #1837
  • [Refactor] Introduce T.access_of to combine T.address_of and access_ptr by @LeiWang1999 in #1827
  • [CUDA][Feature] Add packed FP32x2 math intrinsics and auto vectorized support by @LeiWang1999 in #1839
  • [Example][BugFix] 1SM GEMM example on Blackwell and fix handling of mbar by @Rachmanino in #1774
  • [Feature] Hierarchical reduction and warp reduction intrinsics support by @tzj-fxz in #1762
  • [Dist][Release] Use one wheel for different CUDA version by @oraluben in #1826
  • [Enhancement] Optimize templates for half/bfloat16 by @LJC00118 in #1845
  • ThreadSync: avoid barriers between atomic ops by @LeiWang1999 in #1852
  • [BugFix] Fix eager mode where there is no tensor args by @Rachmanino in #1851
  • [AMD] Fix bugs about AMD FA kernel by @danielhua23 in #1701
  • Add an example: mHC residual projection backward by @Da1sypetals in #1758
  • [Release] Bump version into v0.1.8 by @LeiWang1999 in #1853

New Contributors

Full Changelog: v0.1.7...v0.1.8