Release v0.1.8 · tile-ai/tilelang

What's Changed

[Bugfix][Build] Update CMake configuration to remove project root injection for sys.path by @LeiWang1999 in #1385
[BugFix] Fix split kernel layout bug of GQA decode by @tzj-fxz in #1386
[Feat] Add better repr print for Layout and Fragment by @kurisu6912 in #1392
[Doc] Logging docs for Tilelang/TVM by @SiriusNEO in #1395
[Enhancement] Refactor inflight computing to support dynamic pipeline extents by @LeiWang1999 in #1399
[AMD] Fix 3 bugs when build docker on amd mi3x gpu by @danielhua23 in #1401
[Typo] Fix tilelang link in README.md by @senlyu163 in #1402
[Dependency] Update apache-tvm-ffi version to >=0.1.2 by @LeiWang1999 in #1400
[AMD] Enable FA2 fwd on AMD MI300X by @danielhua23 in #1406
[Typo] fix typo for SM120 by @Cunxiao2002 in #1408
[Doc] Minor documentation update by @LeiWang1999 in #1410
[Dependency] Add torch-c-dlpack-ext to project requirements by @LeiWang1999 in #1403
[Bugfix] Alloc T.make_tensor not on the top of prim_func by @LeiWang1999 in #1412
[Enhancement] Introduce T.__ldg by @LeiWang1999 in #1414
[Enhancement] Improve vectorization invariant check by @LJC00118 in #1398
[Lint] Phaseout Yapf format and embrace ruff format by @LeiWang1999 in #1417
[Atomic] Use ptr for atomicAdd dst instead of reference by @LeiWang1999 in #1425
[CUDA] Add read-only parameter annotation for CUDA codegen by @LeiWang1999 in #1416
[Refactor] Phase out the primitives folder since its design has been merged into tileop by @LeiWang1999 in #1429
[CI]: Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #1431
[CI]: Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #1432
[Bugfix] Convey compile_flags to ffi compilation path with pass_configs by @LeiWang1999 in #1434
[Enhancement] Improve buffer usage tracking in MakePackedAPI by @LeiWang1999 in #1435
[Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice by @SiriusNEO in #1405
[Enhancement] Include PrimFunc name in memory cache logs for better debugging by @LeiWang1999 in #1437
[CI] Update lint dependencies and fix lint on trunk by @XuehaiPan in #1433
[Enhancement] Refactor vectorization checks in loop_vectorize by @LeiWang1999 in #1440
[Enhancement] Implement vectorized FP8 to FP32 cast by @LJC00118 in #1438
[Feature] Support region as input of T.cumsum by @Dayuxiaoshui in #1426
[Fix] Fix analyzer bind conflicting bug in #1442 by @kurisu6912 in #1446
[Refactor] Reduce direct dependency on PyTorch due to its limited type support by @LeiWang1999 in #1444
[Refactor] Use pytest.mark.parametrize to speed up parallel testing by @kurisu6912 in #1447
[Docs] Improve installation instructions for developers by @SiriusNEO in #1450
[Feat] Integrate Z3 in TVM Arith Analyzer by @kurisu6912 in #1367
[Bugfix] Improve autotune from elementwise_add function in examples by @senlyu163 in #1445
[Language] Introduce T.annotate_restrict_buffers by @LeiWang1999 in #1428
[Analyzer] Require loop extent > 0 when entering loop (#1012) by @kurisu6912 in #1451
[BugFix] Update CI to ROCm-7.1 by @Gongen-Ali in #1449
[Enhancement] Update examples and tests for improved type handling functionality by @LeiWang1999 in #1448
[Issue Template] Enable blank issues in GitHub issue template by @LeiWang1999 in #1453
[CI] Moved the clang-tidy step to after pip install by @LeiWang1999 in #1456
[Bug] Fix tvm build script when patchelf is not found by @kurisu6912 in #1459
[Analyzer] Fix floordiv & floormod bug in z3 prover by @kurisu6912 in #1458
[Cache] Rename sparse compress cache directory by @LeiWang1999 in #1460
[Language] Adds a random number generation capability through curand_kernel by @silentCoder-dev in #1461
remove unused duplicated type check by @sgjzfzzf in #1462
feat(cutedsl): add CuTeDSL backend by @lucifer1004 in #1421
[Refactor] Rename test for curand & add triton baseline in test_tilelang_language_rand.py by @silentCoder-dev in #1464
[ArgBinder] Enhance shape variable handling and assertions by @LeiWang1999 in #1467
[Language] Make TL scripts friendly to Python syntax highlights by @SiriusNEO in #1466
[Refactor] Remove triton dependence in testing & move triton baseline into examples by @silentCoder-dev in #1470
[Language] Enhance T.dtype.as_torch conversion for compatibility by @LeiWang1999 in #1473
[News] update with latest news by @LeiWang1999 in #1475
[Enhancement] Use static Z3 context by @LeiWang1999 in #1482
[Enhancement] Enhance let binding handling in layout inference and warp specialized pass by @LeiWang1999 in #1484
[Refactor] Phaseout PassConfig kDisableDynamicTailSplit and kDynamicAlignment as they are legacy by @LeiWang1999 in #1486
[Enhancement] Optimize the time cost of critical path for IntervalSetEvaluator by @LeiWang1999 in #1491
[CI] Add performance regression test script by @xwhzz in #1489
Pin nvidia-cutlass-dsl to 4.3.3 by @lucifer1004 in #1497
[Language] Remove ConstIf Frame for Better Meta-Programming by @kurisu6912 in #1496
[Bugfix][CI] Fix concurrency bug in regression test workflow by @xwhzz in #1500
[Refactor] Phaseout legacy alloc_local statement in examples and introduce processing for floating fragment buffers by @LeiWang1999 in #1495
[Enhancement] Optimize MHA varlen fwd and support autotune by @Rachmanino in #1499
[Enhancement] Refactor CUDA vectorized cast generation and remove unsupported FP8 type by @LJC00118 in #1474
[Dependency] Update apache-tvm-ffi to >=0.1.6 for memory safety when gc is not enabled by @LeiWang1999 in #1502
Update cutedsl docs and version check by @lucifer1004 in #1503
[Misc] configure pymarkdown by @lucifer1004 in #1505
[Language] Fix gemm syntax highlight by @SiriusNEO in #1476
[Fix] Fix TL_ENABLE_PTXAS_VERBOSE_OUTPUT has no effect in tvm-ffi by @kurisu6912 in #1511
[Refactor] Phaseout execution_backend ctypes by @LeiWang1999 in #1510
[Testing] Add Memory Leak Test by @kurisu6912 in #1516
[Refactor] Support auto swizzling for tma store and phaseout related layout annotations by @LeiWang1999 in #1509
[CuTeDSL][Fix] thread safety + context safety by @lucifer1004 in #1513
[BugFix] Phaseout unused tests for gqa decode kernels and add the kernels to CI by @tzj-fxz in #1515
[Cleanup] Remove unnecessary macros in tilelang examples by @Rachmanino in #1514
Fix ramp_lanes calculation in CUDA codegen by @LJC00118 in #1518
[Misc] add env for default target/backend/verbose by @lucifer1004 in #1512
[Dtype] Improve host codegen handling for subtype by @LeiWang1999 in #1517
[Bugfix] Fallback to a Linear Layout instead of raising errors by @LeiWang1999 in #1521
Use TargetIsCuda for all cuda target by @oraluben in #1522
Fix fp4 pointer arithmetic in CUDA codegen by @LJC00118 in #1524
[Enhancement] Improve GitHub Actions permissions check and refine performance regression testing by @xwhzz in #1519
[Release] Bump version into 0.1.7.post1 by @LeiWang1999 in #1506
[Pipeline] Refactor buffer allocation in Inject Pipeline Pass by @LeiWang1999 in #1525
[Dev] Fix when build local version with isolated build by @oraluben in #1487
[Bugfix] Skip stride check for subtype by @LeiWang1999 in #1531
[Lint] Enable whitespace and permission bit hooks by @XuehaiPan in #1439
[Enhancement][Tool] Tree-style pretty ASTPrinter by @SiriusNEO in #1468
[Fix] Add support for non-var complement arithmetic computation (#1374) by @kurisu6912 in #1533
[BugFix] Complete vectorized loading for common dtypes by @SiriusNEO in #1536
[Compat] Add CUDA version check for __nv_fp8_e8m0 type by @LeiWang1999 in #1537
[BugFix] Fix bugs of varlen attention forward examples caused by S_q != S_kv by @hukongyi in #1530
[Bug] Fix hanging from reduction on sm120 by @PannenetsF in #1540
[example] use T.dynamic instead of tvm.te.var by @botbw in #1538
[Enhancement] Refactor KernelCache to use inheritance-based design by @sgjzfzzf in #1483
[Bugfix] Avoid considering local.var buffer as local by @LeiWang1999 in #1541
[Bugfix] Fix of T.Fill for local.var by @LeiWang1999 in #1543
[Z3] Change z3 timeout to rlimit for deterministic prove behavior by @kurisu6912 in #1542
[Feat] Adapt gemm v2 for cutedsl backend by @lucifer1004 in #1544
[Enhancement] Support larger H in deepseek sparse mla backward via split-H by @Rachmanino in #1548
[Bugfix] Fix regression test to use installed package instead of source directory by @xwhzz in #1550
[Refactor] Introduce layout annotations for ParallelOPNode and CopyNode by @LeiWang1999 in #1539
[Script] Provide regression test script to help benchmark regression in local env by @LeiWang1999 in #1551
[Typing] Update Kernel signature and add type hints for buffer operations by @clouds56 in #1545
[CI]: Bump actions/upload-artifact from 4 to 6 by @dependabot[bot] in #1555
[Refactor] Use cuda capability from torch to be more generic by @oraluben in #1557
[CI]: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #1556
[Host] Provide post process to customize host code and enhance nullable check by @LeiWang1999 in #1562
[Release] Build tilelang against CUDA 13.1 in CI by @oraluben in #1532
[LazyJIT] Move Type Annotations to Function Body by @kurisu6912 in #1480
[bugfix] fix missing clear_accum logic for gemm_sp_v2 by @botbw in #1563
[Misc] Remove unused tl_pipeline_sync. by @c8ef in #1566
[Refactor] Improve scalarization handling in Pass VectorizeLoop by @LeiWang1999 in #1565
[Refactor] Simplify do_bench calls by using default warmup and rep parameters by @LeiWang1999 in #1568
[CI] Refactor PR regression test job conditions by @xwhzz in #1569
[Parallel][Infer] Free-mode chooses minimal replication between buffer-based and PlanLoopPartition by @LeiWang1999 in #1559
[Refactor] Enhance deterministic ordering in shared memory allocation merge. by @LeiWang1999 in #1570
[Enhancement] Improve equality checks in layout nodes and fragment validation by @LeiWang1999 in #1573
[Feature] add kUseCooperativeLaunch tag for tvm_ffi by @silentCoder-dev in #1572
[Refactor] Remove unnecessary logging configuration in Analyzer.py by @LeiWang1999 in #1574
[Release] Bump version to 0.1.7.post2 by @LeiWang1999 in #1575
[BugFix] Change default rounding mode for fp4 conversions by @LJC00118 in #1580
[CI] Add CUDA-aware pytest scheduler + auto workers by @LeiWang1999 in #1584
[Enhancement] Improve performance regression output with timing and streaming by @xwhzz in #1585
[Bugfix] Add kernel_global_source property to TVMFFIKernelAdapter by @haok1402 in #1589
[BugFix] Add PrimExpr substitution support for AttrStmt nodes by @LJC00118 in #1583
[BugFix] fix tcgen5mma example by @Rachmanino in #1577
[Refactor] Use access_ptr instead of buffer and offsets for cp async params by @LeiWang1999 in #1590
[Layout] Support annotating loop layout in frontend by @LeiWang1999 in #1579
[Typo] Rename loop layout annotation test by @LeiWang1999 in #1596
[Fix] Add register to read A ptr in test_tilelang_language_cooperative.py by @silentCoder-dev in #1593
[Feat] PDL Support by @w169q169 in #1494
[Enhancement][Subtype] Enhance symbolic shape/stride handling for subtype by @LeiWang1999 in #1599
[Fix][CuteDSL] add support for tanh/tanhf (fixes #1595) by @lucifer1004 in #1597
[Release] Fix race condition when publishing by @oraluben in #1578
Add conversion from cutlass::float_e4m3/e5m2 to tl::float_e4m3/e5m2 by @LJC00118 in #1600
[Enhancement][AMD] Add preshuffle fp8 gemm example on amd. by @Gongen-Ali in #1605
[Bugfix] Mangle Single Precision Mathematical Functions of cuda math api by @silentCoder-dev in #1602
[Bugfix] Open Rocm ci test and fix some bugs. by @Gongen-Ali in #1443
[Feature] Add more curand operations & support vectorization by @silentCoder-dev in #1582
[Enhancement] Allow import tilelang on CPU-only machines without CUDA libraries by @XuehaiPan in #1481
[BugFix] Add pre-commit to requirements-dev.txt by @asaadkhaja99 in #1611
[BugFix] Fix some bugs in lowering ParallelOp and VectorizeLoop by @SiriusNEO in #1607
[Feat] Add strong checker to detect data racing in T.Parallel by @kurisu6912 in #1615
[Feature] add T.sync_warp & T.shfl_sync; change extern pdl into intrin by @silentCoder-dev in #1614
[RaceChecker] RaceChecker report warning rather than error for backward compatibility by @kurisu6912 in #1620
[BugFix] Fix ForwardRef usage in v2 frontend (#1619) by @kurisu6912 in #1621
[Refactor] Move ConstrVisitor to src/transform/common/constr_visitor.h for reuse by @silentCoder-dev in #1622
[Feat] Improve T.reduce_absmax to use less abs call by @kurisu6912 in #1626
[Bugfix] Do not consider local.var as local buffer during LowerTileOP by @LeiWang1999 in #1628
[Feature] Add hoist_broadcast_values pass by @silentCoder-dev in #1606
[Enhancement][CUDA] Support nvidia-cuda-nvcc as nvcc by @clouds56 in #1528
[Bugfix] Fallback into full region when dynamic buffer read region cannot be proved by @LeiWang1999 in #1618
[Feat] Allow print macro call stack in device assert by @kurisu6912 in #1616
[BugFix] Correct index_map selection for transposed A matrix in MFMA Layout with k_dim==4 and open rocm-ci for gemmsr by @benenzhu in #1627
[Example] Add Seesaw Sparse MLA Forward Kernel for DeepSeek-V3.2 by @hammersam in #1636
[Bugfix] Introduce a flag to avoid unnecessary broadcast hoist and enable for let stmt by @LeiWang1999 in #1638
[Refactor][CI] Reduce sparse related test time by @LeiWang1999 in #1637
[Refactor] Unify @jit and @lazy_jit into a single @jit decorator by @LeiWang1999 in #1632
[Bugfix] Fix pdl related intrin handling to avoid strict annotation codegen by @LeiWang1999 in #1650
[Bugfix] reverted unexpected tvm changes by @LeiWang1999 in #1651
[Bugfix] reverted unexpected tvm changes by @LeiWang1999 in #1652
[Refactor] Move dtypes.py from eager to language and add bits/bytes properties by @LeiWang1999 in #1646
[Feat] Allow dangling producer in wasp pipeline planning (#1263) by @kurisu6912 in #1647
[bugfix] fix smem alloc for single warp reduce by @botbw in #1643
[Example] Add attention sink varlen examples by @Rachmanino in #1645
[ASTPrinter] Fix IfThenElse printing and some format problems by @SiriusNEO in #1640
[CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1610
[Enhancement] Update LetStmtNode handling in loop vectorization to support variable binding overrides by @Rachmanino in #1649
[Example] Remove redundant T.copy in examples/deepseek_v32/sparse_mla_fwd.py by @GoldenStain in #1634
[CUDA] Introduce simulated load/store 256bits access for CUDA compatibility by @LeiWang1999 in #1656
[Enhancement] Improve unroll loop functionality for dynamic extent and corresponding test case by @LeiWang1999 in #1654
[Bugfix] Fix missing annotations for default CallNode Visitor by @LeiWang1999 in #1659
[Clean] Remove unnecessary debug print by @LeiWang1999 in #1661
[Bugfix] Fix variable scoping issue in InjectSoftwarePipeline for transitive LetStmt dependencies by @LeiWang1999 in #1657
[Refactor] Improve CallNode handling to include annotations in various operations by @LeiWang1999 in #1663
[EagerJIT] Add Support for Parameter Only Kernel Compilation by @kurisu6912 in #1664
[AutoDD] Add Tilelang AutoDD to Reduce Buggy Program by @KEKE046 in #1639
[Feature] Support cp.reduce.async.bulk.tensor by @Rachmanino in #1667
chore: update CI cutedsl version to 4.3.5 by @lucifer1004 in #1665
[CUDA] Enhance Broadcast Codegen for Symbolic Value by @LeiWang1999 in #1669
[EagerJIT] Fix bug in handling of positional arguments by @kurisu6912 in #1675
[Feature] Reimplement Threadsync with ConstrVisitor by @silentCoder-dev in #1631
[Clean][Refactor] Phaseout Legacy Pass ParallelLoopTransformer by @LeiWang1999 in #1672
[Feature] Atomic Reduction Operations and Vectorization Enhancement by @LeiWang1999 in #1676
[Refactor] Move AtomicAdd Vectorization to VectorizeLoop Pass by @LeiWang1999 in #1677
[Bugfix] Relax region analysis for complex expression by @LeiWang1999 in #1679
[Example] Add example for mHC inference kernels. by @Elevator14B in #1684
[Analyzer] Fix missing assume in tvm analyzer by @kurisu6912 in #1680
Refactor: Use centralized do_bench from tilelang.profiler by @LeiWang1999 in #1670
[Feature] Introduce DecoupleTypeCast pass for mixed-precision vectorization by @LeiWang1999 in #1644
[Release] Bump Version into v0.1.7.post3 by @LeiWang1999 in #1685
[Release] Fix release wheels by @oraluben in #1687
[BUG] Fix dsa_sparse_finetune/sparse_mla_bwd.py bug by @xiuhu17 in #1588
[Bugfix] Reorganize pass for thread_sync by @silentCoder-dev in #1682
[BugFix] fix warning on deepseek_v32 topk_selector.py by @sgjzfzzf in #1681
[tvm-ffi] Enable tvm-ffi for metal backend by @oraluben in #1289
[Analyzer] Fix missing assume in tvm analyzer by @LJC00118 in #1695
[Chore] Use python-side control flow keywords in examples for consistency by @Rachmanino in #1692
[Bugfix][Refactor] Always disable light storage reuse by @LeiWang1999 in #1691
[Enhancement] Log warnings for OOB accesses to non-global buffers by @SiriusNEO in #1693
Enhance loop vectorization logic for CallNode handling by @LeiWang1999 in #1696
[BugFix] Fix JITKernel export_library bug by @chengyupku in #1699
[Enhancement] Handle vectorizable calls by @LeiWang1999 in #1700
[BugFix] Fix unsafe visit else case under WarpSpecializationScope by @SiriusNEO in #1702
[Enhancement] Use cute::elect_one_sync() for slightly better performance by @Rachmanino in #1703
[Enhancement] Remove RewriteUnsafeSelect Pass by @LJC00118 in #1705
[BugFix] Corrected when proving loop layout contains a fragment buffer layout by @LeiWang1999 in #1708
[Bugfix] Improve robustness of ProveFragmentContains with fully replicated layout by @LeiWang1999 in #1709
[BugFix] Add int64_t support for AtomicAdd by @LeiWang1999 in #1716
[Refactor] Introduce GemmInst enumeration and update warp partitioning logic by @Rachmanino in #1707
[Refactor] Phaseout unnecessary checks for pr #1707 by @LeiWang1999 in #1721
[Refactor] re-implement vector subtype and its access method by @LeiWang1999 in #1722
[EagerJIT] Lazy Evaluation of Kernel Body in Eager JIT (#1690) by @kurisu6912 in #1694
[Enhancement] Legalize subtype access by @LeiWang1999 in #1724
[EagerJIT] Enhance auto inference of lazyjit and eager jit by @kurisu6912 in #1704
[Refactor] Enhance variable substitution in device function generation by @LeiWang1999 in #1723
[Bugfix] Fix incorrect alignment of vectorized subtype by @LeiWang1999 in #1726
[Enhancement] Add explicit global memory load/store intrinsics (ldg/stg 32/64/128) by @LeiWang1999 in #1717
[Refactor] Remove external buffer conflict check in pipeline injection by @LeiWang1999 in #1727
[Refactor] Relocate layout transformation of ptx_stmatrix by @LeiWang1999 in #1689
[AMD] Add MI350/MI355 FP8 support by @hubertlu-tw in #1718
[Bugfix] revert incorrect fast path for parallel layout inference by @LeiWang1999 in #1730
[Example] Add KDA algorithm implementation in tilelang by @wfloveiu in #1660
[Feature] Support E8M0 related type conversion and vectorized cast by @SiriusNEO in #1731
[BugFix] Remove unnecessary binding in loop variable analysis and add test for issue 1728 by @kurisu6912 in #1735
Add swizzle layout detection and automatic merging for layout conflicts by @LeiWang1999 in #1736
[Bugfix] Handle offset handling for subtype ptr by @LeiWang1999 in #1738
[EagerJIT] Allow dummy parameter in jit kernel by @kurisu6912 in #1737
[Feature] Add build date to version metadata by @LeiWang1999 in #1742
[BugFix] Fix FP4 related vectorized cast by @chaospointer in #1741
[Refactor] Disable Predicated LDG PTX Lowering by default by @LeiWang1999 in #1739
[Layout] Fix Layout Bugs in Parallel and Reduce by @kurisu6912 in #1713
[fix]: fix deepseek_mla amd example and add aiter mla compare test by @ZiguanWang in #1740
[Refactor] Enhance T.alloc_barrier with new features and deprecate legacy mbarrier related intrinsics by @Rachmanino in #1733
[BugFix] Fix several bugs in CodeGen for CuTeDSL backend by @Rachmanino in #1746
Update import for compare_tensors from test_utils_kda by @pmixer in #1748
[Lint] Remove diff arguments in Ruff and sync some versions by @SiriusNEO in #1751
[Refactor] Rename EagerJIT examples to avoid confusion by @SiriusNEO in #1750
[AMD] Fix ROCm FP8 dtype selection and MFMA support on gfx942/gfx950 by @hubertlu-tw in #1743
[Feature] Support message-only debug print by @Rachmanino in #1755
[EagerJIT] Update README example to eager jit by @kurisu6912 in #1752
[BugFix] Stride check and fix for tensors with zero-stride argument by @tzj-fxz in #1749
[BugFix] Always build guard in loop partitioning to prevent out-of-bounds access by @LeiWang1999 in #1756
[Tool] Add tool to print fragment in thread value view by @kurisu6912 in #1759
[Enhancement] Add dynamic symbolic constraints support for Profiler benchmarking by @LeiWang1999 in #1753
[ThreadSync] Use Z3 for constraint equivalence checking by @LeiWang1999 in #1760
[Feature] Implement LoopUnswitching Pass by @chengyupku in #1747
[Chore] Remove unnecessary log from z3 by @Rachmanino in #1763
[Bugfix] Revert the initial value of Z3 SetRLimit by @LeiWang1999 in #1765
[Feature] Enhance Loop Unswitching with Let Binding and Condition Handling by @LeiWang1999 in #1766
[Bugfix] Add predicate to loads inside predicated stores in LowerLDGSTG pass by @LeiWang1999 in #1767
[Feature] Add PassConfig for Controlling Let Statement Inlining in Simplify Pass by @LeiWang1999 in #1769
[Fix] Change ue8m0 default round mode to cudaRoundPosInf by @SiriusNEO in #1770
[Feature] Support tcgen5mma lowering for .kind::i8 by @Rachmanino in #1764
[Refactor] Unify the usage of cast-related operators by @SiriusNEO in #1757
[Bugfix] Copy pass_configs dict to prevent mutation across multiple JIT compilations by @LeiWang1999 in #1776
[CI] [pre-commit.ci] autoupdate by @pre-commit-ci[bot] in #1775
[Refactor] Improve type annotations and reduce some lint errors in frontend by @SiriusNEO in #1777
Update TVM: fix select/if_then_else out-of-bounds access by @LeiWang1999 in #1783
[Feature] Add fully replicated layout interface in annotation layout by @tzj-fxz in #1772
[Example][BugFix] Fix arguments override in deepseek_v32 topk_selector by @ljwljwljwljw in #1784
[BugFix] Fix reduce_sum with clear=False not accumulating correctly by @ShaobinChen-AH in #1778
fix(intrinsics): add missing _legalize_to_buffer_region in SM70 emitter by @Coloured-glaze in #1786
[Enhancement] Enhance register vectorize inference by @LeiWang1999 in #1785
[Bugfix] Fix thread storage sync conflict detection for loop carry write-after-read by @LeiWang1999 in #1781
[Fix] cython 3.0 generates incorrect code for python stable api by @oraluben in #1789
[BugFix] Update buffer access in TensorCoreIntrinEmitter to handle variable dimensions correctly by @xwhzz in #1794
[ThreadSync] Skip (tx1 != tx2) checking for loop carry analysis by @LeiWang1999 in #1795
[Feature] Add option to disable out-of-bound access warnings in safe memory access legalization by @kurisu6912 in #1797
[Docs] Add Python Compatibility document of TileLang by @LeiWang1999 in #1745
[Refactor] Reorganize ParallelOp code structure and move ProveFragmentContains to layout utils by @LeiWang1999 in #1779
[Feature] Support passing PrimExpr value in tile-level atomic operation by @SiriusNEO in #1796
[Bugfix] Support loop-dependent conditions in IfThenElse within T.Pipelined by @ljwljwljwljw in #1799
[BugFix] Missing Recursive Loop Var Checking in Loop Unswitching by @kurisu6912 in #1801
Fix a 3.9 issue. add _typing.py to dist check by @oraluben in #1803
[Docs][Puzzles] Add TileLang puzzles in README by @SiriusNEO in #1806
[Docs] Hotfix wrong link by @SiriusNEO in #1807
[Enhancement] Improve plot_layout visualization for Layouts by @LeiWang1999 in #1811
[Feat] profiler support cudagraph backend by @cscyuge in #1658
Handle staled autotune state with tvm-ffi adapter. by @haok1402 in #1812
[BugFix] LoopUnswitching: gate non-trivial else behind PassConfig by @LeiWang1999 in #1816
[Release] Update dependencies to resolve several issues by @oraluben in #1817
[BugFix] Fix fp16 annotate_l2_hit_ratio host stub compilation (issue #1810) by @LeiWang1999 in #1818
[Bugfix] Remove mistaken coalesced_width parameter in regression test of fusedmoe kernel by @xwhzz in #1820
[Release] Add build for python 3.14t by @oraluben in #1805
Fix: treat kParallel as serial when vectorizing by @LeiWang1999 in #1819
[Dist] Add lazy-loading stubs for CUDART + NVRTC (CUDA 11/12/13 compatible wheels) by @LeiWang1999 in #1821
[Analyzer] Add SideEffect Checking in ConstIntBound Analyzer by @kurisu6912 in #1824
[Bugfix] Fix ast builder error for value -= 1 by @LeiWang1999 in #1825
[Release][Build] Merge libtilelang and libtilelang_modules by @oraluben in #1814
[Bugfix] Fix threadIdx variable lookup by thread_tag instead of position in ThreadSync by @LeiWang1999 in #1829
[Docs] Update nightly build installation instructions in README and Installation guide by @xwhzz in #1830
[BugFix] Reset cur_expect_idx_ correctly for multi-kernel TMA barrier injection by @ColmaLiu in #1828
[Refactor] Treat local.var as local buffers when deciding vectorization for stable actions by @LeiWang1999 in #1835
Fix tilelang global load/store template by @LJC00118 in #1837
[Refactor] Introduce T.access_of to combine T.address_of and access_ptr by @LeiWang1999 in #1827
[CUDA][Feature] Add packed FP32x2 math intrinsics and auto vectorized support by @LeiWang1999 in #1839
[Example][BugFix] 1SM GEMM example on Blackwell and fix handling of mbar by @Rachmanino in #1774
[Feature] Hierarchical reduction and warp reduction intrinsics support by @tzj-fxz in #1762
[Dist][Release] Use one wheel for different CUDA version by @oraluben in #1826
[Enhancement] Optimize templates for half/bfloat16 by @LJC00118 in #1845
ThreadSync: avoid barriers between atomic ops by @LeiWang1999 in #1852
[BugFix] Fix eager mode where there is no tensor args by @Rachmanino in #1851
[AMD] Fix bugs about AMD FA kernel by @danielhua23 in #1701
Add an example: mHC residual projection backward by @Da1sypetals in #1758
[Release] Bump version into v0.1.8 by @LeiWang1999 in #1853

New Contributors

@danielhua23 made their first contribution in #1401
@senlyu163 made their first contribution in #1402
@Dayuxiaoshui made their first contribution in #1426
@silentCoder-dev made their first contribution in #1461
@sgjzfzzf made their first contribution in #1462
@hukongyi made their first contribution in #1530
@clouds56 made their first contribution in #1545
@c8ef made their first contribution in #1566
@haok1402 made their first contribution in #1589
@w169q169 made their first contribution in #1494
@asaadkhaja99 made their first contribution in #1611
@hammersam made their first contribution in #1636
@GoldenStain made their first contribution in #1634
@KEKE046 made their first contribution in #1639
@xiuhu17 made their first contribution in #1588
@hubertlu-tw made their first contribution in #1718
@wfloveiu made their first contribution in #1660
@chaospointer made their first contribution in #1741
@ZiguanWang made their first contribution in #1740
@pmixer made their first contribution in #1748
@ljwljwljwljw made their first contribution in #1784
@ShaobinChen-AH made their first contribution in #1778
@Coloured-glaze made their first contribution in #1786
@cscyuge made their first contribution in #1658
@ColmaLiu made their first contribution in #1828
@Da1sypetals made their first contribution in #1758

Full Changelog: v0.1.7...v0.1.8
