
[WebGPU] Add gating logic for subgroup shuffle primitives#1

Open
ksgr5566 wants to merge 793 commits into CharlieFRuan:pr-0302-webgpu-shuffle from ksgr5566:webgpu-subgroup-test

Conversation


@ksgr5566 ksgr5566 commented Feb 24, 2026

Summary

This adds gating logic on top of apache#17699 to support optional subgroup shuffle
primitives based on a compile-time flag.

Problem

The PR apache#17699 always generates subgroup shuffle ops when targeting WebGPU.
However, not all WebGPU devices support subgroups. We need a way to:

  • Default to shared memory reductions (universally compatible)
  • Optionally enable subgroup shuffles for devices that support them

Solution

Implement gating via TVM target parameter:

  • Default thread_warp_size=1 disables warp reductions (uses shared memory + barriers)
  • Add target parser UpdateWebGPUAttrs() that sets thread_warp_size=32 when supports_subgroups=true
  • Add --enable-subgroups CLI flag in mlc-llm to surface the option to users

The gating happens at the reduction path selection level (IsWarpReduction() in
lower_thread_allreduce.cc), ensuring subgroup ops are never generated unless explicitly enabled.
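The selection logic can be sketched in a few lines of Python (hypothetical helper names; the real check lives in `IsWarpReduction()` in `lower_thread_allreduce.cc`, and `UpdateWebGPUAttrs()` is the target parser described above):

```python
# Sketch of the reduction-path gating. pick_warp_size stands in for the
# UpdateWebGPUAttrs() target parser; use_warp_reduction stands in for the
# IsWarpReduction() check. Names other than the target attributes are
# hypothetical.

def pick_warp_size(target_attrs: dict) -> int:
    # Default thread_warp_size=1 disables warp/subgroup reductions.
    if target_attrs.get("supports_subgroups", False):
        return 32  # the parser sets thread_warp_size=32
    return 1

def use_warp_reduction(thread_warp_size: int) -> bool:
    # Subgroup shuffle ops are emitted only when warp size > 1;
    # otherwise the shared memory + barrier path is used.
    return thread_warp_size > 1
```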

Changes

Testing

Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared memory reductions;
with the flag, the compiler generates subgroupShuffle* ops.
Both generated WGSL outputs are here: https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f

guan404ming and others added 30 commits November 30, 2025 19:13
…pache#18534)

Add __init__ method to SearchStrategy class to prevent direct
instantiation of the abstract class. This raises a TypeError with a
helpful error message instead of causing a segmentation fault when
SearchStrategy() is called directly or passed to TuneContext.

Also add additional check in TuneContext.__init__ to ensure abstract
SearchStrategy instances are not used.

Fixes apache#18268
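The guard described above can be sketched in pure Python (a minimal stand-in, not TVM's actual implementation):

```python
# Minimal sketch of an abstract-base guard: raise a clear TypeError on
# direct instantiation instead of letting downstream code segfault on an
# uninitialized handle. Class names mirror the commit message.
class SearchStrategy:
    def __init__(self):
        if type(self) is SearchStrategy:
            raise TypeError(
                "SearchStrategy is abstract; instantiate a concrete "
                "subclass instead"
            )

class MySearchStrategy(SearchStrategy):
    """A concrete subclass instantiates normally."""
```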
## Related Issue

closes apache#18355

## Why

Converting PyTorch operations like M[:, rows, cols] = x failed because:
1. The TOPI index_put implementation called len() on TVM Tensor objects
(unsupported)
2. Index tensors with different shapes (e.g., (2,) and (10,)) couldn't
broadcast together

## How

- Added broadcasting support following NumPy rules to handle
multi-dimensional index tensors
- add tests for batched indexing pattern M[:, rows, cols] = x
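The NumPy-style broadcasting rule the fix follows can be sketched as a small shape function (a stand-in for the TOPI logic, not the actual code):

```python
from itertools import zip_longest

# Sketch of NumPy broadcasting over index-tensor shapes: walk both shapes
# right-aligned; dimensions must match or one of them must be 1.
def broadcast_shapes(a, b):
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or x == 1 or y == 1:
            out.append(max(x, y))
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(reversed(out))
```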
…#18536)

Hi Committers,

This PR fixes issue apache#17936. Any suggestions would be
appreciated.

### Root Cause

Code paths that expected vector evaluation encountered the scalar-only
evaluation path (`eval_vec_ = false`).

### Solution

`BroadcastNode` just replicates the same scalar value, so its evaluation
does not require special vector-aware handling.

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
This PR resolves issue apache#18481 by fixing two bugs in the
end-to-end optimization tutorial
(`docs/how_to/tutorials/e2e_opt_model.py`) that prevented it from
running correctly on GPU devices.

### Changes

1. **Added DefaultGPUSchedule transformation**
- Apply `DefaultGPUSchedule` to ensure all GPU functions have proper
thread binding. This fixes the memory verification error: "`Variable is
directly accessed by host memory... Did you forget to bind?`"
  

2. **Fixed VM output handling**
   - Updated to correctly extract tensor from VM output.
…) is false (apache#18525)

This commit fixes tvm.error.InternalError: Check failed:
(index_map_func.has_value()) is false in
[apache#18472](apache#18472)

**Why**
When using mma for MultiLevelTilingTensorCore, users must manually pass
tvm.tir.tensor_intrin as an initializer to register it in LocalBuilder.
This is inconsistent with the wmma workflow, where tvm.tir.tensor_intrin
is imported by default in
[tune_context.py](https://github.com/apache/tvm/blob/main/python/tvm/meta_schedule/tune_context.py#L109)
to ensure that the TensorIntrin required by wmma is registered in
advance. Additionally, the corresponding error message is not
straightforward, which can be confusing for new users who are not
familiar with TVM.

**How**
By importing `tensor_intrin` in the default build step.

---------

Co-authored-by: Balint Cristian <cristian.balint@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
When forcing the use of MMA with MultiLevelTilingTensorCore or directly
applying tensorization via the script below, the required shared memory
size is significantly overestimated compared to the actual usage; at the
same time, the accumulated result of mma is also incorrect. This issue
stems from two root causes:

1. In `MmaToGlobal::Rewrite`, an extra threadIdx.x dimension is
introduced when calling InsertCacheStage, which confuses the memory
analysis and leads to inflated shared memory estimates.
2. In `get_mma_sync_intrin`, the offset computation for fragment C in
get_index_C is incorrect, resulting in erroneous accumulation results.

This PR addresses both issues to ensure accurate shared memory
estimation and correct tensor core accumulation behavior.

**How**
This PR includes the following fixes:

1. Skip the threadIdx.x dimension in `InsertCacheStage` when it is not
required, to prevent spurious shared memory overestimation and repeated
stores.
2. Correct the offset calculation for fragment C in `get_index_C` to
ensure accurate accumulation results during tensor core execution.

**Result**
The above script produces results that match those of PyTorch.

**Env**
NVIDIA A100-SXM4-80GB
## Why

- The interpolate operation was hardcoded to only support NCHW layout
- Users need flexibility to choose the appropriate layout for their
target platform

## How

- Added default_image_layout parameter
- Exposed default_image_layout parameter in the public from_fx()
## Why

Remove hard-coded legacy code in CI.
## Why

- resolve todo in
[test_op_gradient_numeric.py](https://github.com/apache/tvm/compare/main...guan404ming:update-conv2d-test?expand=1#diff-65bec2fe9ca46b486e6e1d3412e9092d25d3815bb6173435501bbfab7eefd87b)
by unifying the dtype used in conv2d related test
- use float32 with reduced range [0, 3] to maintain numerical precision
for gradient checking
…he#18550)

## Why

Fixes interpolation to support different scaling factors for height and
width (e.g., scale_factor=[2.0, 3.0])

## How

- Removed the bug: Stopped extracting just the first element ([0]) from
scale_factor lists
- Passed full value: Now passes the entire scale_factor (scalar or list)
to the underlying implementation, which already handles both correctly
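The fix can be sketched as a small normalization helper (a hypothetical stand-in for the frontend logic, not TVM's actual code):

```python
# Sketch of the fix: pass the full scale_factor through, normalizing a
# scalar to a per-dimension list instead of keeping only the first element.
def normalize_scale_factor(scale_factor, ndim_spatial=2):
    if isinstance(scale_factor, (list, tuple)):
        if len(scale_factor) != ndim_spatial:
            raise ValueError("scale_factor length must match spatial dims")
        return list(scale_factor)
    # Scalar: apply the same factor to every spatial dimension.
    return [scale_factor] * ndim_spatial
```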
…ON` and MLIR >= 15.0 (apache#18555)

Fix a runtime error that occurs when TVM built with `USE_MLIR=ON` and
MLIR >= 15.0, as shown below.

```
Traceback (most recent call last):
  File "<unknown>", line 0, in tvm::arith::__TVMFFIStaticInitFunc2()
  File "<unknown>", line 0, in tvm::ffi::reflection::ObjectDef<tvm::arith::PresburgerSetNode>::ObjectDef<>()
  File "<unknown>", line 0, in void tvm::ffi::reflection::ObjectDef<tvm::arith::PresburgerSetNode>::RegisterExtraInfo<>()
  File "build/src/ffi/object.cc", line 500, in TVMFFITypeRegisterMetadata
  File "src/ffi/object.cc", line 240, in void tvm::ffi::TypeTable::RegisterTypeMetadata(int32_t, const TVMFFITypeMetadata *)
RuntimeError: Overriding arith.PresburgerSet, possible causes:
- two ObjectDef<T>() calls for the same T 
- when we forget to assign _type_key to ObjectRef<Y> that inherits from T
- another type with the same key is already registered
Cross check the reflection registration.
libc++abi: terminating due to uncaught exception of type tvm::ffi::Error
```
… to commit c71aefc) (apache#18545)

- Expose max_trials_per_task parameter to static_shape_tuning_pipeline
- Adjust default TOTAL_TRIALS from 8000 to 80 for tutorial demonstration
purposes
- Add documentation for tuning parameters in tutorial, clarifying
relationship between MAX_TRIALS_PER_TASK and TOTAL_TRIALS
The FlashInfer attention plan function introduced a new parameter of
`num_colocated_ctas`. This commit updates the TVM caller side
accordingly.
## Description
This PR fixes a conversion bug that occurs when performing operations on
`bfloat16` tensors.
In short, when applying the `BF16ComputeLegalize` compile pass and
visiting a `BufferStoreNode`, if the stored value's dtype is different
from the buffer's, `DTypeConversion()` should be used instead of a
simple `cast` to apply the appropriate conversion logic.


## Test
I added a test for this situation based on the existing tests.
With the fix, `B[i] = A[i]` turns into `B[i] = bf16tof32(A[i])`
properly, so the test passes.

I'm not really sure whether the structure or name of this added test is
appropriate.
So let me gladly modify it if there is any comment on this.

## Process
### Problem observed
This bug was identified when applying `nn.Linear()` to a `bfloat16`
tensor resulted in excessively large numbers.
While it appears to exist in other operations as well, it's particularly
noticeable when the inner dimension of `MatMul` is a multiple of
`8`(`16` for CUDA and ROCm).
#### Example of problematic code
```python
from ml_dtypes import bfloat16
import numpy as np

import tvm
from tvm.relax.frontend import nn
from tvm.relax.frontend.nn import Tensor, op
from tvm.target import Target

dtype = "bfloat16"  # the repro uses this dtype throughout

n = 10
INNER_DIM = 8 * n   # if INNER_DIM is a multiple of 8

class TestModule(nn.Module):
  def __init__(self):
    self.weight = nn.Parameter((32, INNER_DIM), dtype=dtype)

  def run(self, x: Tensor):
    t = op.matmul(self.weight, x, out_dtype=dtype)
    return t

  def get_default_spec(self):
    mod_spec = {
      "run": {
        "x": nn.spec.Tensor([INNER_DIM, 100], dtype),
        "$": {
          "param_mode": "packed",
          "effect_mode": "none",
        },
      },
    }
    return nn.spec.ModuleSpec.from_raw(mod_spec, self)

def compile_module(...):
  ...

def main():
  target = "metal" # or "cuda", "vulkan", ...
  model = TestModule()
  ex, _ = compile_module(model, target)
  device = tvm.device(target, 0)
  vm = create_vm(ex, device=device)
  frun = vm["run"]
  
  params = []
  param = tvm.runtime.empty(
    (32, INNER_DIM),
    dtype="bfloat16",
    device=device,
  )
  param.copyfrom(np.ones((32, INNER_DIM), dtype=bfloat16))
  params.append(param)
  
  inputs = np.ones((INNER_DIM, 100), dtype=bfloat16)
  arr = frun(inputs, params)
  print(f"{arr=}")  # arr has weird values!
```

In cases where the inner dimension is not a multiple of `8` (or `16`),
the issue was avoided because `PadEinsum` wrapped the expression with
`T.if_then_else()`. `PadEinsum` itself was not the culprit; rather, it
helped identify the issue.

### Problem Identified
I could see the problems were avoided by wrapping an expression with
`T.if_then_else()` or `T.cast()` before applying `BF16ComputeLegalize`
compile pass.

#### Statement with problem
```python
weight_reindex_shared[v0, v1, v2] = weight[v1, v2]
```
#### Statements without problem
```python
# 1) wrapped with T.if_then_else()
weight_reindex_pad_shared[v0, v1, v2] = T.if_then_else(v2 < 511, weight[v1, v2], T.bfloat16(0.0))
# 2) wrapped with T.Cast()
weight_reindex_pad_shared[v0, v1, v2] = T.Cast("float32", weight[v1, v2])
# ...
```

In the `BF16ComputeLegalize` compile pass, if a specific `Expr` (here,
`weight[...]`) is processed through `PromoteToTarget()` (eventually,
`DTypeConversion()`), it is rewritten to the form below (TO-BE), which
applies the proper conversion logic, while the problematic statement
simply applies `T.Cast()` (AS-IS).

#### AS-IS
```python
T.Cast("float32", weight[...])
```
#### TO-BE
```python
T.reinterpret("float32", T.shift_left(T.Cast("uint32", T.reinterpret("uint16", weight[...])), T.uint32(16)))
```

### Fixing the problem
This situation is caused by L332 in the code linked below. Changing this
part to apply `DTypeConversion()` instead of `cast()` resolves the issue.
(In cases where the `Expr` is wrapped with `T.if_then_else()` or similar,
the `Expr` is processed properly by other visit functions through L312 or
L313, so the problem was avoided.)

#### L332
```diff
- value = cast(new_buf->dtype.with_lanes(value.dtype().lanes()), value);
+ value = DTypeConversion(value, new_buf->dtype.with_lanes(value.dtype().lanes()));
```


https://github.com/apache/tvm/blob/26b107fa12672c3b958da222fc87755a69d64c42/src/tir/transforms/unsupported_dtype_legalize.cc#L311-L338
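The TO-BE bit manipulation above can be reproduced in pure Python to see why it is the correct widening (a sketch of the semantics, using `struct` in place of `T.reinterpret`):

```python
import struct

# Sketch of the TO-BE conversion: a bfloat16 value is the top 16 bits of a
# float32, so widening is "shift the 16 payload bits into the high half and
# reinterpret as float32".
def bf16_bits_to_f32(bits16: int) -> float:
    bits32 = (bits16 & 0xFFFF) << 16  # T.shift_left(T.Cast("uint32", ...), 16)
    # T.reinterpret("float32", ...): same bits, viewed as an IEEE float32.
    return struct.unpack("<f", struct.pack("<I", bits32))[0]
```

A plain numeric cast of the raw uint16 bit pattern (the AS-IS behavior) would instead produce values like 16256.0 for the bits of 1.0, which matches the "excessively large numbers" observed above.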
…end (apache#18544)

As per title.

cc @tlopex @guan404ming 

We keep the interface the same as
[`from_fx()`](https://github.com/apache/tvm/blob/ed97234b25a155bc66198ab5cd9e372a4772acec/python/tvm/relax/frontend/torch/fx_translator.py#L1152)
so you can define and pass a custom converter like this:

```python
from tvm.relax.frontend.torch.exported_program_translator import ExportedProgramImporter
def _rms_norm_converter(node: torch.fx.Node, self: ExportedProgramImporter) -> relax.Var:
    x = self.env[node.args[0]]
    torch_dtype = node.args[0].meta["tensor_meta"].dtype
    normalized_shape = node.args[1]
    weight = self.env.get(node.args[2], None) if len(node.args) > 2 else None
    eps = node.args[3] if len(node.args) > 3 else None

    N = len(self.shape_of(x))
    D = len(normalized_shape) if isinstance(normalized_shape, (tuple, list)) else 1
    axes = list(range(N - D, N))

    if weight is None:
        weight = self._convert_torch_tensor_to_relax(
            torch.ones(list(normalized_shape), dtype=torch_dtype)
        )
    eps = torch.finfo(torch_dtype).eps if eps is None else eps

    return self.block_builder.emit(relax.op.nn.rms_norm(x, weight, axes, eps))

mod = from_exported_program(
    exported_program,
    custom_convert_map={"rms_norm.default": _rms_norm_converter},
    run_ep_decomposition=False,
)
```
## How

- Resolve todo by changing from raising error to calling _op_ffi_api.mod
- Add both operators to the parametrized test
- Add edge padding mode
- Add auto pad test
…pache#18554)

## Why

Resolve todo in `fuse_tir.cc` by enhancing unique block name generation
with numeric suffixes.
…ep dim (apache#18583)

## Issue 1: Without Dim
### Summary:
In the `_sum` function (`BaseFXGraphImporter`), after `retrieve_args`,
`args[1]` is `[]` and is still passed into `relax.op.sum`, so the result
is incorrect.
### Steps to Reproduce
- Module
```
class SumWithoutDim(nn.Module):
    def forward(self, x):
        return torch.sum(x)
```
```
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2, 3), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2, 3), dtype="float32") = R.sum(x, axis=[], keepdims=False)
            gv: R.Tuple(R.Tensor((2, 3), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result:

Input: tensor([[1., 1., 1.], [1., 1., 1.]])
Torch output: tensor(6.)
Torch output shape: torch.Size([])
TVM output: [[1. 1. 1.]  [1. 1. 1.]]
TVM output shape: (2, 3)
### Expected
```
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((), dtype="float32") = R.sum(x, axis=None, keepdims=False)
            gv: R.Tuple(R.Tensor((), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result: TVM output: 6.0; TVM output shape: ()

## Issue 2: Keep Dim
### Summary:
In the `_sum` function (`BaseFXGraphImporter`), the `keepdim` value was
previously read only from `node.kwargs` and was not passed into
`relax.op.sum`. Now `keepdim` is also read from `args[2]` and passed
through.
### Steps to Reproduce
- Module
```
class SumKeepDim(nn.Module):
    def forward(self, x):
        return torch.sum(x, dim=1, keepdim=True)
```
```
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2,), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2,), dtype="float32") = R.sum(x, axis=[1], keepdims=False)
            gv: R.Tuple(R.Tensor((2,), dtype="float32")) = (lv,)
            R.output(gv)
        return gv

```
- Result:

Input: tensor([[1., 1., 1.], [1., 1., 1.]])
Torch output: tensor([[3.], [3.]])
Torch output shape: torch.Size([2, 1])
TVM VM output: [3. 3.]
TVM VM output shape: (2,)
### Expected
```
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2, 1), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2, 1), dtype="float32") = R.sum(x, axis=[1], keepdims=True)
            gv: R.Tuple(R.Tensor((2, 1), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result: TVM output: [[3.] [3.]]; TVM output shape: (2, 1)
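Both fixes amount to normalizing the importer's arguments before calling `relax.op.sum`; a pure-Python sketch of that normalization (hypothetical helper, not the actual importer code):

```python
# Sketch of the two fixes: read dim/keepdim from positional args or kwargs,
# and map an empty dim list to "reduce over all axes" (axis=None), matching
# torch.sum(x) semantics.
def normalize_sum_args(args, kwargs):
    dim = args[1] if len(args) > 1 else kwargs.get("dim", None)
    keepdim = args[2] if len(args) > 2 else kwargs.get("keepdim", False)
    if dim == [] or dim == ():
        dim = None  # torch.sum(x) with no dim reduces every axis
    return dim, keepdim
```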
The ACOS operator was producing incorrect results for boundary values
due to poor precision of ASIN's Taylor series expansion near x=±1.0.

Root cause:
- ASIN used a 6-term Taylor series that converges slowly near boundaries
- ACOS was implemented as acos(x) = π/2 - asin(x), inheriting ASIN
errors
- At x=1.0, ASIN error of 0.354874 (22.6%) caused ACOS to output
0.354874 instead of 0.0

Solution:
- Modified ASIN to use system library function (asinf) for |x| >= 0.9
- Modified ACOS to use system library function (acosf) for |x| >= 0.9
- For |x| < 0.9, continue using Taylor series (accurate in this range)

This ensures high precision for boundary values while maintaining the
existing behavior for values in the middle range.

Fixes apache#18580
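The hybrid scheme can be sketched in Python, with `math.asin`/`math.acos` standing in for the system `asinf`/`acosf` calls (the Taylor coefficients below are the standard odd-term asin expansion, not necessarily the exact kernel from the PR):

```python
import math

# Sketch: Taylor series in the well-conditioned middle range, library
# function near the boundaries where the series converges slowly.
def hybrid_asin(x: float) -> float:
    if abs(x) >= 0.9:
        return math.asin(x)  # library call near x = ±1.0
    # 6-term Taylor series: x + x^3/6 + 3x^5/40 + ...
    term, total = x, x
    x2 = x * x
    for n in range(1, 6):
        term *= x2 * (2 * n - 1) ** 2 / ((2 * n) * (2 * n + 1))
        total += term
    return total

def hybrid_acos(x: float) -> float:
    if abs(x) >= 0.9:
        return math.acos(x)  # avoids inheriting asin's boundary error
    return math.pi / 2 - hybrid_asin(x)
```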
…avoid undefined symbol on non-QCOM runtimes (apache#18589)

This PR is a re-open of apache#18581 

The previous PR was created while Jenkins CI was experiencing a disk
space issue and the CI job did not trigger.

## PR Description
Recent OpenCL-Headers update
(KhronosGroup/OpenCL-Headers#277
) added QCOM perf-hint definitions (`CL_CONTEXT_PERF_HINT_QCOM`,
`clSetPerfHintQCOM`) to `cl_ext.h`.

These macros are now defined even on platforms whose OpenCL runtimes
(e.g., PoCL, ICD loaders) do not implement the QCOM extension.

TVM previously enabled the perf-hint code path solely based on the
presence of `CL_CONTEXT_PERF_HINT_QCOM`, causing link errors such as:

```
undefined symbol: clSetPerfHintQCOM
```

This PR guards the QCOM perf-hint logic behind `USE_OPENCL_EXTN_QCOM`,
matching the behavior of other QCOM-specific OpenCL paths (e.g.,
`SetNativePtr`).

## Effects
Prevents accidental linking against unsupported QCOM symbols on non-QCOM
runtimes.
Keeps QCOM builds fully functional when `USE_OPENCL_EXTN_QCOM` is
explicitly enabled.
Aligns TVM’s extension handling across OpenCL code paths.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## How

- Implemented InferLayoutRepeat function that:
  - Preserves layout when axis is specified (with axis transformation)
  - Returns 1D layout when axis is not specified (flatten mode)
- Transforms the axis parameter based on layout changes (e.g., NCHW
axis=1 → NHWC axis=3)
## Why

- LOG(WARNING) is the standard and correct approach throughout the TVM
codebase
- The existing pattern is used consistently in all relax ops (see
test_op_manipulate.py, index.cc, etc.)
- Added test coverage for previously untested scenarios
…pache#18591)

Hi @mshr-h @tlopex,

This PR fixes the issue: DeprecationWarning: invalid escape
sequence `\`.
Any suggestions would be appreciated.

### Root Cause

The backslashes (`\`) inside the docstring

<img width="1318" height="455" alt="image"
src="https://github.com/user-attachments/assets/ca05ac7d-c598-4ec8-8bd3-a182994cbf9b"
/>


### Solution
Use a raw docstring(`r"""`)

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
…ult'] (apache#18574)

## Summary
An error occurs when creating a module from an exported_program that
contains torch.mean without dim.
## Reproduce
- Module:
```
class MeanModule(nn.Module):
    def forward(self, x):
        return torch.mean(x)
...
# Export → Relax
ep = torch_export(m, (x,))
mod = from_exported_program(ep)
```
- Error log:
```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 13
     11 # Export → Relax
     12 ep = torch_export(m, (x,))
---> 13 mod = from_exported_program(ep)
     15 mod.show()
     17 target = "llvm"

File ~/Programming/tvm/python/tvm/relax/frontend/torch/exported_program_translator.py:1783, in from_exported_program(exported_program, keep_params_as_input, unwrap_unit_return_tuple, no_bind_return_tuple, run_ep_decomposition)
   1780 if run_ep_decomposition:
   1781     exported_program = exported_program.run_decompositions()
-> 1783 return ExportedProgramImporter().from_exported_program(
   1784     exported_program,
   1785     keep_params_as_input,
   1786     unwrap_unit_return_tuple,
   1787     no_bind_return_tuple,
   1788 )

File ~/Programming/tvm/python/tvm/relax/frontend/torch/exported_program_translator.py:1642, in ExportedProgramImporter.from_exported_program(self, exported_program, keep_params_as_input, unwrap_unit_return_tuple, no_bind_return_tuple)
   1639 nodes: List[fx.Node] = exported_program.graph.nodes
   1641 # Find all the missing function types
-> 1642 self._check_unsupported_func_type(nodes)
   1644 with self.block_builder.function(
   1645     name=func_name, params=list(inputs_vars.values()).copy(), attrs=func_attrs
   1646 ):
   1647     output = None

File ~/Programming/tvm/python/tvm/relax/frontend/torch/base_fx_graph_translator.py:182, in BaseFXGraphImporter._check_unsupported_func_type(self, nodes)
    174 def _check_unsupported_func_type(self, nodes: List[fx.Node]):
    175     missing_func_types = list(
    176         {
    177             node.target.__name__
   (...)    180         }
    181     )
--> 182     assert not missing_func_types, f"Unsupported function types {missing_func_types}"

AssertionError: Unsupported function types ['mean.default']
```

## Resolution
- Add "mean.default" to create_convert_map in the
ExportedProgramImporter class.
ksgr5566 and others added 30 commits March 1, 2026 14:44
This PR marks tuning-related metaschedule tests as skipped.
## Summary

- Remove `FewShotTuning` pass from Relax transform (C++ implementation,
Python bindings, and test file)
- The pass is unused in the current codebase and can be safely removed

## Files Changed

- `include/tvm/relax/transform.h` — Remove declaration
- `python/tvm/relax/transform/__init__.py` — Remove from imports
- `python/tvm/relax/transform/transform.py` — Remove Python function
- `src/relax/transform/few_shot_tuning.cc` — Delete (C++ implementation)
- `tests/python/relax/test_transform_few_shot_tuning.py` — Delete (test
file)
## Summary

- Phase out 12 unused `tir::attr` constants (`scan_*`, `channel_*`,
`pipeline_*`, `buffer_bind_scope`, `coproc_*`, `loop_scope`) and remove
their dead code paths
- Move 11 S-TIR-owned attributes (`async_*`, `double_buffer_*`,
`fragment_*`, `pragma_loop_partition_hint`, `reduce_scope`,
`virtual_thread`) from `tir::attr` to `s_tir::attr`
- Alphabetize the remaining 15 `tir::attr` constants
Enable opencl target for gpu tests.
Consolidates all Adreno tests under tests/python/relax/backend/adreno
Changes to CLML corresponding to recent changes on json codegen/runtime.
Docker specification for Adreno (ci_gpu + Android SDK, Gradle).
…er (apache#18865)

## Summary

This PR introduces `AllocBufferNode`/`AllocBuffer` as a single TIR
statement that both allocates memory and declares a buffer into scope.
This replaces the previous pattern of `Allocate(var, dtype, shape, cond,
DeclBuffer(buf, body))` with the simpler `AllocBuffer(buf, body)`.

### Main changes

- **New IR node** `AllocBufferNode` with fields `{buffer, annotations,
body}` — same semantics as `DeclBuffer` but also allocates memory
- **TVMScript**: `T.alloc_buffer(shape, dtype, scope)` now emits
`AllocBuffer` directly (statement-level allocation).
`T.sblock_alloc_buffer(...)` for SBlock-level buffer allocation (full
parameter set)
- **All codegen backends** (C, CUDA, Metal, OpenCL, WebGPU, LLVM, NVPTX,
AMDGPU, SPIR-V) updated to handle `AllocBufferNode`
- **All TIR transforms** (storage_rewrite, flatten_buffer,
vectorize_loop, lower_warp_memory, etc.) updated
- **All S-TIR transforms** (compact_buffer_region, merge_shared_memory,
inject_double_buffer, etc.) updated
- **Removed `AllocateNode`** entirely — `AllocBuffer` is now the sole
allocation primitive
- **Removed `AllocDescriptor`** from merge_shared_memory_allocations —
uses `Buffer` objects directly
- **Added `AllocBuffer::ConstantAllocationSize()`** inline helper method

### Design rationale

The old `Allocate + DeclBuffer` pair was a historical artifact:
`AllocateNode` stored raw fields (`buffer_var`, `dtype`, `extents`,
`condition`) separate from the `Buffer` object, requiring pattern
matching (`IsAllocateDeclBufferPattern`) to reconstruct the buffer
association. `AllocBuffer` unifies this into a single node with a proper
`Buffer` reference, simplifying codegen backends and transform passes.

225 files changed, ~3500 insertions/deletions (net near-zero, mostly
mechanical migration).

## Test plan

- [x] All TIR base tests pass
- [x] All TIR transform tests pass
- [x] TVMScript roundtrip tests pass
- [x] S-TIR transform tests pass
- [x] Codegen tests pass
- [x] All-platform minimal tests pass
- [x] C++ functor tests pass
- [x] Pre-commit clean (clang-format, ruff, etc.)
…ntics (apache#18874)

## Summary

Rename `LetStmtNode`/`LetStmt` to `BindNode`/`Bind` and remove the
`body` field.
The variable defined by `Bind(var, value)` is now visible in all
subsequent
statements within the same enclosing scope, rather than being scoped to
a nested body.

This flattens deeply nested let-chains into sequential
`SeqStmt([Bind(...), Bind(...), ...])`,
making the IR easier to read, transform, and analyze.

## Key Changes

- **New `BindNode`**: `{var, value}` — no body field. Variable scope is
the enclosing
  statement's body (For, IfThenElse, AllocBuffer, etc.)
- **ScopeStack pattern**: Passes that need scope-aware cleanup
(ConvertSSA, CSE,
tir_visitor_with_path) use `ScopeStack` instead of manual save/restore
or RAII wrappers
- **All passes migrated**: 89 files updated across codegen backends, TIR
transforms,
  S-TIR transforms, analyses, TVMScript printer/parser/ir_builder
…8875)

## Summary
- Fix Markdown link syntax for build.sh.
- Correct docker/bash.sh usage examples to use proper image names and
shortcuts.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
)

## Summary
- Batch compute dispatches into a single GPUCommandEncoder, flushing on
sync/readback instead of per-dispatch submit to reduce JS↔GPU transition
overhead during LLM decode
- Cache uniform buffers (FIFO/512), bind groups (FIFO/256), shape
tuples, and pool MAP_READ staging buffers to eliminate redundant GPU
object creation
- Fix padding self-assignment bug in `deviceCopyToGPU`
## Summary

Support dynamic `repeats` for ONNX Tile in the Relax frontend.

## Changes

- add a dynamic Tile conversion path for ONNX when `repeats` is a graph
input
- expose `topi.dyn_tile` to the Python/packed TOPI interface
- add frontend tests for dynamic `repeats`

## Validation

- `tests/python/relax/test_frontend_onnx.py -k test_tile_dynamic_repeats
-q`
- local end-to-end repro matches ONNX Runtime

## Issue
Fixes apache#18752
…8876)

## Summary

- Remove `body` field from `AllocBufferNode` and `DeclBufferNode`,
making them flat statements consistent with `Bind`
- Buffer scope extends to end of enclosing scope via flat `SeqStmt`
semantics
- 60 files changed across core IR, codegen backends, transforms, script
IR builder, and tests

## Test plan

- All existing test suites pass (tir-transform, tir-base, tvmscript,
s_tir, codegen, C++)
apache#18881)

## Summary

- Fix `inject_texture_alloc.cc` to replace `AllocBuffer` with just a
`Bind(nd_mem_alloc_with_scope)`, consistent with `LowerVtcmAlloc`
- Previously kept a redundant `AllocBuffer` alongside the `Bind` in a
`SeqStmt`

Follow-up to apache#18876.
…generated `feature.*` attrs (apache#18883)

Fix apache#18882 

`TargetNode::ToConfig()` exports all target attrs, including derived
`feature.*` fields set by target canonicalizers. However,
`TargetInternal::FromConfig()` rejects these keys during schema
validation because they are not declared in the target kind schema. This
breaks round-tripping exported configs through `Target(config)`.

This PR strips `feature.*` keys from the config before
`ConfigSchema::Resolve`, then merges them back afterward. Canonicalizer
output is authoritative — if the canonicalizer re-emits a `feature.*`
key, it overwrites the preserved value. Unknown non-`feature.*` keys
continue to fail validation as before.

Changes:
- src/target/target.cc: Extract and re-merge `feature.*` keys around
schema resolution in `FromConfig()`
- tests/cpp/target_test.cc: Add tests for single-target round-trip,
nested-host round-trip, and continued rejection of unknown non-feature
keys
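The strip-and-merge dance can be sketched over plain dicts (the `resolve` callable here is a stand-in for `ConfigSchema::Resolve`, not the actual C++ API):

```python
# Sketch of the round-trip fix: strip derived feature.* keys before schema
# resolution, then merge them back, letting canonicalizer output win.
def round_trip_config(config: dict, resolve) -> dict:
    features = {k: v for k, v in config.items() if k.startswith("feature.")}
    stripped = {k: v for k, v in config.items() if not k.startswith("feature.")}
    resolved = resolve(stripped)  # still rejects unknown non-feature keys
    # Preserved features first, so any re-emitted feature.* key overwrites.
    return {**features, **resolved}
```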

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ache#18887)

## Summary

- Fix `gpu_2d_continuous_cumsum` using `T.sblock_alloc_buffer` for `Tmp`
buffer that is used across multiple kernel launches (not within a single
sblock). Changed to `T.alloc_buffer`.
- `T.sblock_alloc_buffer` places the buffer in SBlock metadata, making
subsequent references to buffer dimensions (used by `ceil_log2`)
undefined after the AllocBuffer/DeclBuffer refactor.

Fixes apache#18885
## Summary

This PR do a rebuild of TIR Common Subexpression Elimination (CSE) using
a two-phase architecture:

- **Phase 1 — CSEPlanner**: Read-only visitor that builds a scope tree
and expression DAG. Computes a plan (InsertBeforeTable + ExprRemapTable)
in a single pass using shallower-first processing with repr propagation
— no cascade loop needed.
- **Phase 2 — CSERewriter**: Mechanical mutator that inserts
`Bind(cse_var, expr)` statements and substitutes expressions per the
plan.

Key improvements over the old implementation:
- **Simpler architecture**: Two clean classes (planner + rewriter)
instead of interleaved analysis/mutation
- **No cascade loop**: Shallower-first processing with repr propagation
resolves all CSE opportunities in one plan + one rewrite
- **Incremental DAG construction**: Expression depth, children, and
consumed counts computed during bottom-up scan — no separate traversals
- **No single-use bindings**: Consumed count tracking avoids introducing
bindings that would only be used once
- **Unified insertion via VisitStmt**: SeqStmt flattening handles all
insertion contexts uniformly

Other changes:
- Rename `CommonSubexprElimTIR` → `CommonSubexprElim`, remove
`enable_cse_tir` and `identify_equiv_terms` params
- Move old CSE tools (used by cache_index) to
`cache_index_helpers.{cc,h}`
- Remove unused `arith.detect_common_subexpr` API
- Add `T.bind` as lowercase alias for `T.Bind`
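The two-phase idea can be illustrated with a toy CSE over opaque expression strings (a deliberately simplified sketch; the real planner builds a scope tree and expression DAG):

```python
from collections import Counter

# Phase 1 (planner, read-only): count how often each subexpression is
# consumed; only multi-use expressions earn a binding, which is the
# "no single-use bindings" property described above.
def plan(exprs):
    counts = Counter(exprs)
    return {e: f"cse_{i}" for i, (e, n) in enumerate(counts.items()) if n > 1}

# Phase 2 (rewriter, mechanical): insert Bind statements and substitute
# expressions per the plan.
def rewrite(exprs, remap):
    binds = [(var, e) for e, var in remap.items()]
    body = [remap.get(e, e) for e in exprs]
    return binds, body
```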
…on (apache#18889)

## Summary
- Rename `T.Bind` (capitalized) to `T.bind` (lowercase) to match
TVMScript naming convention: statement builders use lowercase
(`T.evaluate`, `T.buffer_store`, `T.bind`), expression constructors use
capitalized (`T.Cast`, `T.Select`, `T.Let`)
- Keep `Bind = bind` backward-compat alias
- Update parser, printer references, and all test files

## Test plan
- [x] tvmscript tests (771 passed)
- [x] tir-transform tests (346 passed)
- [x] tir-base tests (224 passed)
- [x] pre-commit lint passes
## Summary

Reject non-floating inputs for trig-style TIR unary ops.

## Changes

- reject non-floating inputs for trig-style TIR unary ops such as `tan`,
`sin`, and `cos`
- add the same dtype check in the Python TIR wrapper so
`topi.tan(int32)` fails early with a clear `TypeError`
- add regression tests for `tvm.tir.tan(int32)` and `topi.tan(int32)`

## Validation

- `tests/python/tir-base/test_tir_constructor.py -k
'math_unary_constructor_requires_float_dtype or
topi_tan_requires_float_dtype' -q`
- local repro for the original `where -> tan(int32)` case now fails
early with `TypeError`
- verified `topi.tan(float32)` still builds with `target="llvm"`

## Issue

Fixes apache#18769
## Summary

Reject non-float inputs for inverse trigonometric and hyperbolic unary
ops in TOPI.

## Changes

- add a shared floating-point dtype check for inverse unary math ops in
TOPI
- apply the check to `topi.acos`, `topi.acosh`, `topi.asin`,
`topi.asinh`, and `topi.atanh`
- add TE tests covering integer-input rejection for these ops
- add regression tests covering successful LLVM build for both `float32`
and `bfloat16`

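The shared-check pattern can be sketched as a small wrapper applied uniformly to all five ops (the names below are illustrative, not TOPI's internals; `math` functions stand in for the TE compute bodies):

```python
import math

def _with_float_check(op_name, fn):
    # Wrap an elementwise op with the shared floating-point dtype guard
    def checked(value, dtype):
        if not dtype.startswith(("float", "bfloat")):
            raise TypeError(
                f"{op_name} only accepts floating-point dtypes, got {dtype}"
            )
        return fn(value)
    return checked

INVERSE_UNARY_OPS = {
    name: _with_float_check(name, fn)
    for name, fn in [("acos", math.acos), ("acosh", math.acosh),
                     ("asin", math.asin), ("asinh", math.asinh),
                     ("atanh", math.atanh)]
}

print(round(INVERSE_UNARY_OPS["asin"](0.5, "float32"), 4))  # 0.5236
```

Registering one guard over all five ops keeps the error message consistent and avoids duplicating the dtype check in each wrapper.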
## Validation

- `tests/python/te/test_te_create_primfunc.py -k 'topi_float_unary'`
- local repro now fails early with a clear `TypeError` for integer
inputs
- local regression check confirms the valid `float32` and `bfloat16`
paths still compile with LLVM

## Issue

Fixes apache#18729
## Summary
- Remove `Bind = bind` backward-compat alias from `ir.py`
- Remove `"Bind"` from `__all__` exports
- Follows apache#18889 which renamed `T.Bind` → `T.bind`

## Test plan
- [x] tvmscript roundtrip/printer/ir_builder tests pass (232 passed)
- [x] pre-commit lint passes
…pache#18892)

Add a `^` anchor to the version regex so it matches only the top-level
`version = "..."` line instead of all three occurrences, which caused
`hit_count == 3` and a `RuntimeError` in `sync_version`.
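A minimal repro of the anchoring fix (the snippet below is illustrative, not the actual file `sync_version` scans):

```python
import re

# Three lines whose text contains `version = "..."`; only the first is
# the top-level field that should be rewritten.
text = (
    'version = "0.22.0"\n'
    'llvm-version = "18"\n'
    'min_llvm_version = "15"\n'
)

unanchored = re.findall(r'version = "([^"]+)"', text)
anchored = re.findall(r'^version = "([^"]+)"', text, flags=re.MULTILINE)

print(len(unanchored))  # 3 -> would trip the hit_count check
print(anchored)         # ['0.22.0']
```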
…rating processor descriptions (apache#18884)

## Summary

Fix false rejection of `apple-m1`, `apple-m2`, and `apple-m3` as LLVM
CPU names when building TVM with LLVM 22+.

## Behavior

After following the [installation from source
instructions](https://tvm.apache.org/docs/install/from_source.html) and
building against LLVM 22, every `import tvm` produces spurious error
messages:

```
Error: Using LLVM 22.1.0 with `-mcpu=apple-m1` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
Error: Using LLVM 22.1.0 with `-mcpu=apple-m2` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
```

These are triggered by the Metal target tag registrations in
`python/tvm/target/tag_registry/metal.py`, which use `apple-m1` and
`apple-m2` as the host `-mcpu`. The CPUs are silently downgraded to
`generic`.

## Root cause

LLVM 22 reorganized its AArch64 processor table. `apple-m1` through
`apple-m3` are now CPU **aliases** — fully valid and accepted by
`createTargetMachine` and `isCPUStringValid()`, but no longer returned
by `MCSubtargetInfo::getAllProcessorDescriptions()`.

TVM's `LLVMTargetInfo` constructor validates `-mcpu` by enumerating
`getAllProcessorDescriptions()` and checking membership, so it misses
alias-only names.

## Fix

Replace the enumeration-based check with a new `IsValidCPU()` method
that uses `MCSubtargetInfo::isCPUStringValid()`, which correctly handles
both primary names and aliases. This API has been available since at
least LLVM 7, well before TVM's minimum supported version.

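The distinction can be illustrated with a toy model (plain Python, not the LLVM API; CPU sets are made up): membership in an enumeration of primary names misses alias-only CPUs, while a direct validity query accepts both.

```python
# Toy model: `apple-m1` stands for an alias-only name in the LLVM 22 table
PRIMARY_CPUS = {"generic", "apple-a14", "apple-a15"}
CPU_ALIASES = {"apple-m1", "apple-m2", "apple-m3"}

def enumeration_based_check(cpu):
    # old behavior: membership in a getAllProcessorDescriptions()-style list
    return cpu in PRIMARY_CPUS

def is_cpu_string_valid(cpu):
    # new behavior: accept primary names and aliases alike
    return cpu in PRIMARY_CPUS or cpu in CPU_ALIASES

print(enumeration_based_check("apple-m1"))  # False -> spurious downgrade
print(is_cpu_string_valid("apple-m1"))      # True
```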
## Validation

- Built and tested on macOS (Apple Silicon) with LLVM 22.1.0
- `python -c "import tvm; print(tvm.__file__)"` produces clean output
with no error messages

---------

Co-authored-by: Gabriel Guralnick <gabriel@imbue.com>
…xportedProgram importer (apache#18903)

Fixes `TypeError: 'NoneType' object is not iterable` when importing
models with dynamic batch dimensions that contain identity slices (e.g.,
`x[:, :H, :W, :]` on a dynamic batch dim).

**Root cause:** `aten.slice.Tensor(x, 0, 0, INT_MAX)` (an identity slice
on a dynamic dim `s`) produces a result with shape `[T.min(INT_MAX, s),
...]` instead of `[s, ...]`. When this is combined with the original
tensor via `add`, TVM cannot unify the shapes, resulting in
`struct_info.shape = None`. Any subsequent `view`/`reshape` then crashes
calling `list(None)`.

This pattern appears in models like `swin_t`, where shifted window
attention crops padded features with `x[:, :H, :W, :].contiguous()`.

**Changes:**
- `exported_program_translator.py`: Skip `strided_slice` for identity
slices (`start=0, end>=INT_MAX, step=1`) and return the input tensor
directly.
- `base_fx_graph_translator.py`: Guard the identity-reshape check in
`_reshape` against `None` shape.
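The skip condition can be sketched as follows (a hypothetical helper, assuming PyTorch's convention of encoding open-ended slice ends as a very large integer):

```python
INT_MAX = 2**63 - 1  # sys.maxsize-style sentinel used for open-ended slices

def is_identity_slice(start, end, step):
    # aten.slice.Tensor(x, dim, 0, INT_MAX, 1) leaves the tensor unchanged,
    # so the importer can return the input instead of emitting strided_slice
    return (start in (None, 0)) and (end is not None and end >= INT_MAX) \
        and (step in (None, 1))

print(is_identity_slice(0, INT_MAX, 1))   # True  -> skip strided_slice
print(is_identity_slice(0, 7, 1))         # False -> real crop, keep the op
```

Returning the input tensor directly for such slices keeps the dynamic extent `s` intact instead of producing a `T.min(INT_MAX, s)` shape that later fails to unify.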