[WebGPU] Add gating logic for subgroup shuffle primitives#1
Open
ksgr5566 wants to merge 793 commits into CharlieFRuan:pr-0302-webgpu-shuffle
…pache#18534) Add an `__init__` method to the `SearchStrategy` class to prevent direct instantiation of the abstract class. This raises a `TypeError` with a helpful error message instead of causing a segmentation fault when `SearchStrategy()` is called directly or passed to `TuneContext`. Also add a check in `TuneContext.__init__` to ensure abstract `SearchStrategy` instances are not used. Fixes apache#18268
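The guard described above can be sketched in plain Python (class names are illustrative, not TVM's actual hierarchy): the abstract base refuses direct instantiation with a clear `TypeError`, while concrete subclasses construct normally.

```python
# Minimal sketch of the abstract-class guard: the base class checks
# type(self) so that only direct instantiation is rejected.
class SearchStrategy:
    def __init__(self):
        if type(self) is SearchStrategy:
            raise TypeError(
                "SearchStrategy is an abstract base class and cannot be "
                "instantiated directly; use a concrete subclass instead."
            )


class EvolutionarySearch(SearchStrategy):
    """A concrete strategy; passes the type(self) check in the base."""


strategy = EvolutionarySearch()  # fine
try:
    SearchStrategy()             # raises TypeError instead of segfaulting
except TypeError as err:
    print(err)
```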
## Related Issue closes apache#18355 ## Why Converting PyTorch operations like M[:, rows, cols] = x failed because: 1. The TOPI index_put implementation called len() on TVM Tensor objects (unsupported) 2. Index tensors with different shapes (e.g., (2,) and (10,)) couldn't broadcast together ## How - Added broadcasting support following NumPy rules to handle multi-dimensional index tensors - add tests for batched indexing pattern M[:, rows, cols] = x
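The broadcasting behavior the fix follows can be illustrated with plain NumPy (this is not the TOPI code itself): index arrays of compatible shapes such as `(2, 1)` and `(10,)` broadcast to a common `(2, 10)` shape before being used to scatter values.

```python
import numpy as np

# Index arrays broadcast per NumPy rules before scattering.
M = np.zeros((4, 5, 12))
rows = np.array([[1], [3]])   # shape (2, 1)
cols = np.arange(10)          # shape (10,), broadcasts against rows
x = np.ones((2, 10))
M[:, rows, cols] = x          # M[:, rows, cols] selects a (4, 2, 10) view
print(M[0, 1, :10].sum())     # the scattered slice is all ones
```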
…#18536) Hi Committers, this PR fixes issue apache#17936. Any suggestions would be appreciated. ### Root Cause Code paths that expected vector evaluation encountered the scalar-only evaluation `eval_vec_ = false`. ### Solution `BroadcastNode` just replicates the same scalar value, so its evaluation does not require special vector-aware handling. Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
This PR is to resolve the issue apache#18481 , which fixes two bugs in the end-to-end optimization tutorial (`docs/how_to/tutorials/e2e_opt_model.py`) that prevented it from running correctly on GPU devices. ### Changes 1. **Added DefaultGPUSchedule transformation** - Apply `DefaultGPUSchedule` to ensure all GPU functions have proper thread binding. This fixes the memory verification error: "`Variable is directly accessed by host memory... Did you forget to bind?`" 2. **Fixed VM output handling** - Updated to correctly extract tensor from VM output.
…) is false (apache#18525) This commit fixes tvm.error.InternalError: Check failed: (index_map_func.has_value()) is false in [apache#18472](apache#18472) **Why** When using mma for MultiLevelTilingTensorCore, users must manually pass tvm.tir.tensor_intrin as an initializer to register it in LocalBuilder. This is inconsistent with the wmma workflow, where tvm.tir.tensor_intrin is imported by default in [tune_context.py](https://github.com/apache/tvm/blob/main/python/tvm/meta_schedule/tune_context.py#L109) to ensure that the TensorIntrin required by wmma is registered in advance. Additionally, the corresponding error message is not straightforward, which can be confusing for new users who are not familiar with TVM. **How** by adding import tensor_intrin in the default_build --------- Co-authored-by: Balint Cristian <cristian.balint@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
When forcing the use of MMA with MultiLevelTilingTensorCore or directly applying tensorization via the script below, the required shared memory size is significantly overestimated compared to the actual usage; at the same time, the accumulated result of mma is incorrect. This issue stems from two root causes: 1. In `MmaToGlobal::Rewrite`, an extra threadIdx.x dimension is introduced when calling InsertCacheStage, which confuses the memory analysis and leads to inflated shared memory estimates. 2. In `get_mma_sync_intrin`, the offset computation for fragment C in get_index_C is incorrect, resulting in erroneous accumulation results. This PR addresses both issues to ensure accurate shared memory estimation and correct tensor core accumulation behavior. **How** This PR includes the following fixes: 1. Skip the threadIdx.x dimension in `InsertCacheStage` when it is not required, to prevent spurious shared memory overestimation and repeated stores. 2. Correct the offset calculation for fragment C in `get_index_C` to ensure accurate accumulation results during tensor core execution. **Result** The above script produces results that match those of PyTorch. **Env** NVIDIA A100-SXM4-80GB
## Why - The interpolate operation was hardcoded to only support NCHW layout - Users need flexibility to choose the appropriate layout for their target platform ## How - Added default_image_layout parameter - Exposed default_image_layout parameter in the public from_fx()
## Why Remove hard-coded legacy code in CI
## Why - resolve todo in [test_op_gradient_numeric.py](https://github.com/apache/tvm/compare/main...guan404ming:update-conv2d-test?expand=1#diff-65bec2fe9ca46b486e6e1d3412e9092d25d3815bb6173435501bbfab7eefd87b) by unifying the dtype used in conv2d related test - use float32 with reduced range [0, 3] to maintain numerical precision for gradient checking
…he#18550) ## Why Fixes interpolation to support different scaling factors for height and width (e.g., scale_factor=[2.0, 3.0]) ## How - Removed the bug: Stopped extracting just the first element ([0]) from scale_factor lists - Passed full value: Now passes the entire scale_factor (scalar or list) to the underlying implementation, which already handles both correctly
…ON` and MLIR >= 15.0 (apache#18555) Fix a runtime error that occurs when TVM built with `USE_MLIR=ON` and MLIR >= 15.0, as shown below. ``` Traceback (most recent call last): File "<unknown>", line 0, in tvm::arith::__TVMFFIStaticInitFunc2() File "<unknown>", line 0, in tvm::ffi::reflection::ObjectDef<tvm::arith::PresburgerSetNode>::ObjectDef<>() File "<unknown>", line 0, in void tvm::ffi::reflection::ObjectDef<tvm::arith::PresburgerSetNode>::RegisterExtraInfo<>() File "build/src/ffi/object.cc", line 500, in TVMFFITypeRegisterMetadata File "src/ffi/object.cc", line 240, in void tvm::ffi::TypeTable::RegisterTypeMetadata(int32_t, const TVMFFITypeMetadata *) RuntimeError: Overriding arith.PresburgerSet, possible causes: - two ObjectDef<T>() calls for the same T - when we forget to assign _type_key to ObjectRef<Y> that inherits from T - another type with the same key is already registered Cross check the reflection registration. libc++abi: terminating due to uncaught exception of type tvm::ffi::Error ```
… to commit c71aefc) (apache#18545) - Expose max_trials_per_task parameter to static_shape_tuning_pipeline - Adjust default TOTAL_TRIALS from 8000 to 80 for tutorial demonstration purposes - Add documentation for tuning parameters in tutorial, clarifying relationship between MAX_TRIALS_PER_TASK and TOTAL_TRIALS
As per title. Just like [ModuleDict in PyTorch](https://docs.pytorch.org/docs/stable/generated/torch.nn.ModuleDict.html).
## How Add support for masked_select
The FlashInfer attention plan function introduced a new parameter, `num_colocated_ctas`. This commit updates the TVM caller side accordingly.
## Description
This PR fixes a conversion bug that occurs when performing operations on
`bfloat16` tensors.
In short, when the `BF16ComputeLegalize` compile pass visits a
`BufferStoreNode` and the stored value's dtype differs from the
buffer's, `DTypeConversion()` should be used instead of a simple
`cast` to apply the appropriate conversion logic.
## Test
I added a test for this situation based on the existing tests.
With the fix, `B[i] = A[i]` turns into `B[i] = bf16tof32(A[i])`
properly, so the test passes.
I'm not entirely sure whether the structure or name of this added test
is appropriate, so I'd be glad to modify it based on any comments.
## Process
### Problem observed
This bug was identified when applying `nn.Linear()` to a `bfloat16`
tensor resulted in excessively large numbers.
While it appears to exist in other operations as well, it's particularly
noticeable when the inner dimension of `MatMul` is a multiple of
`8` (`16` for CUDA and ROCm).
#### Example of problematic code
```python
from ml_dtypes import bfloat16
import numpy as np

import tvm
from tvm.relax.frontend import nn
from tvm.relax.frontend.nn import Tensor, op
from tvm.target import Target

dtype = "bfloat16"  # the bug concerns bfloat16 tensors

n = 10
INNER_DIM = 8 * n  # if INNER_DIM is a multiple of 8


class TestModule(nn.Module):
    def __init__(self):
        self.weight = nn.Parameter((32, INNER_DIM), dtype=dtype)

    def run(self, x: Tensor):
        t = op.matmul(self.weight, x, out_dtype=dtype)
        return t

    def get_default_spec(self):
        mod_spec = {
            "run": {
                "x": nn.spec.Tensor([INNER_DIM, 100], dtype),
                "$": {
                    "param_mode": "packed",
                    "effect_mode": "none",
                },
            },
        }
        return nn.spec.ModuleSpec.from_raw(mod_spec, self)


def compile_module(...):
    ...


def main():
    target = "metal"  # or "cuda", "vulkan", ...
    model = TestModule()
    ex, _ = compile_module(model, target)
    device = tvm.device(target, 0)
    vm = create_vm(ex, device=device)
    frun = vm["run"]
    params = []
    param = tvm.runtime.empty(
        (32, INNER_DIM),
        dtype="bfloat16",
        device=device,
    )
    param.copyfrom(np.ones((32, INNER_DIM), dtype=bfloat16))
    params.append(param)
    inputs = np.ones((INNER_DIM, 100), dtype=bfloat16)
    arr = frun(inputs, params)
    print(f"{arr=}")  # arr has weird values!
```
In cases where the inner dimension is not a multiple of `8` (or `16`),
the issue was avoided by applying `T.if_then_else()` through
`PadEinsum`. `PadEinsum` itself wasn't the culprit; rather, it helped
identify the issue.
### Problem Identified
I found that the problems were avoided by wrapping an expression with
`T.if_then_else()` or `T.cast()` before applying the
`BF16ComputeLegalize` compile pass.
#### Statement with problem
```python
weight_reindex_shared[v0, v1, v2] = weight[v1, v2]
```
#### Statements without problem
```python
# 1) wrapped with T.if_then_else()
weight_reindex_pad_shared[v0, v1, v2] = T.if_then_else(v2 < 511, weight[v1, v2], T.bfloat16(0.0))
# 2) wrapped with T.Cast()
weight_reindex_pad_shared[v0, v1, v2] = T.Cast("float32", weight[v1, v2])
# ...
```
In the `BF16ComputeLegalize` compile pass, if a specific `Expr` (here,
`weight[...]`) is processed through `PromoteToTarget()` (and eventually
`DTypeConversion()`), it is rewritten into the form below (TO-BE),
which applies the proper conversion logic. The problematic statement,
by contrast, simply applies `T.Cast()` (AS-IS).
#### AS-IS
```python
T.Cast("float32", weight[...])
```
#### TO-BE
```python
T.reinterpret("float32", T.shift_left(T.Cast("uint32", T.reinterpret("uint16", weight[...])), T.uint32(16)))
```
### Fixing the problem
This situation is caused by L332 in the code linked below. Changing this
part to apply `DTypeConversion()` instead of `cast()` resolves the issue.
(In cases where the `Expr` is wrapped with `T.if_then_else()` or
something else, the `Expr` is processed properly in other visit
functions through L312 or L313, which is why the problems were avoided.)
#### L332
```diff
- value = cast(new_buf->dtype.with_lanes(value.dtype().lanes()), value);
+ value = DTypeConversion(value, new_buf->dtype.with_lanes(value.dtype().lanes()));
```
https://github.com/apache/tvm/blob/26b107fa12672c3b958da222fc87755a69d64c42/src/tir/transforms/unsupported_dtype_legalize.cc#L311-L338
…end (apache#18544) As per title. cc @tlopex @guan404ming We keep the interface the same as [`from_fx()`](https://github.com/apache/tvm/blob/ed97234b25a155bc66198ab5cd9e372a4772acec/python/tvm/relax/frontend/torch/fx_translator.py#L1152) so you can define and pass a custom converter like this:

```python
from tvm.relax.frontend.torch.exported_program_translator import ExportedProgramImporter


def _rms_norm_converter(node: torch.fx.Node, self: ExportedProgramImporter) -> relax.Var:
    x = self.env[node.args[0]]
    torch_dtype = node.args[0].meta["tensor_meta"].dtype
    normalized_shape = node.args[1]
    weight = self.env.get(node.args[2], None) if len(node.args) > 2 else None
    eps = node.args[3] if len(node.args) > 3 else None
    N = len(self.shape_of(x))
    D = len(normalized_shape) if isinstance(normalized_shape, (tuple, list)) else 1
    axes = list(range(N - D, N))
    if weight is None:
        weight = self._convert_torch_tensor_to_relax(
            torch.ones(list(normalized_shape), dtype=torch_dtype)
        )
    eps = torch.finfo(torch_dtype).eps if eps is None else 0.00001
    return self.block_builder.emit(relax.op.nn.rms_norm(x, weight, axes, eps))


mod = from_exported_program(
    exported_program,
    custom_convert_map={"rms_norm.default": _rms_norm_converter},
    run_ep_decomposition=False,
)
```
## How - Resolve todo by changing from raising error to calling _op_ffi_api.mod - Add both operators to the parametrized test
- Add edge padding mode - Add auto pad test
…pache#18554) ## Why Resolve todo in `fuse_tir.cc` by enhancing unique block name generation with numeric suffixes
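The numeric-suffix scheme mentioned above can be sketched in plain Python (this is illustrative, not the `fuse_tir.cc` implementation): the first clash gets `_1`, the next `_2`, and so on.

```python
# Sketch of unique-name generation with numeric suffixes.
def make_unique(name, used):
    if name not in used:
        used.add(name)
        return name
    i = 1
    while f"{name}_{i}" in used:  # find the first free suffix
        i += 1
    unique = f"{name}_{i}"
    used.add(unique)
    return unique


used = set()
print([make_unique("block", used) for _ in range(3)])  # ['block', 'block_1', 'block_2']
```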
…ep dim (apache#18583)

## Issue 1: Without Dim

### Summary
In the `_sum` function (`BaseFXGraphImporter`), after `retrieve_args`, `args[1] = []` was still passed into `relax.op.sum`, so the result was incorrect.

### Steps to Reproduce
- Module
```python
class SumWithoutDim(nn.Module):
    def forward(self, x):
        return torch.sum(x)
```
- Imported Relax module
```python
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2, 3), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2, 3), dtype="float32") = R.sum(x, axis=[], keepdims=False)
            gv: R.Tuple(R.Tensor((2, 3), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result:
  - Input: `tensor([[1., 1., 1.], [1., 1., 1.]])`
  - Torch output: `tensor(6.)`, shape `torch.Size([])`
  - TVM output: `[[1. 1. 1.] [1. 1. 1.]]`, shape `(2, 3)`

### Expected
```python
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((), dtype="float32") = R.sum(x, axis=None, keepdims=False)
            gv: R.Tuple(R.Tensor((), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result: TVM output: `6.0`; TVM output shape: `()`

## Issue 2: Keep Dim

### Summary
In the `_sum` function (`BaseFXGraphImporter`), the `keepdim` value was previously read only from `node.kwargs` and not passed into `relax.op.sum`. Now `keepdim` is also read from `args[2]` and passed through.

### Steps to Reproduce
- Module
```python
class SumKeepDim(nn.Module):
    def forward(self, x):
        return torch.sum(x, dim=1, keepdim=True)
```
- Imported Relax module
```python
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2,), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2,), dtype="float32") = R.sum(x, axis=[1], keepdims=False)
            gv: R.Tuple(R.Tensor((2,), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result:
  - Input: `tensor([[1., 1., 1.], [1., 1., 1.]])`
  - Torch output: `tensor([[3.], [3.]])`, shape `torch.Size([2, 1])`
  - TVM VM output: `[3. 3.]`, shape `(2,)`

### Expected
```python
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2, 1), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2, 1), dtype="float32") = R.sum(x, axis=[1], keepdims=True)
            gv: R.Tuple(R.Tensor((2, 1), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result: TVM output: `[[3.] [3.]]`; TVM output shape: `(2, 1)`
…empty vector (apache#18586) As per title.
The ACOS operator was producing incorrect results for boundary values due to poor precision of ASIN's Taylor series expansion near x=±1.0. Root cause: - ASIN used a 6-term Taylor series that converges slowly near boundaries - ACOS was implemented as acos(x) = π/2 - asin(x), inheriting ASIN errors - At x=1.0, ASIN error of 0.354874 (22.6%) caused ACOS to output 0.354874 instead of 0.0 Solution: - Modified ASIN to use system library function (asinf) for |x| >= 0.9 - Modified ACOS to use system library function (acosf) for |x| >= 0.9 - For |x| < 0.9, continue using Taylor series (accurate in this range) This ensures high precision for boundary values while maintaining the existing behavior for values in the middle range. Fixes apache#18580
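The hybrid strategy described above can be sketched in plain Python (this is illustrative, not the actual kernel code): a short Taylor series where it converges well (|x| < 0.9) and the system library function near the boundary. The series coefficients below are the standard arcsin expansion.

```python
import math

def asin_hybrid(x: float) -> float:
    if abs(x) >= 0.9:
        return math.asin(x)  # library call: accurate at x = ±1.0
    # Taylor series: x + x^3/6 + 3x^5/40 + ...; each term is the
    # previous one times x^2 * (2n-1)^2 / (2n * (2n+1)).
    result = term = x
    x2 = x * x
    for n in range(1, 6):
        term *= x2 * (2 * n - 1) ** 2 / ((2 * n) * (2 * n + 1))
        result += term
    return result

def acos_hybrid(x: float) -> float:
    if abs(x) >= 0.9:
        return math.acos(x)  # avoids inheriting the asin series error
    return math.pi / 2 - asin_hybrid(x)

print(acos_hybrid(1.0))  # 0.0, not 0.354874
```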
…avoid undefined symbol on non-QCOM runtimes (apache#18589) This PR is a re-open of apache#18581 The previous PR was created while Jenkins CI was experiencing a disk space issue and the CI job did not trigger. ## PR Description Recent OpenCL-Headers update (KhronosGroup/OpenCL-Headers#277 ) added QCOM perf-hint definitions (`CL_CONTEXT_PERF_HINT_QCOM`, `clSetPerfHintQCOM`) to `cl_ext.h`. These macros are now defined even on platforms whose OpenCL runtimes (e.g., PoCL, ICD loaders) do not implement the QCOM extension. TVM previously enabled the perf-hint code path solely based on the presence of `CL_CONTEXT_PERF_HINT_QCOM`, causing link errors such as: ``` undefined symbol: clSetPerfHintQCOM ``` This PR guards the QCOM perf-hint logic behind `USE_OPENCL_EXTN_QCOM`, matching the behavior of other QCOM-specific OpenCL paths (e.g., `SetNativePtr`). ## Effects Prevents accidental linking against unsupported QCOM symbols on non-QCOM runtimes. Keeps QCOM builds fully functional when `USE_OPENCL_EXTN_QCOM` is explicitly enabled. Aligns TVM’s extension handling across OpenCL code paths. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## How - Implemented InferLayoutRepeat function that: - Preserves layout when axis is specified (with axis transformation) - Returns 1D layout when axis is not specified (flatten mode) - Transforms the axis parameter based on layout changes (e.g., NCHW axis=1 → NHWC axis=3)
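The axis remapping mentioned above (e.g., NCHW axis=1 → NHWC axis=3) can be sketched with a hypothetical helper, not the actual InferLayoutRepeat code: the repeat axis is translated by tracking where each source dimension lands in the target layout.

```python
# Translate an axis index from one layout string to another by
# locating the same dimension letter in the target layout.
def transform_axis(axis: int, src_layout: str, dst_layout: str) -> int:
    dim = src_layout[axis]          # e.g. 'C' for NCHW axis=1
    return dst_layout.index(dim)    # e.g. 3 in NHWC

print(transform_axis(1, "NCHW", "NHWC"))  # 3
```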
## Why - LOG(WARNING) is the standard and correct approach throughout the TVM codebase - The existing pattern is used consistently in all relax ops (see test_op_manipulate.py, index.cc, etc.) - Added test coverage for previously untested scenarios
…pache#18591) Hi @mshr-h @tlopex, this PR fixes the issue: `DeprecationWarning: invalid escape sequence '\'`. Any suggestions would be appreciated. ### Root Cause The backslashes (`\`) inside the docstring <img width="1318" height="455" alt="image" src="https://github.com/user-attachments/assets/ca05ac7d-c598-4ec8-8bd3-a182994cbf9b" /> ### Solution Use a raw docstring (`r"""`) Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
…ult'] (apache#18574)

## Summary
An error occurs when creating a module from an exported program that contains `torch.mean` without `dim`.

## Reproduce
- Module:
```python
class MeanModule(nn.Module):
    def forward(self, x):
        return torch.mean(x)

# Export → Relax
ep = torch_export(m, (x,))
mod = from_exported_program(ep)
```
- Error log (abridged):
```
AssertionError                            Traceback (most recent call last)
Cell In[2], line 13
     12 ep = torch_export(m, (x,))
---> 13 mod = from_exported_program(ep)

File ~/Programming/tvm/python/tvm/relax/frontend/torch/exported_program_translator.py:1783, in from_exported_program(...)
-> 1783 return ExportedProgramImporter().from_exported_program(...)

File ~/Programming/tvm/python/tvm/relax/frontend/torch/exported_program_translator.py:1642, in ExportedProgramImporter.from_exported_program(...)
-> 1642 self._check_unsupported_func_type(nodes)

File ~/Programming/tvm/python/tvm/relax/frontend/torch/base_fx_graph_translator.py:182, in BaseFXGraphImporter._check_unsupported_func_type(self, nodes)
--> 182 assert not missing_func_types, f"Unsupported function types {missing_func_types}"
AssertionError: Unsupported function types ['mean.default']
```

## Resolve
- Add `"mean.default"` to `create_convert_map` in the `ExportedProgramImporter` class.
…st for all shuffle ops
…d TVMFFIABIBuilder (apache#18857)
This PR marks tuning-related metaschedule tests as skipped.
## Summary - Remove `FewShotTuning` pass from Relax transform (C++ implementation, Python bindings, and test file) - The pass is unused in the current codebase and can be safely removed ## Files Changed - `include/tvm/relax/transform.h` — Remove declaration - `python/tvm/relax/transform/__init__.py` — Remove from imports - `python/tvm/relax/transform/transform.py` — Remove Python function - `src/relax/transform/few_shot_tuning.cc` — Delete (C++ implementation) - `tests/python/relax/test_transform_few_shot_tuning.py` — Delete (test file)
## Summary - Phase out 12 unused `tir::attr` constants (`scan_*`, `channel_*`, `pipeline_*`, `buffer_bind_scope`, `coproc_*`, `loop_scope`) and remove their dead code paths - Move 11 S-TIR-owned attributes (`async_*`, `double_buffer_*`, `fragment_*`, `pragma_loop_partition_hint`, `reduce_scope`, `virtual_thread`) from `tir::attr` to `s_tir::attr` - Alphabetize the remaining 15 `tir::attr` constants
Enable the OpenCL target for GPU tests. Consolidate all Adreno tests under `tests/python/relax/backend/adreno`. Update CLML to match recent changes to the JSON codegen/runtime. Add a Docker specification for Adreno (ci_gpu + Android SDK, Gradle).
…er (apache#18865) ## Summary This PR introduces `AllocBufferNode`/`AllocBuffer` as a single TIR statement that both allocates memory and declares a buffer into scope. This replaces the previous pattern of `Allocate(var, dtype, shape, cond, DeclBuffer(buf, body))` with the simpler `AllocBuffer(buf, body)`. ### Main changes - **New IR node** `AllocBufferNode` with fields `{buffer, annotations, body}` — same semantics as `DeclBuffer` but also allocates memory - **TVMScript**: `T.alloc_buffer(shape, dtype, scope)` now emits `AllocBuffer` directly (statement-level allocation). `T.sblock_alloc_buffer(...)` for SBlock-level buffer allocation (full parameter set) - **All codegen backends** (C, CUDA, Metal, OpenCL, WebGPU, LLVM, NVPTX, AMDGPU, SPIR-V) updated to handle `AllocBufferNode` - **All TIR transforms** (storage_rewrite, flatten_buffer, vectorize_loop, lower_warp_memory, etc.) updated - **All S-TIR transforms** (compact_buffer_region, merge_shared_memory, inject_double_buffer, etc.) updated - **Removed `AllocateNode`** entirely — `AllocBuffer` is now the sole allocation primitive - **Removed `AllocDescriptor`** from merge_shared_memory_allocations — uses `Buffer` objects directly - **Added `AllocBuffer::ConstantAllocationSize()`** inline helper method ### Design rationale The old `Allocate + DeclBuffer` pair was a historical artifact: `AllocateNode` stored raw fields (`buffer_var`, `dtype`, `extents`, `condition`) separate from the `Buffer` object, requiring pattern matching (`IsAllocateDeclBufferPattern`) to reconstruct the buffer association. `AllocBuffer` unifies this into a single node with a proper `Buffer` reference, simplifying codegen backends and transform passes. 225 files changed, ~3500 insertions/deletions (net near-zero, mostly mechanical migration). 
## Test plan - [x] All TIR base tests pass - [x] All TIR transform tests pass - [x] TVMScript roundtrip tests pass - [x] S-TIR transform tests pass - [x] Codegen tests pass - [x] All-platform minimal tests pass - [x] C++ functor tests pass - [x] Pre-commit clean (clang-format, ruff, etc.)
…ntics (apache#18874) ## Summary Rename `LetStmtNode`/`LetStmt` to `BindNode`/`Bind` and remove the `body` field. The variable defined by `Bind(var, value)` is now visible in all subsequent statements within the same enclosing scope, rather than being scoped to a nested body. This flattens deeply nested let-chains into sequential `SeqStmt([Bind(...), Bind(...), ...])`, making the IR easier to read, transform, and analyze. ## Key Changes - **New `BindNode`**: `{var, value}` — no body field. Variable scope is the enclosing statement's body (For, IfThenElse, AllocBuffer, etc.) - **ScopeStack pattern**: Passes that need scope-aware cleanup (ConvertSSA, CSE, tir_visitor_with_path) use `ScopeStack` instead of manual save/restore or RAII wrappers - **All passes migrated**: 89 files updated across codegen backends, TIR transforms, S-TIR transforms, analyses, TVMScript printer/parser/ir_builder
…8875) ## Summary - Fix Markdown link syntax for build.sh. - Correct docker/bash.sh usage examples to use proper image names and shortcuts. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
) ## Summary - Batch compute dispatches into a single GPUCommandEncoder, flushing on sync/readback instead of per-dispatch submit to reduce JS↔GPU transition overhead during LLM decode - Cache uniform buffers (FIFO/512), bind groups (FIFO/256), shape tuples, and pool MAP_READ staging buffers to eliminate redundant GPU object creation - Fix padding self-assignment bug in `deviceCopyToGPU`
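The FIFO caching idea above can be sketched in plain Python (the actual runtime is TypeScript/WebGPU; capacities like 512 come from the summary): entries are evicted in insertion order once the capacity is hit, and repeated lookups reuse the cached object instead of creating a new one.

```python
from collections import OrderedDict

# FIFO (not LRU) cache: eviction follows insertion order, and lookups
# do not refresh an entry's position.
class FIFOCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store = OrderedDict()

    def get_or_create(self, key, factory):
        if key in self._store:
            return self._store[key]          # reuse: no new GPU object
        if len(self._store) >= self.capacity:
            self._store.popitem(last=False)  # evict oldest insertion
        value = self._store[key] = factory()
        return value
```

With capacity 512 for uniform buffers and 256 for bind groups, this bounds GPU object creation without the bookkeeping cost of full LRU.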
## Summary Support dynamic `repeats` for ONNX Tile in the Relax frontend. ## Changes - add a dynamic Tile conversion path for ONNX when `repeats` is a graph input - expose `topi.dyn_tile` to the Python/packed TOPI interface - add frontend tests for dynamic `repeats` ## Validation - `tests/python/relax/test_frontend_onnx.py -k test_tile_dynamic_repeats -q` - local end-to-end repro matches ONNX Runtime ## Issue Fixes apache#18752
…8876) ## Summary - Remove `body` field from `AllocBufferNode` and `DeclBufferNode`, making them flat statements consistent with `Bind` - Buffer scope extends to end of enclosing scope via flat `SeqStmt` semantics - 60 files changed across core IR, codegen backends, transforms, script IR builder, and tests ## Test plan - All existing test suites pass (tir-transform, tir-base, tvmscript, s_tir, codegen, C++)
apache#18881) ## Summary - Fix `inject_texture_alloc.cc` to replace `AllocBuffer` with just a `Bind(nd_mem_alloc_with_scope)`, consistent with `LowerVtcmAlloc` - Previously kept a redundant `AllocBuffer` alongside the `Bind` in a `SeqStmt` Follow-up to apache#18876.
…generated `feature.*` attrs (apache#18883) Fix apache#18882 `TargetNode::ToConfig()` exports all target attrs, including derived `feature.*` fields set by target canonicalizers. However, `TargetInternal::FromConfig()` rejects these keys during schema validation because they are not declared in the target kind schema. This breaks round-tripping exported configs through `Target(config)`. This PR strips `feature.*` keys from the config before `ConfigSchema::Resolve`, then merges them back afterward. Canonicalizer output is authoritative — if the canonicalizer re-emits a `feature.*` key, it overwrites the preserved value. Unknown non-`feature.*` keys continue to fail validation as before. Changes: - src/target/target.cc: Extract and re-merge `feature.*` keys around schema resolution in `FromConfig()` - tests/cpp/target_test.cc: Add tests for single-target round-trip, nested-host round-trip, and continued rejection of unknown non-feature keys --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
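The strip-then-merge approach described above can be illustrated with plain Python dicts (this is a sketch, not the actual C++ `TargetInternal` code): `feature.*` keys are set aside before schema validation and restored afterward, with any key the canonicalizer re-emits taking precedence.

```python
# Strip feature.* keys before schema resolution, merge them back after.
def resolve_config(config, schema_resolve):
    preserved = {k: v for k, v in config.items() if k.startswith("feature.")}
    stripped = {k: v for k, v in config.items() if not k.startswith("feature.")}
    resolved = schema_resolve(stripped)  # unknown non-feature keys still fail here
    for key, value in preserved.items():
        resolved.setdefault(key, value)  # canonicalizer output wins on clashes
    return resolved
```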
…ache#18887) ## Summary - Fix `gpu_2d_continuous_cumsum` using `T.sblock_alloc_buffer` for `Tmp` buffer that is used across multiple kernel launches (not within a single sblock). Changed to `T.alloc_buffer`. - `T.sblock_alloc_buffer` places the buffer in SBlock metadata, making subsequent references to buffer dimensions (used by `ceil_log2`) undefined after the AllocBuffer/DeclBuffer refactor. Fixes apache#18885
## Summary
This PR rebuilds TIR Common Subexpression Elimination (CSE) using
a two-phase architecture:
- **Phase 1 — CSEPlanner**: Read-only visitor that builds a scope tree
and expression DAG. Computes a plan (InsertBeforeTable + ExprRemapTable)
in a single pass using shallower-first processing with repr propagation
— no cascade loop needed.
- **Phase 2 — CSERewriter**: Mechanical mutator that inserts
`Bind(cse_var, expr)` statements and substitutes expressions per the
plan.
Key improvements over the old implementation:
- **Simpler architecture**: Two clean classes (planner + rewriter)
instead of interleaved analysis/mutation
- **No cascade loop**: Shallower-first processing with repr propagation
resolves all CSE opportunities in one plan + one rewrite
- **Incremental DAG construction**: Expression depth, children, and
consumed counts computed during bottom-up scan — no separate traversals
- **No single-use bindings**: Consumed count tracking avoids introducing
bindings that would only be used once
- **Unified insertion via VisitStmt**: SeqStmt flattening handles all
insertion contexts uniformly
Other changes:
- Rename `CommonSubexprElimTIR` → `CommonSubexprElim`, remove
`enable_cse_tir` and `identify_equiv_terms` params
- Move old CSE tools (used by cache_index) to
`cache_index_helpers.{cc,h}`
- Remove unused `arith.detect_common_subexpr` API
- Add `T.bind` as lowercase alias for `T.Bind`
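The consumed-count idea above can be illustrated with a toy example on plain Python ASTs (not TIR): a subexpression is only worth binding to a `cse_var` when it occurs more than once.

```python
import ast
from collections import Counter

# Count structurally identical binary subexpressions; only those seen
# at least twice would earn a binding.
tree = ast.parse("y = (a + b) * (a + b) + c")
counts = Counter(
    ast.dump(node) for node in ast.walk(tree) if isinstance(node, ast.BinOp)
)
repeated = [expr for expr, n in counts.items() if n > 1]
print(len(repeated))  # only (a + b) repeats, so only it earns a binding
```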
…on (apache#18889) ## Summary - Rename `T.Bind` (capitalized) to `T.bind` (lowercase) to match TVMScript naming convention: statement builders use lowercase (`T.evaluate`, `T.buffer_store`, `T.bind`), expression constructors use capitalized (`T.Cast`, `T.Select`, `T.Let`) - Keep `Bind = bind` backward-compat alias - Update parser, printer references, and all test files ## Test plan - [x] tvmscript tests (771 passed) - [x] tir-transform tests (346 passed) - [x] tir-base tests (224 passed) - [x] pre-commit lint passes
## Summary Reject non-floating inputs for trig-style TIR unary ops. ## Changes - reject non-floating inputs for trig-style TIR unary ops such as `tan`, `sin`, and `cos` - add the same dtype check in the Python TIR wrapper so `topi.tan(int32)` fails early with a clear `TypeError` - add regression tests for `tvm.tir.tan(int32)` and `topi.tan(int32)` ## Validation - `tests/python/tir-base/test_tir_constructor.py -k 'math_unary_constructor_requires_float_dtype or topi_tan_requires_float_dtype' -q` - local repro for the original `where -> tan(int32)` case now fails early with `TypeError` - verified `topi.tan(float32)` still builds with `target="llvm"` ## Issue Fixes apache#18769
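The early dtype check described above can be sketched with a hypothetical helper (not TVM's actual wrapper): non-float inputs fail with a clear `TypeError` before any lowering happens.

```python
# Reject non-floating dtypes early with a descriptive error.
def check_float_dtype(op_name: str, dtype: str) -> None:
    if not dtype.startswith(("float", "bfloat")):
        raise TypeError(
            f"{op_name} requires a floating-point input dtype, got {dtype}"
        )

check_float_dtype("tan", "float32")    # ok, including bfloat16
try:
    check_float_dtype("tan", "int32")  # rejected before lowering
except TypeError as err:
    print(err)
```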
## Summary Reject non-float inputs for inverse trigonometric and hyperbolic unary ops in TOPI. ## Changes - add a shared floating-point dtype check for inverse unary math ops in TOPI - apply the check to `topi.acos`, `topi.acosh`, `topi.asin`, `topi.asinh`, and `topi.atanh` - add TE tests covering integer-input rejection for these ops - add regression tests covering successful LLVM build for both `float32` and `bfloat16` ## Validation - `tests/python/te/test_te_create_primfunc.py -k 'topi_float_unary'` - local repro now fails early with a clear `TypeError` for integer inputs - local regression check confirms the valid `float32` and `bfloat16` paths still compile with LLVM ## Issue Fixes apache#18729
## Summary - Remove `Bind = bind` backward-compat alias from `ir.py` - Remove `"Bind"` from `__all__` exports - Follows apache#18889 which renamed `T.Bind` → `T.bind` ## Test plan - [x] tvmscript roundtrip/printer/ir_builder tests pass (232 passed) - [x] pre-commit lint passes
…pache#18892) Add a `^` anchor to the version regex so it matches only the top-level `version = "..."` line instead of all three occurrences, which caused `hit_count == 3` and a `RuntimeError` in `sync_version`.
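A minimal illustration of the fix (the file contents below are made up for the example): with `re.MULTILINE`, the `^` anchor limits the match to the unindented top-level `version = "..."` line instead of every occurrence.

```python
import re

text = '''version = "0.22.0"
[tool.something]
    version = "1.0"
dep = { version = "2.0" }
'''
unanchored = re.findall(r'version\s*=\s*"[^"]+"', text)
anchored = re.findall(r'^version\s*=\s*"[^"]+"', text, flags=re.MULTILINE)
print(len(unanchored), len(anchored))  # 3 1
```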
…rating processor descriptions (apache#18884) ## Summary Fix false rejection of `apple-m1`, `apple-m2`, and `apple-m3` as LLVM CPU names when building TVM with LLVM 22+. ## Behavior After following the [installation from source instructions](https://tvm.apache.org/docs/install/from_source.html) and building against LLVM 22, every `import tvm` produces spurious error messages: ``` Error: Using LLVM 22.1.0 with `-mcpu=apple-m1` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` Error: Using LLVM 22.1.0 with `-mcpu=apple-m2` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic` ``` These are triggered by the Metal target tag registrations in `python/tvm/target/tag_registry/metal.py`, which use `apple-m1` and `apple-m2` as the host `-mcpu`. The CPUs are silently downgraded to `generic`. ## Root cause LLVM 22 reorganized its AArch64 processor table. `apple-m1` through `apple-m3` are now CPU **aliases** — fully valid and accepted by `createTargetMachine` and `isCPUStringValid()`, but no longer returned by `MCSubtargetInfo::getAllProcessorDescriptions()`. TVM's `LLVMTargetInfo` constructor validates `-mcpu` by enumerating `getAllProcessorDescriptions()` and checking membership, so it misses alias-only names. ## Fix Replace the enumeration-based check with a new `IsValidCPU()` method that uses `MCSubtargetInfo::isCPUStringValid()`, which correctly handles both primary names and aliases. This API has been available since at least LLVM 7, well before TVM's minimum supported version. ## Validation - Built and tested on macOS (Apple Silicon) with LLVM 22.1.0 - `python -c "import tvm; print(tvm.__file__)"` produces clean output with no error messages --------- Co-authored-by: Gabriel Guralnick <gabriel@imbue.com>
…xportedProgram importer (apache#18903)

Fixes `TypeError: 'NoneType' object is not iterable` when importing models with dynamic batch dimensions that contain identity slices (e.g., `x[:, :H, :W, :]` on a dynamic batch dim).

**Root cause:** `aten.slice.Tensor(x, 0, 0, INT_MAX)` (an identity slice on a dynamic dim `s`) produces a result with shape `[T.min(INT_MAX, s), ...]` instead of `[s, ...]`. When this is combined with the original tensor via `add`, TVM cannot unify the shapes, resulting in `struct_info.shape = None`. Any subsequent `view`/`reshape` then crashes calling `list(None)`.

This pattern appears in models like `swin_t`, where shifted window attention crops padded features with `x[:, :H, :W, :].contiguous()`.

**Changes:**
- `exported_program_translator.py`: Skip `strided_slice` for identity slices (`start=0, end>=INT_MAX, step=1`) and return the input tensor directly.
- `base_fx_graph_translator.py`: Guard the identity-reshape check in `_reshape` against `None` shape.
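The identity-slice condition described above can be sketched as a small predicate; `is_identity_slice` is a hypothetical helper that illustrates the check, not the importer's actual code:

```python
INT64_MAX = 2**63 - 1

def is_identity_slice(start, end, step):
    """Return True when slice(start, end, step) on a dimension leaves it
    untouched, so an importer could return the input tensor directly
    instead of emitting a strided_slice op (illustrative helper)."""
    return (
        (start is None or start == 0)
        and (end is None or end >= INT64_MAX)
        and (step is None or step == 1)
    )

# x[:, :H, :W, :] lowers dim 0 (the dynamic batch dim) to this pattern:
print(is_identity_slice(0, INT64_MAX, 1))  # True: skip the op entirely
print(is_identity_slice(0, 7, 1))          # False: a real crop, keep the op
```

Skipping the op for the identity case means the result keeps the symbolic shape `[s, ...]` instead of `[T.min(INT_MAX, s), ...]`, so later shape unification succeeds.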
Summary
This adds gating logic on top of apache#17699 to support optional subgroup shuffle
primitives based on a compile-time flag.
Problem
The PR apache#17699 always generates subgroup shuffle ops when targeting WebGPU.
However, not all WebGPU devices support subgroups. We need a way to generate subgroup shuffle ops only when the device supports them and the user opts in, and to fall back to shared-memory reductions otherwise.
Solution
Implement gating via TVM target parameter:
- `thread_warp_size=1` (the default) disables warp reductions (uses shared memory + barriers)
- Update `WebGPUAttrs()` to set `thread_warp_size=32` when `supports_subgroups=true`
- Add an `--enable-subgroups` CLI flag in mlc-llm to surface the option to users

The gating happens at the reduction path selection level (`IsWarpReduction()` in `lower_thread_allreduce.cc`), ensuring subgroup ops are never generated unless explicitly enabled.

Changes
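The reduction-path selection can be sketched roughly as follows. This is a simplified Python model of the decision made in the C++ `IsWarpReduction()` pass, under assumed names and a simplified extent check, not the actual pass code:

```python
def is_warp_reduction(target_attrs, reduce_extent):
    """Hypothetical model of the gating in lower_thread_allreduce.cc:
    with thread_warp_size=1 (the WebGPU default) the warp/subgroup path
    is never taken, so no subgroupShuffle* ops are emitted."""
    warp_size = target_attrs.get("thread_warp_size", 1)
    if warp_size <= 1:
        return False  # fall back to shared memory + barriers
    # Simplified eligibility check: the reduction must fit in one warp.
    return reduce_extent <= warp_size

default_webgpu = {"thread_warp_size": 1}
subgroups_on = {"thread_warp_size": 32}  # set when supports_subgroups=true

print(is_warp_reduction(default_webgpu, 32))  # False: shared-memory path
print(is_warp_reduction(subgroups_on, 32))    # True: subgroup shuffle path
```

Because the gate sits at path selection rather than codegen, a target that never sets `thread_warp_size > 1` cannot reach the subgroup codegen at all.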
Testing
Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared-memory reductions; with the flag, it generates `subgroupShuffle*` ops.
Both generated WGSL outputs are here: https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f