
[WebGPU] Add gating logic for subgroup shuffle primitives#1

Open
ksgr5566 wants to merge 793 commits into CharlieFRuan:pr-0302-webgpu-shuffle from ksgr5566:webgpu-subgroup-test

Conversation


@ksgr5566 ksgr5566 commented Feb 24, 2026

Summary

This adds gating logic on top of apache#17699 to support optional subgroup shuffle
primitives based on a compile-time flag.

Problem

The PR apache#17699 always generates subgroup shuffle ops when targeting WebGPU.
However, not all WebGPU devices support subgroups. We need a way to:

  • Default to shared memory reductions (universally compatible)
  • Optionally enable subgroup shuffles for devices that support them

Solution

Implement gating via TVM target parameter:

  • Default thread_warp_size=1 disables warp reductions (uses shared memory + barriers)
  • Add target parser UpdateWebGPUAttrs() that sets thread_warp_size=32 when supports_subgroups=true
  • Add --enable-subgroups CLI flag in mlc-llm to surface the option to users

The gating happens at the reduction path selection level (IsWarpReduction() in
lower_thread_allreduce.cc), ensuring subgroup ops are never generated unless explicitly enabled.
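The selection logic can be sketched in a few lines of Python (hypothetical helper names; the real check lives in `IsWarpReduction()` in `lower_thread_allreduce.cc`, and `UpdateWebGPUAttrs()` is the target parser described above):

```python
# Sketch of the reduction-path gating. pick_warp_size stands in for the
# UpdateWebGPUAttrs() target parser; use_warp_reduction stands in for the
# IsWarpReduction() check. Names other than the target attributes are
# hypothetical.

def pick_warp_size(target_attrs: dict) -> int:
    # Default thread_warp_size=1 disables warp/subgroup reductions.
    if target_attrs.get("supports_subgroups", False):
        return 32  # the parser sets thread_warp_size=32
    return 1

def use_warp_reduction(thread_warp_size: int) -> bool:
    # Subgroup shuffle ops are emitted only when warp size > 1;
    # otherwise the shared memory + barrier path is used.
    return thread_warp_size > 1
```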

Changes

Testing

Tested with Llama-3.2-1B-q4f16_1. The baseline (no flag) uses shared memory reductions;
with the flag, the compiler generates subgroupShuffle* ops.
Both generated WGSL outputs are here: https://gist.github.com/ksgr5566/301664a5dda3e46f44092be4d09b2d4f

guan404ming and others added 30 commits November 30, 2025 19:13
…pache#18534)

Add __init__ method to SearchStrategy class to prevent direct
instantiation of the abstract class. This raises a TypeError with a
helpful error message instead of causing a segmentation fault when
SearchStrategy() is called directly or passed to TuneContext.

Also add additional check in TuneContext.__init__ to ensure abstract
SearchStrategy instances are not used.

Fixes apache#18268
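The guard described above can be sketched in pure Python (a minimal stand-in, not TVM's actual implementation):

```python
# Minimal sketch of an abstract-base guard: raise a clear TypeError on
# direct instantiation instead of letting downstream code segfault on an
# uninitialized handle. Class names mirror the commit message.
class SearchStrategy:
    def __init__(self):
        if type(self) is SearchStrategy:
            raise TypeError(
                "SearchStrategy is abstract; instantiate a concrete "
                "subclass instead"
            )

class MySearchStrategy(SearchStrategy):
    """A concrete subclass instantiates normally."""
```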
## Related Issue

closes apache#18355

## Why

Converting PyTorch operations like M[:, rows, cols] = x failed because:
1. The TOPI index_put implementation called len() on TVM Tensor objects
(unsupported)
2. Index tensors with different shapes (e.g., (2,) and (10,)) couldn't
broadcast together

## How

- Added broadcasting support following NumPy rules to handle
multi-dimensional index tensors
- add tests for batched indexing pattern M[:, rows, cols] = x
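The NumPy-style broadcasting rule the fix follows can be sketched as a small shape function (a stand-in for the TOPI logic, not the actual code):

```python
from itertools import zip_longest

# Sketch of NumPy broadcasting over index-tensor shapes: walk both shapes
# right-aligned; dimensions must match or one of them must be 1.
def broadcast_shapes(a, b):
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or x == 1 or y == 1:
            out.append(max(x, y))
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(reversed(out))
```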
…#18536)

Hi Committers,

This PR fixes issue apache#17936. Any suggestions would be
appreciated.

### Root Cause

Code paths that expected vector evaluation encountered the scalar-only
evaluation path (`eval_vec_ = false`).

### Solution

`BroadcastNode` just replicates the same scalar value, so its evaluation
does not require special vector-aware handling.

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
This PR resolves issue apache#18481 by fixing two bugs in the
end-to-end optimization tutorial
(`docs/how_to/tutorials/e2e_opt_model.py`) that prevented it from
running correctly on GPU devices.

### Changes

1. **Added DefaultGPUSchedule transformation**
- Apply `DefaultGPUSchedule` to ensure all GPU functions have proper
thread binding. This fixes the memory verification error: "`Variable is
directly accessed by host memory... Did you forget to bind?`"
  

2. **Fixed VM output handling**
   - Updated to correctly extract tensor from VM output.
…) is false (apache#18525)

This commit fixes tvm.error.InternalError: Check failed:
(index_map_func.has_value()) is false in
[apache#18472](apache#18472)

**Why**
When using mma for MultiLevelTilingTensorCore, users must manually pass
tvm.tir.tensor_intrin as an initializer to register it in LocalBuilder.
This is inconsistent with the wmma workflow, where tvm.tir.tensor_intrin
is imported by default in
[tune_context.py](https://github.com/apache/tvm/blob/main/python/tvm/meta_schedule/tune_context.py#L109)
to ensure that the TensorIntrin required by wmma is registered in
advance. Additionally, the corresponding error message is not
straightforward, which can be confusing for new users who are not
familiar with TVM.

**How**
By importing `tensor_intrin` in the default build step.

---------

Co-authored-by: Balint Cristian <cristian.balint@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
When forcing the use of MMA with MultiLevelTilingTensorCore or directly
applying tensorization via the script below, the required shared memory
size is significantly overestimated compared to the actual usage; at the
same time, the accumulated result of mma is also incorrect. This issue
stems from two root causes:

1. In `MmaToGlobal::Rewrite`, an extra threadIdx.x dimension is
introduced when calling InsertCacheStage, which confuses the memory
analysis and leads to inflated shared memory estimates.
2. In `get_mma_sync_intrin`, the offset computation for fragment C in
get_index_C is incorrect, resulting in erroneous accumulation results.

This PR addresses both issues to ensure accurate shared memory
estimation and correct tensor core accumulation behavior.

**How**
This PR includes the following fixes:

1. Skip the threadIdx.x dimension in `InsertCacheStage` when it is not
required, to prevent spurious shared memory overestimation and repeated
stores.
2. Correct the offset calculation for fragment C in `get_index_C` to
ensure accurate accumulation results during tensor core execution.

**Result**
The above script produces results that match those of PyTorch.

**Env**
NVIDIA A100-SXM4-80GB
## Why

- The interpolate operation was hardcoded to only support NCHW layout
- Users need flexibility to choose the appropriate layout for their
target platform

## How

- Added default_image_layout parameter
- Exposed default_image_layout parameter in the public from_fx()
## Why

Remove hard-coded legacy code in CI.
## Why

- resolve todo in
[test_op_gradient_numeric.py](https://github.com/apache/tvm/compare/main...guan404ming:update-conv2d-test?expand=1#diff-65bec2fe9ca46b486e6e1d3412e9092d25d3815bb6173435501bbfab7eefd87b)
by unifying the dtype used in conv2d related test
- use float32 with reduced range [0, 3] to maintain numerical precision
for gradient checking
…he#18550)

## Why

Fixes interpolation to support different scaling factors for height and
width (e.g., scale_factor=[2.0, 3.0])

## How

- Removed the bug: Stopped extracting just the first element ([0]) from
scale_factor lists
- Passed full value: Now passes the entire scale_factor (scalar or list)
to the underlying implementation, which already handles both correctly
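The fix can be sketched as a small normalization helper (a hypothetical stand-in for the frontend logic, not TVM's actual code):

```python
# Sketch of the fix: pass the full scale_factor through, normalizing a
# scalar to a per-dimension list instead of keeping only the first element.
def normalize_scale_factor(scale_factor, ndim_spatial=2):
    if isinstance(scale_factor, (list, tuple)):
        if len(scale_factor) != ndim_spatial:
            raise ValueError("scale_factor length must match spatial dims")
        return list(scale_factor)
    # Scalar: apply the same factor to every spatial dimension.
    return [scale_factor] * ndim_spatial
```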
…ON` and MLIR >= 15.0 (apache#18555)

Fix a runtime error that occurs when TVM built with `USE_MLIR=ON` and
MLIR >= 15.0, as shown below.

```
Traceback (most recent call last):
  File "<unknown>", line 0, in tvm::arith::__TVMFFIStaticInitFunc2()
  File "<unknown>", line 0, in tvm::ffi::reflection::ObjectDef<tvm::arith::PresburgerSetNode>::ObjectDef<>()
  File "<unknown>", line 0, in void tvm::ffi::reflection::ObjectDef<tvm::arith::PresburgerSetNode>::RegisterExtraInfo<>()
  File "build/src/ffi/object.cc", line 500, in TVMFFITypeRegisterMetadata
  File "src/ffi/object.cc", line 240, in void tvm::ffi::TypeTable::RegisterTypeMetadata(int32_t, const TVMFFITypeMetadata *)
RuntimeError: Overriding arith.PresburgerSet, possible causes:
- two ObjectDef<T>() calls for the same T 
- when we forget to assign _type_key to ObjectRef<Y> that inherits from T
- another type with the same key is already registered
Cross check the reflection registration.
libc++abi: terminating due to uncaught exception of type tvm::ffi::Error
```
… to commit c71aefc) (apache#18545)

- Expose max_trials_per_task parameter to static_shape_tuning_pipeline
- Adjust default TOTAL_TRIALS from 8000 to 80 for tutorial demonstration
purposes
- Add documentation for tuning parameters in tutorial, clarifying
relationship between MAX_TRIALS_PER_TASK and TOTAL_TRIALS
The FlashInfer attention plan function introduced a new parameter of
`num_colocated_ctas`. This commit updates the TVM caller side
accordingly.
## Description
This PR fixes a conversion bug that occurs when performing operations on
`bfloat16` tensors.
In short, when applying the `BF16ComputeLegalize` compile pass and
visiting a `BufferStoreNode`, if the stored value's dtype is different
from the buffer's, `DTypeConversion()` should be used instead of a
simple `cast` to apply the appropriate conversion logic.


## Test
I added a test for this situation based on the existing tests.
With the fix, `B[i] = A[i]` turns into `B[i] = bf16tof32(A[i])`
properly, so the test passes.

I'm not really sure whether the structure or name of this added test is
appropriate.
So let me gladly modify it if there is any comment on this.

## Process
### Problem observed
This bug was identified when applying `nn.Linear()` to a `bfloat16`
tensor resulted in excessively large numbers.
While it appears to exist in other operations as well, it's particularly
noticeable when the inner dimension of `MatMul` is a multiple of
`8`(`16` for CUDA and ROCm).
#### Example of problematic code
```python
from ml_dtypes import bfloat16
import numpy as np

import tvm
from tvm.relax.frontend import nn
from tvm.relax.frontend.nn import Tensor, op
from tvm.target import Target

dtype = "bfloat16"  # the repro uses this dtype throughout

n = 10
INNER_DIM = 8 * n   # if INNER_DIM is a multiple of 8

class TestModule(nn.Module):
  def __init__(self):
    self.weight = nn.Parameter((32, INNER_DIM), dtype=dtype)

  def run(self, x: Tensor):
    t = op.matmul(self.weight, x, out_dtype=dtype)
    return t

  def get_default_spec(self):
    mod_spec = {
      "run": {
        "x": nn.spec.Tensor([INNER_DIM, 100], dtype),
        "$": {
          "param_mode": "packed",
          "effect_mode": "none",
        },
      },
    }
    return nn.spec.ModuleSpec.from_raw(mod_spec, self)

def compile_module(...):
  ...

def main():
  target = "metal" # or "cuda", "vulkan", ...
  model = TestModule()
  ex, _ = compile_module(model, target)
  device = tvm.device(target, 0)
  vm = create_vm(ex, device=device)
  frun = vm["run"]
  
  params = []
  param = tvm.runtime.empty(
    (32, INNER_DIM),
    dtype="bfloat16",
    device=device,
  )
  param.copyfrom(np.ones((32, INNER_DIM), dtype=bfloat16))
  params.append(param)
  
  inputs = np.ones((INNER_DIM, 100), dtype=bfloat16)
  arr = frun(inputs, params)
  print(f"{arr=}")  # arr has weird values!
```

In cases where the inner dimension is not a multiple of `8` (or `16`),
the issue was avoided because `PadEinsum` wrapped the expression with
`T.if_then_else()`. `PadEinsum` itself was not the culprit; rather, it
helped identify the issue.

### Problem Identified
I could see the problems were avoided by wrapping an expression with
`T.if_then_else()` or `T.cast()` before applying `BF16ComputeLegalize`
compile pass.

#### Statement with problem
```python
weight_reindex_shared[v0, v1, v2] = weight[v1, v2]
```
#### Statements without problem
```python
# 1) wrapped with T.if_then_else()
weight_reindex_pad_shared[v0, v1, v2] = T.if_then_else(v2 < 511, weight[v1, v2], T.bfloat16(0.0))
# 2) wrapped with T.Cast()
weight_reindex_pad_shared[v0, v1, v2] = T.Cast("float32", weight[v1, v2])
# ...
```

In the `BF16ComputeLegalize` compile pass, if a specific `Expr` (here,
`weight[...]`) is processed through `PromoteToTarget()` (eventually,
`DTypeConversion()`), it is rewritten to the form below (TO-BE), which
applies the proper conversion logic, while the problematic statement
simply applies `T.Cast()` (AS-IS).

#### AS-IS
```python
T.Cast("float32", weight[...])
```
#### TO-BE
```python
T.reinterpret("float32", T.shift_left(T.Cast("uint32", T.reinterpret("uint16", weight[...])), T.uint32(16)))
```

### Fixing the problem
This situation is caused by L332 in the code linked below. Changing this
part to apply `DTypeConversion()` instead of `cast()` resolves the issue.
(In cases where the `Expr` is wrapped with `T.if_then_else()` or similar,
the `Expr` is processed properly by other visit functions through L312 or
L313, so the problem was avoided.)

#### L332
```diff
- value = cast(new_buf->dtype.with_lanes(value.dtype().lanes()), value);
+ value = DTypeConversion(value, new_buf->dtype.with_lanes(value.dtype().lanes()));
```


https://github.com/apache/tvm/blob/26b107fa12672c3b958da222fc87755a69d64c42/src/tir/transforms/unsupported_dtype_legalize.cc#L311-L338
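The TO-BE bit manipulation above can be reproduced in pure Python to see why it is the correct widening (a sketch of the semantics, using `struct` in place of `T.reinterpret`):

```python
import struct

# Sketch of the TO-BE conversion: a bfloat16 value is the top 16 bits of a
# float32, so widening is "shift the 16 payload bits into the high half and
# reinterpret as float32".
def bf16_bits_to_f32(bits16: int) -> float:
    bits32 = (bits16 & 0xFFFF) << 16  # T.shift_left(T.Cast("uint32", ...), 16)
    # T.reinterpret("float32", ...): same bits, viewed as an IEEE float32.
    return struct.unpack("<f", struct.pack("<I", bits32))[0]
```

A plain numeric cast of the raw uint16 bit pattern (the AS-IS behavior) would instead produce values like 16256.0 for the bits of 1.0, which matches the "excessively large numbers" observed above.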
…end (apache#18544)

As per title.

cc @tlopex @guan404ming 

We keep the interface the same as
[`from_fx()`](https://github.com/apache/tvm/blob/ed97234b25a155bc66198ab5cd9e372a4772acec/python/tvm/relax/frontend/torch/fx_translator.py#L1152)
so you can define and pass a custom converter like this:

```python
from tvm.relax.frontend.torch.exported_program_translator import ExportedProgramImporter
def _rms_norm_converter(node: torch.fx.Node, self: ExportedProgramImporter) -> relax.Var:
    x = self.env[node.args[0]]
    torch_dtype = node.args[0].meta["tensor_meta"].dtype
    normalized_shape = node.args[1]
    weight = self.env.get(node.args[2], None) if len(node.args) > 2 else None
    eps = node.args[3] if len(node.args) > 3 else None

    N = len(self.shape_of(x))
    D = len(normalized_shape) if isinstance(normalized_shape, (tuple, list)) else 1
    axes = list(range(N - D, N))

    if weight is None:
        weight = self._convert_torch_tensor_to_relax(
            torch.ones(list(normalized_shape), dtype=torch_dtype)
        )
    eps = torch.finfo(torch_dtype).eps if eps is None else eps

    return self.block_builder.emit(relax.op.nn.rms_norm(x, weight, axes, eps))

mod = from_exported_program(
    exported_program,
    custom_convert_map={"rms_norm.default": _rms_norm_converter},
    run_ep_decomposition=False,
)
```
## How

- Resolve todo by changing from raising error to calling _op_ffi_api.mod
- Add both operators to the parametrized test
- Add edge padding mode
- Add auto pad test
…pache#18554)

## Why

Resolve todo in `fuse_tir.cc` by enhancing unique block name generation
with numeric suffixes.
…ep dim (apache#18583)

## Issue 1: Without Dim
### Summary:
In the `_sum` function (`BaseFXGraphImporter`), after `retrieve_args`,
`args[1]` is `[]` and is still passed into `relax.op.sum`, so the result
is incorrect.
### Steps to Reproduce
- Module
```
class SumWithoutDim(nn.Module):
    def forward(self, x):
        return torch.sum(x)
```
```
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2, 3), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2, 3), dtype="float32") = R.sum(x, axis=[], keepdims=False)
            gv: R.Tuple(R.Tensor((2, 3), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result:

Input: tensor([[1., 1., 1.], [1., 1., 1.]])
Torch output: tensor(6.)
Torch output shape: torch.Size([])
TVM output: [[1. 1. 1.]  [1. 1. 1.]]
TVM output shape: (2, 3)
### Expected
```
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((), dtype="float32") = R.sum(x, axis=None, keepdims=False)
            gv: R.Tuple(R.Tensor((), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result: TVM output: 6.0; TVM output shape: ()

## Issue 2: Keep Dim
### Summary:
In the `_sum` function (`BaseFXGraphImporter`), the `keepdim` value was
previously read only from `node.kwargs` and was not passed into
`relax.op.sum`. Now `keepdim` is also read from `args[2]` and passed
through.
### Steps to Reproduce
- Module
```
class SumKeepDim(nn.Module):
    def forward(self, x):
        return torch.sum(x, dim=1, keepdim=True)
```
```
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2,), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2,), dtype="float32") = R.sum(x, axis=[1], keepdims=False)
            gv: R.Tuple(R.Tensor((2,), dtype="float32")) = (lv,)
            R.output(gv)
        return gv

```
- Result:

Input: tensor([[1., 1., 1.], [1., 1., 1.]])
Torch output: tensor([[3.], [3.]])
Torch output shape: torch.Size([2, 1])
TVM VM output: [3. 3.]
TVM VM output shape: (2,)
### Expected
```
class Module:
    def main(x: R.Tensor((2, 3), dtype="float32")) -> R.Tuple(R.Tensor((2, 1), dtype="float32")):
        with R.dataflow():
            lv: R.Tensor((2, 1), dtype="float32") = R.sum(x, axis=[1], keepdims=True)
            gv: R.Tuple(R.Tensor((2, 1), dtype="float32")) = (lv,)
            R.output(gv)
        return gv
```
- Result: TVM output: [[3.] [3.]]; TVM output shape: (2, 1)
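Both fixes amount to normalizing the importer's arguments before calling `relax.op.sum`; a pure-Python sketch of that normalization (hypothetical helper, not the actual importer code):

```python
# Sketch of the two fixes: read dim/keepdim from positional args or kwargs,
# and map an empty dim list to "reduce over all axes" (axis=None), matching
# torch.sum(x) semantics.
def normalize_sum_args(args, kwargs):
    dim = args[1] if len(args) > 1 else kwargs.get("dim", None)
    keepdim = args[2] if len(args) > 2 else kwargs.get("keepdim", False)
    if dim == [] or dim == ():
        dim = None  # torch.sum(x) with no dim reduces every axis
    return dim, keepdim
```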
The ACOS operator was producing incorrect results for boundary values
due to poor precision of ASIN's Taylor series expansion near x=±1.0.

Root cause:
- ASIN used a 6-term Taylor series that converges slowly near boundaries
- ACOS was implemented as acos(x) = π/2 - asin(x), inheriting ASIN
errors
- At x=1.0, ASIN error of 0.354874 (22.6%) caused ACOS to output
0.354874 instead of 0.0

Solution:
- Modified ASIN to use system library function (asinf) for |x| >= 0.9
- Modified ACOS to use system library function (acosf) for |x| >= 0.9
- For |x| < 0.9, continue using Taylor series (accurate in this range)

This ensures high precision for boundary values while maintaining the
existing behavior for values in the middle range.

Fixes apache#18580
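The hybrid scheme can be sketched in Python, with `math.asin`/`math.acos` standing in for the system `asinf`/`acosf` calls (the Taylor coefficients below are the standard odd-term asin expansion, not necessarily the exact kernel from the PR):

```python
import math

# Sketch: Taylor series in the well-conditioned middle range, library
# function near the boundaries where the series converges slowly.
def hybrid_asin(x: float) -> float:
    if abs(x) >= 0.9:
        return math.asin(x)  # library call near x = ±1.0
    # 6-term Taylor series: x + x^3/6 + 3x^5/40 + ...
    term, total = x, x
    x2 = x * x
    for n in range(1, 6):
        term *= x2 * (2 * n - 1) ** 2 / ((2 * n) * (2 * n + 1))
        total += term
    return total

def hybrid_acos(x: float) -> float:
    if abs(x) >= 0.9:
        return math.acos(x)  # avoids inheriting asin's boundary error
    return math.pi / 2 - hybrid_asin(x)
```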
…avoid undefined symbol on non-QCOM runtimes (apache#18589)

This PR is a re-open of apache#18581 

The previous PR was created while Jenkins CI was experiencing a disk
space issue and the CI job did not trigger.

## PR Description
Recent OpenCL-Headers update
(KhronosGroup/OpenCL-Headers#277
) added QCOM perf-hint definitions (`CL_CONTEXT_PERF_HINT_QCOM`,
`clSetPerfHintQCOM`) to `cl_ext.h`.

These macros are now defined even on platforms whose OpenCL runtimes
(e.g., PoCL, ICD loaders) do not implement the QCOM extension.

TVM previously enabled the perf-hint code path solely based on the
presence of `CL_CONTEXT_PERF_HINT_QCOM`, causing link errors such as:

```
undefined symbol: clSetPerfHintQCOM
```

This PR guards the QCOM perf-hint logic behind `USE_OPENCL_EXTN_QCOM`,
matching the behavior of other QCOM-specific OpenCL paths (e.g.,
`SetNativePtr`).

## Effects
Prevents accidental linking against unsupported QCOM symbols on non-QCOM
runtimes.
Keeps QCOM builds fully functional when `USE_OPENCL_EXTN_QCOM` is
explicitly enabled.
Aligns TVM’s extension handling across OpenCL code paths.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## How

- Implemented InferLayoutRepeat function that:
  - Preserves layout when axis is specified (with axis transformation)
  - Returns 1D layout when axis is not specified (flatten mode)
- Transforms the axis parameter based on layout changes (e.g., NCHW
axis=1 → NHWC axis=3)
## Why

- LOG(WARNING) is the standard and correct approach throughout the TVM
codebase
- The existing pattern is used consistently in all relax ops (see
test_op_manipulate.py, index.cc, etc.)
- Added test coverage for previously untested scenarios
…pache#18591)

Hi @mshr-h @tlopex,

This PR fixes the issue: DeprecationWarning: invalid escape
sequence `\`.
Any suggestions would be appreciated.

### Root Cause

The backslashes (`\`) inside the docstring

<img width="1318" height="455" alt="image"
src="https://github.com/user-attachments/assets/ca05ac7d-c598-4ec8-8bd3-a182994cbf9b"
/>


### Solution
Use a raw docstring(`r"""`)

Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
…ult'] (apache#18574)

## Summary
An error occurs when creating a module from an exported_program that
contains torch.mean without dim.
## Reproduce
- Module:
```
class MeanModule(nn.Module):
    def forward(self, x):
        return torch.mean(x)
...
# Export → Relax
ep = torch_export(m, (x,))
mod = from_exported_program(ep)
```
- Error log:
```
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 13
     11 # Export → Relax
     12 ep = torch_export(m, (x,))
---> 13 mod = from_exported_program(ep)
     15 mod.show()
     17 target = "llvm"

File ~/Programming/tvm/python/tvm/relax/frontend/torch/exported_program_translator.py:1783, in from_exported_program(exported_program, keep_params_as_input, unwrap_unit_return_tuple, no_bind_return_tuple, run_ep_decomposition)
   1780 if run_ep_decomposition:
   1781     exported_program = exported_program.run_decompositions()
-> 1783 return ExportedProgramImporter().from_exported_program(
   1784     exported_program,
   1785     keep_params_as_input,
   1786     unwrap_unit_return_tuple,
   1787     no_bind_return_tuple,
   1788 )

File ~/Programming/tvm/python/tvm/relax/frontend/torch/exported_program_translator.py:1642, in ExportedProgramImporter.from_exported_program(self, exported_program, keep_params_as_input, unwrap_unit_return_tuple, no_bind_return_tuple)
   1639 nodes: List[fx.Node] = exported_program.graph.nodes
   1641 # Find all the missing function types
-> 1642 self._check_unsupported_func_type(nodes)
   1644 with self.block_builder.function(
   1645     name=func_name, params=list(inputs_vars.values()).copy(), attrs=func_attrs
   1646 ):
   1647     output = None

File ~/Programming/tvm/python/tvm/relax/frontend/torch/base_fx_graph_translator.py:182, in BaseFXGraphImporter._check_unsupported_func_type(self, nodes)
    174 def _check_unsupported_func_type(self, nodes: List[fx.Node]):
    175     missing_func_types = list(
    176         {
    177             node.target.__name__
   (...)    180         }
    181     )
--> 182     assert not missing_func_types, f"Unsupported function types {missing_func_types}"

AssertionError: Unsupported function types ['mean.default']
```

## Resolution
- Add "mean.default" to create_convert_map in the
ExportedProgramImporter class.
ksgr5566 and others added 30 commits March 1, 2026 14:44
This PR marks tuning-related metaschedule tests as skipped.
## Summary

- Remove `FewShotTuning` pass from Relax transform (C++ implementation,
Python bindings, and test file)
- The pass is unused in the current codebase and can be safely removed

## Files Changed

- `include/tvm/relax/transform.h` — Remove declaration
- `python/tvm/relax/transform/__init__.py` — Remove from imports
- `python/tvm/relax/transform/transform.py` — Remove Python function
- `src/relax/transform/few_shot_tuning.cc` — Delete (C++ implementation)
- `tests/python/relax/test_transform_few_shot_tuning.py` — Delete (test
file)
## Summary

- Phase out 12 unused `tir::attr` constants (`scan_*`, `channel_*`,
`pipeline_*`, `buffer_bind_scope`, `coproc_*`, `loop_scope`) and remove
their dead code paths
- Move 11 S-TIR-owned attributes (`async_*`, `double_buffer_*`,
`fragment_*`, `pragma_loop_partition_hint`, `reduce_scope`,
`virtual_thread`) from `tir::attr` to `s_tir::attr`
- Alphabetize the remaining 15 `tir::attr` constants
Enable opencl target for gpu tests.
Consolidates all Adreno tests under tests/python/relax/backend/adreno
Changes to CLML corresponding to recent changes on json codegen/runtime.
Docker specification for Adreno (ci_gpu + Android SDK, Gradle).
…er (apache#18865)

## Summary

This PR introduces `AllocBufferNode`/`AllocBuffer` as a single TIR
statement that both allocates memory and declares a buffer into scope.
This replaces the previous pattern of `Allocate(var, dtype, shape, cond,
DeclBuffer(buf, body))` with the simpler `AllocBuffer(buf, body)`.

### Main changes

- **New IR node** `AllocBufferNode` with fields `{buffer, annotations,
body}` — same semantics as `DeclBuffer` but also allocates memory
- **TVMScript**: `T.alloc_buffer(shape, dtype, scope)` now emits
`AllocBuffer` directly (statement-level allocation).
`T.sblock_alloc_buffer(...)` for SBlock-level buffer allocation (full
parameter set)
- **All codegen backends** (C, CUDA, Metal, OpenCL, WebGPU, LLVM, NVPTX,
AMDGPU, SPIR-V) updated to handle `AllocBufferNode`
- **All TIR transforms** (storage_rewrite, flatten_buffer,
vectorize_loop, lower_warp_memory, etc.) updated
- **All S-TIR transforms** (compact_buffer_region, merge_shared_memory,
inject_double_buffer, etc.) updated
- **Removed `AllocateNode`** entirely — `AllocBuffer` is now the sole
allocation primitive
- **Removed `AllocDescriptor`** from merge_shared_memory_allocations —
uses `Buffer` objects directly
- **Added `AllocBuffer::ConstantAllocationSize()`** inline helper method

### Design rationale

The old `Allocate + DeclBuffer` pair was a historical artifact:
`AllocateNode` stored raw fields (`buffer_var`, `dtype`, `extents`,
`condition`) separate from the `Buffer` object, requiring pattern
matching (`IsAllocateDeclBufferPattern`) to reconstruct the buffer
association. `AllocBuffer` unifies this into a single node with a proper
`Buffer` reference, simplifying codegen backends and transform passes.

225 files changed, ~3500 insertions/deletions (net near-zero, mostly
mechanical migration).

## Test plan

- [x] All TIR base tests pass
- [x] All TIR transform tests pass
- [x] TVMScript roundtrip tests pass
- [x] S-TIR transform tests pass
- [x] Codegen tests pass
- [x] All-platform minimal tests pass
- [x] C++ functor tests pass
- [x] Pre-commit clean (clang-format, ruff, etc.)
…ntics (apache#18874)

## Summary

Rename `LetStmtNode`/`LetStmt` to `BindNode`/`Bind` and remove the
`body` field.
The variable defined by `Bind(var, value)` is now visible in all
subsequent
statements within the same enclosing scope, rather than being scoped to
a nested body.

This flattens deeply nested let-chains into sequential
`SeqStmt([Bind(...), Bind(...), ...])`,
making the IR easier to read, transform, and analyze.

## Key Changes

- **New `BindNode`**: `{var, value}` — no body field. Variable scope is
the enclosing
  statement's body (For, IfThenElse, AllocBuffer, etc.)
- **ScopeStack pattern**: Passes that need scope-aware cleanup
(ConvertSSA, CSE,
tir_visitor_with_path) use `ScopeStack` instead of manual save/restore
or RAII wrappers
- **All passes migrated**: 89 files updated across codegen backends, TIR
transforms,
  S-TIR transforms, analyses, TVMScript printer/parser/ir_builder
…8875)

## Summary
- Fix Markdown link syntax for build.sh.
- Correct docker/bash.sh usage examples to use proper image names and
shortcuts.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
)

## Summary
- Batch compute dispatches into a single GPUCommandEncoder, flushing on
sync/readback instead of per-dispatch submit to reduce JS↔GPU transition
overhead during LLM decode
- Cache uniform buffers (FIFO/512), bind groups (FIFO/256), shape
tuples, and pool MAP_READ staging buffers to eliminate redundant GPU
object creation
- Fix padding self-assignment bug in `deviceCopyToGPU`
## Summary

Support dynamic `repeats` for ONNX Tile in the Relax frontend.

## Changes

- add a dynamic Tile conversion path for ONNX when `repeats` is a graph
input
- expose `topi.dyn_tile` to the Python/packed TOPI interface
- add frontend tests for dynamic `repeats`

## Validation

- `tests/python/relax/test_frontend_onnx.py -k test_tile_dynamic_repeats
-q`
- local end-to-end repro matches ONNX Runtime

## Issue
Fixes apache#18752
…8876)

## Summary

- Remove `body` field from `AllocBufferNode` and `DeclBufferNode`,
making them flat statements consistent with `Bind`
- Buffer scope extends to end of enclosing scope via flat `SeqStmt`
semantics
- 60 files changed across core IR, codegen backends, transforms, script
IR builder, and tests

## Test plan

- All existing test suites pass (tir-transform, tir-base, tvmscript,
s_tir, codegen, C++)
apache#18881)

## Summary

- Fix `inject_texture_alloc.cc` to replace `AllocBuffer` with just a
`Bind(nd_mem_alloc_with_scope)`, consistent with `LowerVtcmAlloc`
- Previously kept a redundant `AllocBuffer` alongside the `Bind` in a
`SeqStmt`

Follow-up to apache#18876.
…generated `feature.*` attrs (apache#18883)

Fix apache#18882 

`TargetNode::ToConfig()` exports all target attrs, including derived
`feature.*` fields set by target canonicalizers. However,
`TargetInternal::FromConfig()` rejects these keys during schema
validation because they are not declared in the target kind schema. This
breaks round-tripping exported configs through `Target(config)`.

This PR strips `feature.*` keys from the config before
`ConfigSchema::Resolve`, then merges them back afterward. Canonicalizer
output is authoritative — if the canonicalizer re-emits a `feature.*`
key, it overwrites the preserved value. Unknown non-`feature.*` keys
continue to fail validation as before.

Changes:
- src/target/target.cc: Extract and re-merge `feature.*` keys around
schema resolution in `FromConfig()`
- tests/cpp/target_test.cc: Add tests for single-target round-trip,
nested-host round-trip, and continued rejection of unknown non-feature
keys
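The strip-and-merge dance can be sketched over plain dicts (the `resolve` callable here is a stand-in for `ConfigSchema::Resolve`, not the actual C++ API):

```python
# Sketch of the round-trip fix: strip derived feature.* keys before schema
# resolution, then merge them back, letting canonicalizer output win.
def round_trip_config(config: dict, resolve) -> dict:
    features = {k: v for k, v in config.items() if k.startswith("feature.")}
    stripped = {k: v for k, v in config.items() if not k.startswith("feature.")}
    resolved = resolve(stripped)  # still rejects unknown non-feature keys
    # Preserved features first, so any re-emitted feature.* key overwrites.
    return {**features, **resolved}
```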

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ache#18887)

## Summary

- Fix `gpu_2d_continuous_cumsum` using `T.sblock_alloc_buffer` for `Tmp`
buffer that is used across multiple kernel launches (not within a single
sblock). Changed to `T.alloc_buffer`.
- `T.sblock_alloc_buffer` places the buffer in SBlock metadata, making
subsequent references to buffer dimensions (used by `ceil_log2`)
undefined after the AllocBuffer/DeclBuffer refactor.

Fixes apache#18885
## Summary

This PR do a rebuild of TIR Common Subexpression Elimination (CSE) using
a two-phase architecture:

- **Phase 1 — CSEPlanner**: Read-only visitor that builds a scope tree
and expression DAG. Computes a plan (InsertBeforeTable + ExprRemapTable)
in a single pass using shallower-first processing with repr propagation
— no cascade loop needed.
- **Phase 2 — CSERewriter**: Mechanical mutator that inserts
`Bind(cse_var, expr)` statements and substitutes expressions per the
plan.

Key improvements over the old implementation:
- **Simpler architecture**: Two clean classes (planner + rewriter)
instead of interleaved analysis/mutation
- **No cascade loop**: Shallower-first processing with repr propagation
resolves all CSE opportunities in one plan + one rewrite
- **Incremental DAG construction**: Expression depth, children, and
consumed counts computed during bottom-up scan — no separate traversals
- **No single-use bindings**: Consumed count tracking avoids introducing
bindings that would only be used once
- **Unified insertion via VisitStmt**: SeqStmt flattening handles all
insertion contexts uniformly

Other changes:
- Rename `CommonSubexprElimTIR` → `CommonSubexprElim`, remove
`enable_cse_tir` and `identify_equiv_terms` params
- Move old CSE tools (used by cache_index) to
`cache_index_helpers.{cc,h}`
- Remove unused `arith.detect_common_subexpr` API
- Add `T.bind` as lowercase alias for `T.Bind`
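The two-phase idea can be illustrated with a toy CSE over opaque expression strings (a deliberately simplified sketch; the real planner builds a scope tree and expression DAG):

```python
from collections import Counter

# Phase 1 (planner, read-only): count how often each subexpression is
# consumed; only multi-use expressions earn a binding, which is the
# "no single-use bindings" property described above.
def plan(exprs):
    counts = Counter(exprs)
    return {e: f"cse_{i}" for i, (e, n) in enumerate(counts.items()) if n > 1}

# Phase 2 (rewriter, mechanical): insert Bind statements and substitute
# expressions per the plan.
def rewrite(exprs, remap):
    binds = [(var, e) for e, var in remap.items()]
    body = [remap.get(e, e) for e in exprs]
    return binds, body
```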
…on (apache#18889)

## Summary
- Rename `T.Bind` (capitalized) to `T.bind` (lowercase) to match
TVMScript naming convention: statement builders use lowercase
(`T.evaluate`, `T.buffer_store`, `T.bind`), expression constructors use
capitalized (`T.Cast`, `T.Select`, `T.Let`)
- Keep `Bind = bind` backward-compat alias
- Update parser, printer references, and all test files

## Test plan
- [x] tvmscript tests (771 passed)
- [x] tir-transform tests (346 passed)
- [x] tir-base tests (224 passed)
- [x] pre-commit lint passes
## Summary

Reject non-floating inputs for trig-style TIR unary ops.

## Changes

- reject non-floating inputs for trig-style TIR unary ops such as `tan`,
`sin`, and `cos`
- add the same dtype check in the Python TIR wrapper so
`topi.tan(int32)` fails early with a clear `TypeError`
- add regression tests for `tvm.tir.tan(int32)` and `topi.tan(int32)`

## Validation

- `tests/python/tir-base/test_tir_constructor.py -k
'math_unary_constructor_requires_float_dtype or
topi_tan_requires_float_dtype' -q`
- local repro for the original `where -> tan(int32)` case now fails
early with `TypeError`
- verified `topi.tan(float32)` still builds with `target="llvm"`

## Issue

Fixes apache#18769
## Summary

Reject non-float inputs for inverse trigonometric and hyperbolic unary
ops in TOPI.

## Changes

- add a shared floating-point dtype check for inverse unary math ops in
TOPI
- apply the check to `topi.acos`, `topi.acosh`, `topi.asin`,
`topi.asinh`, and `topi.atanh`
- add TE tests covering integer-input rejection for these ops
- add regression tests covering successful LLVM build for both `float32`
and `bfloat16`

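The shared-check pattern can be sketched as a small wrapper applied uniformly to all five ops (the names below are illustrative, not TOPI's internals; `math` functions stand in for the TE compute bodies):

```python
import math

def _with_float_check(op_name, fn):
    # Wrap an elementwise op with the shared floating-point dtype guard
    def checked(value, dtype):
        if not dtype.startswith(("float", "bfloat")):
            raise TypeError(
                f"{op_name} only accepts floating-point dtypes, got {dtype}"
            )
        return fn(value)
    return checked

INVERSE_UNARY_OPS = {
    name: _with_float_check(name, fn)
    for name, fn in [("acos", math.acos), ("acosh", math.acosh),
                     ("asin", math.asin), ("asinh", math.asinh),
                     ("atanh", math.atanh)]
}

print(round(INVERSE_UNARY_OPS["asin"](0.5, "float32"), 4))  # 0.5236
```

Registering one guard over all five ops keeps the error message consistent and avoids duplicating the dtype check in each wrapper.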
## Validation

- `tests/python/te/test_te_create_primfunc.py -k 'topi_float_unary'`
- local repro now fails early with a clear `TypeError` for integer
inputs
- local regression check confirms the valid `float32` and `bfloat16`
paths still compile with LLVM

## Issue

Fixes apache#18729
## Summary
- Remove `Bind = bind` backward-compat alias from `ir.py`
- Remove `"Bind"` from `__all__` exports
- Follows apache#18889 which renamed `T.Bind` → `T.bind`

## Test plan
- [x] tvmscript roundtrip/printer/ir_builder tests pass (232 passed)
- [x] pre-commit lint passes
…pache#18892)

Add a `^` anchor to the version regex so it matches only the top-level
`version = "..."` line instead of all three occurrences, which caused
`hit_count == 3` and a `RuntimeError` in `sync_version`.
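A minimal repro of the anchoring fix (the snippet below is illustrative, not the actual file `sync_version` scans):

```python
import re

# Three lines whose text contains `version = "..."`; only the first is
# the top-level field that should be rewritten.
text = (
    'version = "0.22.0"\n'
    'llvm-version = "18"\n'
    'min_llvm_version = "15"\n'
)

unanchored = re.findall(r'version = "([^"]+)"', text)
anchored = re.findall(r'^version = "([^"]+)"', text, flags=re.MULTILINE)

print(len(unanchored))  # 3 -> would trip the hit_count check
print(anchored)         # ['0.22.0']
```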
…rating processor descriptions (apache#18884)

## Summary

Fix false rejection of `apple-m1`, `apple-m2`, and `apple-m3` as LLVM
CPU names when building TVM with LLVM 22+.

## Behavior

After following the [installation from source
instructions](https://tvm.apache.org/docs/install/from_source.html) and
building against LLVM 22, every `import tvm` produces spurious error
messages:

```
Error: Using LLVM 22.1.0 with `-mcpu=apple-m1` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
Error: Using LLVM 22.1.0 with `-mcpu=apple-m2` is not valid in `-mtriple=arm64-apple-macos`, using default `-mcpu=generic`
```

These are triggered by the Metal target tag registrations in
`python/tvm/target/tag_registry/metal.py`, which use `apple-m1` and
`apple-m2` as the host `-mcpu`. The CPUs are silently downgraded to
`generic`.

## Root cause

LLVM 22 reorganized its AArch64 processor table. `apple-m1` through
`apple-m3` are now CPU **aliases** — fully valid and accepted by
`createTargetMachine` and `isCPUStringValid()`, but no longer returned
by `MCSubtargetInfo::getAllProcessorDescriptions()`.

TVM's `LLVMTargetInfo` constructor validates `-mcpu` by enumerating
`getAllProcessorDescriptions()` and checking membership, so it misses
alias-only names.

## Fix

Replace the enumeration-based check with a new `IsValidCPU()` method
that uses `MCSubtargetInfo::isCPUStringValid()`, which correctly handles
both primary names and aliases. This API has been available since at
least LLVM 7, well before TVM's minimum supported version.

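The distinction can be illustrated with a toy model (plain Python, not the LLVM API; CPU sets are made up): membership in an enumeration of primary names misses alias-only CPUs, while a direct validity query accepts both.

```python
# Toy model: `apple-m1` stands for an alias-only name in the LLVM 22 table
PRIMARY_CPUS = {"generic", "apple-a14", "apple-a15"}
CPU_ALIASES = {"apple-m1", "apple-m2", "apple-m3"}

def enumeration_based_check(cpu):
    # old behavior: membership in a getAllProcessorDescriptions()-style list
    return cpu in PRIMARY_CPUS

def is_cpu_string_valid(cpu):
    # new behavior: accept primary names and aliases alike
    return cpu in PRIMARY_CPUS or cpu in CPU_ALIASES

print(enumeration_based_check("apple-m1"))  # False -> spurious downgrade
print(is_cpu_string_valid("apple-m1"))      # True
```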
## Validation

- Built and tested on macOS (Apple Silicon) with LLVM 22.1.0
- `python -c "import tvm; print(tvm.__file__)"` produces clean output
with no error messages

---------

Co-authored-by: Gabriel Guralnick <gabriel@imbue.com>
…xportedProgram importer (apache#18903)

Fixes `TypeError: 'NoneType' object is not iterable` when importing
models with dynamic batch dimensions that contain identity slices (e.g.,
`x[:, :H, :W, :]` on a dynamic batch dim).

**Root cause:** `aten.slice.Tensor(x, 0, 0, INT_MAX)` (an identity slice
on a dynamic dim `s`) produces a result with shape `[T.min(INT_MAX, s),
...]` instead of `[s, ...]`. When this is combined with the original
tensor via `add`, TVM cannot unify the shapes, resulting in
`struct_info.shape = None`. Any subsequent `view`/`reshape` then crashes
calling `list(None)`.

This pattern appears in models like `swin_t`, where shifted window
attention crops padded features with `x[:, :H, :W, :].contiguous()`.

**Changes:**
- `exported_program_translator.py`: Skip `strided_slice` for identity
slices (`start=0, end>=INT_MAX, step=1`) and return the input tensor
directly.
- `base_fx_graph_translator.py`: Guard the identity-reshape check in
`_reshape` against `None` shape.
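The skip condition can be sketched as follows (a hypothetical helper, assuming PyTorch's convention of encoding open-ended slice ends as a very large integer):

```python
INT_MAX = 2**63 - 1  # sys.maxsize-style sentinel used for open-ended slices

def is_identity_slice(start, end, step):
    # aten.slice.Tensor(x, dim, 0, INT_MAX, 1) leaves the tensor unchanged,
    # so the importer can return the input instead of emitting strided_slice
    return (start in (None, 0)) and (end is not None and end >= INT_MAX) \
        and (step in (None, 1))

print(is_identity_slice(0, INT_MAX, 1))   # True  -> skip strided_slice
print(is_identity_slice(0, 7, 1))         # False -> real crop, keep the op
```

Returning the input tensor directly for such slices keeps the dynamic extent `s` intact instead of producing a `T.min(INT_MAX, s)` shape that later fails to unify.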