
[AUTOGENERATED] develop_IFU_20260422 #3174

Open
pragupta wants to merge 1918 commits into develop from develop_IFU_20260422

Conversation

@pragupta (Collaborator)

rocm_base: 293ee53

shino16 and others added 30 commits April 16, 2026 22:58
…179833)

Fixes pytorch#178871

When a `scatter_reduce`'s output has size 1 in every dimension, `_fixed_indexer` skips all size-1 dims and the store index collapses to a constant integer. The [constant-index path](https://github.com/pytorch/pytorch/blob/cd2590172dbda49308d77a5cc17a30b87c97e42b/torch/_inductor/codegen/triton.py#L3352-L3378) in `TritonKernel.indexing()` skips mask computation for non-fixed-config kernels, so OOB threads accumulate stale values. See `mask_vars = OrderedSet()` below:

https://github.com/pytorch/pytorch/blob/cd2590172dbda49308d77a5cc17a30b87c97e42b/torch/_inductor/codegen/triton.py#L3371-L3378

This PR adds a `force_mask` parameter to `indexing()` that prevents the constant-index optimization from omitting the mask. `store()` passes `force_mask=True` for `atomic_add`.
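
A hedged repro sketch of the failure mode described above (shapes are assumed; the real test is `test_scatter_reduce_fused_broadcast_non_power_of_2` in `CommonTemplate`):

```python
# Hedged repro sketch, not the actual test: scatter_reduce into an output
# whose every dim has size 1, so the store index folds to a constant and,
# before this fix, OOB threads went unmasked on the atomic_add path.
import torch

def fn(src, index):
    out = torch.zeros(1, device=src.device)
    return out.scatter_reduce(0, index, src, reduce="sum")

device = "cuda" if torch.cuda.is_available() else "cpu"
for n in (3, 5, 7, 9, 17, 33, 48):  # non-power-of-2 sizes from the test plan
    src = torch.randn(n, device=device)
    index = torch.zeros(n, dtype=torch.int64, device=device)
    torch.testing.assert_close(torch.compile(fn)(src, index), fn(src, index))
```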

## Test plan

New `test_scatter_reduce_fused_broadcast_non_power_of_2` in `CommonTemplate`, which is essentially a copy of the repro, verifies scatter_reduce + scalar broadcast fusion for several non-power-of-2 input sizes (3, 5, 7, 9, 17, 33, 48).

Pull Request resolved: pytorch#179833
Approved by: https://github.com/jansel
Replace the runtime loop over mutated_inp_runtime_indices with a codegen'd
straight-line function that resolves set_/as_strided_/copy_/detach().copy_
branches at compile time based on each input's mutation metadata.

Updated tests to use requires_grad inputs since inference mutations are
handled inside the graph (keep_input_mutations) and don't use the runtime
epilogue. Removed metadata-only mutation test as transpose_ graph-breaks
through dynamo.

Single data mutation (`x.mul_(2)`):

    def _apply_mutations(orig_inputs, updated_inputs):
        orig_inputs[0].copy_(updated_inputs[0])

Multiple data mutations (`a.mul_(2)`, `c.add_(1)`):

    def _apply_mutations(orig_inputs, updated_inputs):
        orig_inputs[0].copy_(updated_inputs[0])
        orig_inputs[1].copy_(updated_inputs[1])

Leaf mutation under no_grad (`x.detach().mul_(2)`):

    def _apply_mutations(orig_inputs, updated_inputs):
        if orig_inputs[0].requires_grad: orig_inputs[0].detach().copy_(updated_inputs[0])
        else: orig_inputs[0].copy_(updated_inputs[0])

Mutation step in isolation (us/call):

| Case | Before (loop) | After (codegen) | Speedup |
|---|---|---|---|
| 1 mutation / 2 inputs | 11.34 us | 10.96 us | 1.03x |
| 2 mutations / 4 inputs | 22.02 us | 21.95 us | 1.00x |
| 5 mutations / 10 inputs | 54.77 us | 54.12 us | 1.01x |

Performance is dominated by the copy_ calls themselves, so the Python loop
overhead removal is negligible in absolute terms. The primary benefit is
resolving the set_/as_strided_/copy_/detach().copy_ branch at compile time
rather than checking metadata flags per input at runtime.
Pull Request resolved: pytorch#179600
Approved by: https://github.com/Lucaskabela
…d Typing (pytorch#180359)

## Summary
- define `TraceableWrapperSubclass` in `torch/utils/_python_dispatch.py` as the canonical traceable wrapper subclass protocol, including the flatten/unflatten invariants and supported `__tensor_unflatten__` forms
- thread that protocol through the fake, AOT, and proxy helper boundaries so the contract is referenced directly instead of living in casts and comments
- add a focused runtime test that covers both `@staticmethod` and `@classmethod` `__tensor_unflatten__` implementations

## Root cause
The traceable wrapper subclass contract was only implied by repeated `hasattr(..., "__tensor_flatten__")` checks and an internal structural type with inaccurate method signatures. That left the canonical meaning of the protocol split across helpers, comments, and downstream assumptions.

## Proposed fix
Introduce an explicit `TraceableWrapperSubclass` protocol in `torch/utils/_python_dispatch.py`, move the contract documentation there, make `is_traceable_wrapper_subclass()` reference that canonical protocol, and update the affected fake/proxy/AOT helpers to use the protocol type directly.

## Why this is the right long term fix
This keeps the runtime behavior unchanged while giving PT2 wrapper-subclass support one documented source of truth. Future call sites can reference the same protocol instead of re-expressing the contract, which reduces drift in both signatures and invariants.
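
A hedged approximation of the protocol shape (the canonical definition is the one added in `torch/utils/_python_dispatch.py`; the signatures below follow the documented `__tensor_flatten__`/`__tensor_unflatten__` contract):

```python
# Sketch only: approximates the protocol described above, not the exact
# definition from torch/utils/_python_dispatch.py.
from typing import Any, Protocol, runtime_checkable

import torch

@runtime_checkable
class TraceableWrapperSubclass(Protocol):
    # Returns (names of inner-tensor attributes, extra context metadata).
    def __tensor_flatten__(self) -> tuple[list[str], Any]: ...

    # Rebuilds the subclass from inner tensors + context. Real implementations
    # may declare this as either a @staticmethod or a @classmethod.
    @staticmethod
    def __tensor_unflatten__(
        inner_tensors: dict[str, torch.Tensor],
        ctx: Any,
        outer_size: Any,
        outer_stride: Any,
    ) -> torch.Tensor: ...
```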

## Testing
- `python3 -m compileall torch/utils/_python_dispatch.py torch/_subclasses/fake_tensor.py torch/_functorch/_aot_autograd/subclass_utils.py torch/fx/experimental/proxy_tensor.py test/test_python_dispatch.py`
- `PYTHONPATH=/tmp/repos/pytorch/pytorch python3 test/test_python_dispatch.py TestPythonDispatch.test_traceable_wrapper_subclass_protocol_runtime_check TestPythonDispatch.test_make_fx_with_subclass TestPythonDispatch.test_make_wrapper_subclass_propagates_metadata`
- `PYTHONPATH=/tmp/repos/pytorch/pytorch python3 test/dynamo/test_subclasses.py SubclassTests.test_deferred_init_subclass_init_not_traced`

Drafted via Codex, published after manual review by @Lucaskabela
Pull Request resolved: pytorch#180359
Approved by: https://github.com/Skylion007

Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
…ch#180607)

Drop cu128/cu129 and add cu132 so the vLLM wheel build matrix matches the CUDA versions PyTorch nightly publishes (cu130, cu132). cu126 is intentionally omitted since vLLM does not release a cu126 wheel upstream. Also enables aarch64 cu130/cu132, removing the prior TODO.

### Testing

https://github.com/pytorch/pytorch/actions/runs/24533482175
Pull Request resolved: pytorch#180607
Approved by: https://github.com/atalman
)

Fix isinstance(x, OpaqueBase) to cover all opaque types

isinstance(x, OpaqueBase) is unreliable for checking if something is an
opaque object: it misses value-type opaques (e.g. Enum) and reference
types that use metaclass=OpaqueBaseMeta without inheriting OpaqueBase
(which is all that registration requires).

Fix this by making OpaqueBaseMeta.__instancecheck__ delegate to
is_opaque_value() when cls is OpaqueBase, so isinstance(x, OpaqueBase)
and case OpaqueBase() now correctly cover all registered opaque types.
Also update is_opaque_value() to see through FakeScriptObject wrappers,
matching the existing __instancecheck__ behavior for concrete subclasses.
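
A minimal self-contained sketch of the delegation described above; `_OPAQUE_TYPES` and `is_opaque_value` are simplified stand-ins for the real registry helpers:

```python
# Simplified stand-ins: the real is_opaque_value() also sees through
# FakeScriptObject wrappers and handles value-type opaques such as Enums.
_OPAQUE_TYPES: set[type] = set()

def is_opaque_value(obj) -> bool:
    return type(obj) in _OPAQUE_TYPES

class OpaqueBaseMeta(type):
    def __instancecheck__(cls, obj):
        # Delegate only for the base class, so isinstance(x, OpaqueBase)
        # covers every registered opaque type, not just subclasses.
        if cls is OpaqueBase:
            return is_opaque_value(obj)
        return super().__instancecheck__(obj)

class OpaqueBase(metaclass=OpaqueBaseMeta):
    pass

class MyHandle(metaclass=OpaqueBaseMeta):  # registered without inheriting
    pass

_OPAQUE_TYPES.add(MyHandle)
assert isinstance(MyHandle(), OpaqueBase)
```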

Authored with Claude.

Pull Request resolved: pytorch#180530
Approved by: https://github.com/Lucaskabela
# Motivation
Support `torch.xpu.device_count` in the alternative `multiprocessing poison fork` scenario by leveraging the [pyzes](https://pypi.org/project/pyzes/0.1.0/) package.

# Design
- If pyzes is not installed, `torch.xpu.device_count` will not support the `multiprocessing poison fork` scenario.
- Respect `ZE_AFFINITY_MASK`. However, if an L0 `COMPOSITE`-style hierarchy mask is detected (e.g. `ZE_AFFINITY_MASK=0.0, 0.1`), we fall back to `c10::xpu::device_count` (the SYCL implementation). In this case, the `multiprocessing poison fork` scenario will not be supported.
- Align the behavior with `c10::xpu::device_count`: prefer reporting dGPUs, and only report iGPUs when no visible dGPU is found.
- Ensure both iGPU and dGPU handling remains consistent with `c10::xpu::device_count`.

# Tests
Add tests to validate that we correctly handle `ZE_AFFINITY_MASK`.
Pull Request resolved: pytorch#178496
Approved by: https://github.com/gujinghui, https://github.com/EikanWang
…#179600)"

This reverts commit 45af0d6.

Reverted pytorch#179600 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#179600 (comment)))
## Summary

Add gfx1103 to `PYTORCH_ROCM_ARCH` in the manywheel build script.

gfx1103 is the RDNA3 iGPU in AMD Phoenix/HawkPoint APUs (Ryzen 7040/8040 series). These are widely deployed laptop/desktop APUs. The ROCm compiler already supports gfx1103, and runtime support has already been merged:

   - hipBLASLt: pytorch#172180
   - aotriton: pytorch#168351

However, gfx1103 is missing from the wheel build arch list, so no native kernels are included in the published wheels. Users must set `HSA_OVERRIDE_GFX_VERSION=11.0.0` (or `11.0.2`), which causes GPU page faults and hard hangs due to ISA differences:

```
amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
amdgpu:   in page starting at address 0x0000000000000000 from client 10
amdgpu:   Faulty UTCL2 client ID: CPC (0x5)
amdgpu:   WALKER_ERROR: 0x1
amdgpu:   MAPPING_ERROR: 0x1
```

This follows the same pattern as pytorch#147761, which added gfx1102.

## Test plan

- [x] Verify gfx1103 wheel builds successfully in ROCm CI
- [x] Verified locally that gfx1103 is accepted by the ROCm compiler (`rocminfo` reports gfx1103, `/dev/kfd` present)
- [x] Confirmed runtime support already merged (hipBLASLt, aotriton)

Pull Request resolved: pytorch#179653
Approved by: https://github.com/jeffdaily
…ytorch#179782)

On ROCm, require MX scaled_mm swizzle inputs to provide one value for both A and B, and enforce that both are NO_SWIZZLE. Update test_passed_swizzle_arrays to use ROCm-specific expectations and add coverage for the explicit NO_SWIZZLE value check.
For nvfp4, the swizzle check is skipped, but the call eventually fails with the error "NVFP4 scaling not supported on ROCM".
Miscellaneous fix: correct the swizzle validation error messages to use the right singular/plural value wording.

Fixes pytorch#180073

Pull Request resolved: pytorch#179782
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
Move the Node class definition from function.h into a new node.h
header. function.h becomes a thin wrapper that includes node.h and
provides the free functions (create_gradient_edge, collect_next_edges,
etc.) that depend on variable.h.

This separation is needed to break the include cycle between
function.h and variable.h: function.h includes variable.h, and
variable.h needs the complete Node type for upcoming intrusive_ptr
conversion. node.h avoids this cycle by not including variable.h or
graph_task.h.

Authored with Claude.
Pull Request resolved: pytorch#179765
Approved by: https://github.com/albanD, https://github.com/soulitzer
ghstack dependencies: pytorch#179764
…piler arguments with spaces and add perf flag for xpu sycl-tla. (pytorch#178130)

Pull Request resolved: pytorch#178130
Approved by: https://github.com/Skylion007
Refactors `BuiltinVariable.call_getattr` into a dedicated `GetAttrBuiltinVariable` variable tracker.

Here's a concise summary of the full `GetAttrBuiltinVariable` change:

## Methods

**`call_function`**
The entry point. It does two things before delegating to `_call_getattr`:

1. **`LazyVariableTracker` realization**: If `obj` is a lazy wrapper, `_call_getattr`'s early `has_pending_mutation_of_attr(obj, name)` check would use the wrapper's identity as the lookup key, not the realized VT's. Since `side_effects.store_attr_mutations` is keyed by Python object identity (`id()`), the lookup would miss — even if a mutation was recorded — causing the wrong value to be returned (concretely, a grad assignment would be lost, returning `None`). Realizing the lazy VT first gives the correct identity. This mirrors the old behavior in `BuiltinVariable._make_handler`, which realized lazy args before dispatching.

2. **Constant-fold try/except**: If `_call_getattr` raises `Unsupported` and all arguments are Python constants, we evaluate `getattr()` directly and return a constant. This avoids an unnecessary graph break for things like `getattr(SomeClass, "__name__")`. This replicates what `_make_handler`'s `call_self_handler` used to do in `BuiltinVariable`.
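
An illustrative sketch of the entry-point shape described in (1) and (2); this is not the real dynamo code, and `_call_getattr`'s exact signature is assumed:

```python
# Sketch only: realize-then-constant-fold shape of call_function.
from torch._dynamo.exc import Unsupported
from torch._dynamo.variables import ConstantVariable

def call_function(self, tx, args, kwargs):
    # (1) Realize lazy VTs so side_effects lookups key on the realized identity.
    args = [arg.realize() for arg in args]
    try:
        return self._call_getattr(tx, *args)
    except Unsupported:
        # (2) If every arg is a Python constant, fold getattr() directly
        # instead of graph-breaking, e.g. getattr(SomeClass, "__name__").
        if all(arg.is_python_constant() for arg in args):
            return ConstantVariable.create(
                getattr(*(arg.as_python_constant() for arg in args))
            )
        raise
```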

**`_call_getattr`**
Contains the actual logic, moved from `BuiltinVariable.call_getattr`. There are some changes as well:

**1. `hasattr` call changed from `self.call_hasattr` to `obj.call_obj_hasattr`**

The `default` argument handling previously called `self.call_hasattr(tx, obj, name_var)` (a method on `BuiltinVariable`). In the new code it calls `obj.call_obj_hasattr(tx, name)`. This is functionally equivalent — `call_hasattr` just dispatched to `obj.call_obj_hasattr` anyway — but `call_hasattr` only existed on `BuiltinVariable`, which `GetAttrBuiltinVariable` no longer inherits from.

**2. `NamedTupleVariable` added to the dispatch list**

The old code's `isinstance` check for the "known types" branch included:
```
TensorVariable, ConstantVariable, DefaultDictVariable, DistributedVariable,
UserDefinedClassVariable, UserDefinedObjectVariable
```
The new code adds `NamedTupleVariable` to that list. This means named tuple attribute access now goes through `var_getattr` explicitly rather than falling through to the `else` catch-all. In practice named tuples have a source and `var_getattr` handles them correctly, so this is a correctness improvement.

**3. `cmp_name_to_op_mapping` branch removed for `TorchInGraphFunctionVariable`**

The old code had a special case for `TorchInGraphFunctionVariable` where, if `name in cmp_name_to_op_mapping`, it returned a `GetAttrVariable`. The new code collapses this into the single `else: return GetAttrVariable(obj, name, source=source)` fallback. Looking at the git history, this branch was already dead code by the time of the rebase (commit `84be734` removed it upstream), so the new code is simply consistent with that.

Pull Request resolved: pytorch#179033
Approved by: https://github.com/anijain2305
Includes the following commits:

- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
Pull Request resolved: pytorch#180606
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007
…ch#180277)

Fixes pytorch#180011
Fixes pytorch#180012
Fixes pytorch#180013
Fixes pytorch#180014
Fixes pytorch#180015
Fixes pytorch#180016
Fixes pytorch#180021
Fixes pytorch#179952
Fixes pytorch#180022
Fixes pytorch#180025
Fixes pytorch#180027
Fixes pytorch#180028
Fixes pytorch#180029
Fixes pytorch#180549

This appears to have regressed in pytorch#177715, where combo-kernel autotuning started materializing alternate per-subkernel configs. This change keeps Triton HIP compile options such as `waves_per_eu`, `matrix_instr_nonkdim`, and `kpack` kernel-wide when combo kernels rewrite per-subkernel configs, avoiding invalid names like `waves_per_eu_0`. It also routes both baseline combo configs and sequential combo autotune trials through the same kwarg rewrite helper and adds a regression test covering combo-kernel kwarg rewriting for AMD special config args.

Made with [Cursor](https://cursor.com)

Pull Request resolved: pytorch#180277
Approved by: https://github.com/karthickai, https://github.com/jeffdaily
…rch#180497)

Fixes pytorch#180396

Issue: When torch.compile(fn, mode="reduce-overhead") captures a CUDA graph, custom stream ops (event record/wait) resolve the "current stream" from the external object registry. But the registry was populated during the dynamo bytecode prologue with the trace-time default stream, not the cudagraph capture stream. This caused cudaErrorStreamCaptureUnsupported.

Fix: Register the current stream at index 0 in the external object registry at trace time. The inductor wrapper emits set_external_object_by_index(0, torch.cuda.current_stream()) at runtime, so during cudagraph capture, index 0 resolves to the actual capture stream instead of the stale default stream.

Real code changes (~30 lines across 4 files):

1. graph_bytecode_inputs.py (+9): Added a CURRENT_STREAM_INDEX = 0 constant and set_external_object_by_index(), which updates an entry at runtime and keeps the object alive via the existing keep_alive list.
2. variables/streams.py (+~20): SymbolicStreamState.__init__ now registers the current stream at index 0 when the registry is fresh. Simplified _get_stream_arg to just return user_object_index (no conditional logic). Added back cur_stream_id(), which output_graph.py needs.
3. variables/builder.py (+4/-4): When wrapping a stream with CurrentStreamSource, use CURRENT_STREAM_INDEX instead of allocating a new index.
4. codegen/wrapper.py (+7): Emit set_external_object_by_index(0, torch.cuda.current_stream()) at the top of the wrapper so custom ops see the actual runtime stream (the capture stream during cudagraph recording).
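
A hedged sketch of the runtime contract in changes (1) and (4); the dict and keep-alive details are simplified stand-ins for graph_bytecode_inputs.py:

```python
# Simplified stand-ins for the registry helpers described above.
import torch

CURRENT_STREAM_INDEX = 0
_external_objects: dict[int, object] = {}
_keep_alive: list[object] = []

def set_external_object_by_index(index: int, obj: object) -> None:
    _external_objects[index] = obj
    _keep_alive.append(obj)  # mirror the real helper's lifetime guarantee

def wrapper_prologue() -> None:
    # What the inductor wrapper emits first: refresh index 0 so that, during
    # cudagraph capture, stream lookups resolve the capture stream rather
    # than the stale trace-time default stream.
    set_external_object_by_index(CURRENT_STREAM_INDEX, torch.cuda.current_stream())
```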

Pull Request resolved: pytorch#180497
Approved by: https://github.com/Lucaskabela, https://github.com/eellison
# Motivation
In my understanding, PyTorch XPU doesn't support the oneDNN block format, so this code path should never be reached.

Pull Request resolved: pytorch#166861
Approved by: https://github.com/EikanWang
This reverts commit 26ab646.

Reverted pytorch#179653 on behalf of https://github.com/huydhn due to Sorry for reverting your change, I need to revert this temporarily to avoid building new Docker images ([comment](pytorch#179653 (comment)))
…)"

This reverts commit 24c273c.

Reverted pytorch#180575 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#180575 (comment)))
Reland of pytorch#179600. The commit message is identical to the first landing of this change above (codegen'd straight-line mutation epilogue replacing the runtime loop over mutated_inp_runtime_indices).
Pull Request resolved: pytorch#179600
Approved by: https://github.com/Lucaskabela
…#179600)"

This reverts commit be5f4f9.

Reverted pytorch#179600 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#179600 (comment)))
…am (pytorch#180497)"

This reverts commit 9f34b01.

Reverted pytorch#180497 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#180497 (comment)))
This change was separated from pytorch#170609, and adds a context flag to select the backend kernel for depthwise convolution. This allows easy benchmarking of cuDNN performance relative to fallback performance for heuristic tuning (and could also be used for debugging), while maintaining default behavior for standard use.

There are three options: `["auto", "cudnn", "native"]`. The `"auto"` option chooses a kernel based on the existing heuristic function. The `"cudnn"` and `"native"` flags skip the heuristic, instead always dispatching to a cuDNN kernel or a fallback kernel, respectively.
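
A hypothetical usage sketch: `depthwise_backend` is a stand-in attribute name, since the real knob is defined in pytorch#176500 and may be spelled differently.

```python
# Hypothetical benchmarking loop; the flag name below is a stand-in, not
# the actual API from pytorch#176500.
import torch

conv = torch.nn.Conv2d(64, 64, 3, padding=1, groups=64, device="cuda")  # depthwise
x = torch.randn(8, 64, 56, 56, device="cuda")

for backend in ("auto", "cudnn", "native"):
    torch.backends.cudnn.depthwise_backend = backend  # stand-in attribute
    torch.cuda.synchronize()
    # time conv(x) here to compare cuDNN vs. the native fallback
    y = conv(x)
```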

Pull Request resolved: pytorch#176500
Approved by: https://github.com/malfet, https://github.com/eqy
…per (pytorch#180586)

Mainly to avoid potentially surprising `NoValidChoicesError` on Blackwell.

When a program uses `torch.compile` with `F.scaled_mm` and/or `F.scaled_grouped_mm` with `use_fast_accum=True` and has been running on Hopper GPUs, running that same program on Blackwell would unexpectedly raise `NoValidChoicesError`.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Pull Request resolved: pytorch#180586
Approved by: https://github.com/mlazos
Second reland of pytorch#179600; the commit message is identical to the first landing above.
Pull Request resolved: pytorch#179600
Approved by: https://github.com/Lucaskabela
This PR aims to fix the issue from pytorch#132395 by implementing a new `MKLGeneratorImpl` that stores a consistent, global `vslStream` for use in random numbers generation. This path was previously disabled due to a problem of repeating variates, caused by repeated reseeding of the MKL generator with variates from the `CPUGenerator`. This new implementation only seeds the `MKLGenerator` once using the `CPUGenerator`, and then keeps reusing the same `vslStream`, providing the full period of the RNG.

For the sake of reproducibility, the saving and restoring of the `MKLGenerator` has been linked to `CPUGenerator` state changes, and the former does not provide its own `get_state()` and `set_state()` functionality. The point was to keep the user experience identical to before -- they do not need to handle a separate `MKLGenerator` explicitly.

There already exists a test to check for repetition, based on the script from pytorch#132395. It can be found in `test_distribution.py` as `test_multinomial_sequential_draw()`. For the old (reseeded) implementation of the MKL `vslStream`, this test showed 21 repetitions. With this new implementation, the test gives 0 repetitions, as expected.
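
A hedged sketch of that repetition check (sizes and counts are illustrative, not those of the real test):

```python
# Illustrative only: compare sequential multinomial draws for exact repeats.
import torch

torch.manual_seed(0)
probs = torch.rand(10000)
draws = torch.stack([torch.multinomial(probs, 16) for _ in range(2000)])
# With the full-period MKL stream, sequential draws should never repeat.
unique = torch.unique(draws, dim=0)
assert unique.shape[0] == draws.shape[0], "repeated variates detected"
```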

Pull Request resolved: pytorch#151218
Approved by: https://github.com/malfet

Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
drisspg and others added 25 commits April 22, 2026 05:04
Before this PR
```Shell
 python /home/drisspg/meta/pytorch/agent_space/unwind_repro/bench_unwind.py
platform        : aarch64 / Linux
torch           : 2.12.0.dev20260414+cu130
git             : 1b7d09a
TORCH_SHOW_CPP  : 0
cuda available  : True
iters           : 20000 (warmup 500)

direct      20000 iters     3.740s    187.02 us/call
worker      20000 iters     0.018s      0.88 us/call
alloc       20000 iters     9.568s    478.39 us/call

```

After:

```Shell
 python /home/drisspg/meta/pytorch/agent_space/unwind_repro/bench_unwind.py
platform        : aarch64 / Linux
torch           : 2.13.0a0+git446033e
git             : 446033e
TORCH_SHOW_CPP  : 0
cuda available  : True
iters           : 20000 (warmup 500)

direct      20000 iters     0.008s      0.39 us/call
worker      20000 iters     0.007s      0.36 us/call
alloc       20000 iters     0.082s      4.08 us/call
```

Script I worked on with Claude as a proxy:
```Py
"""
Microbenchmark for the aarch64 C++ unwinder regression.

Background: on aarch64, torch::unwind::unwind() now does real frame-pointer
walking. Its get_stack_bounds() calls pthread_getattr_np() per invocation,
which on the *main thread* parses /proc/self/maps every time. Non-main
threads cache stack bounds in TLS at pthread_create, so they should be fast.

Three modes:
  direct  -- gather_traceback() in a tight loop on the main thread
  worker  -- same loop on a spawned thread (expected: much faster on aarch64)
  alloc   -- allocate many small CUDA tensors with _record_memory_history on,
             exercising the full CachingAllocator -> CapturedTraceback path

Usage:
  python bench_unwind.py --mode direct --iters 20000
  python bench_unwind.py --mode worker --iters 20000
  python bench_unwind.py --mode alloc  --iters 20000
  python bench_unwind.py --mode all    --iters 20000
"""

import argparse
import os
import platform
import threading
import time

import torch
from torch._C._profiler import gather_traceback

def bench_direct(iters: int) -> float:
    t0 = time.perf_counter()
    for _ in range(iters):
        gather_traceback(python=False, script=False, cpp=True)
    return time.perf_counter() - t0

def bench_worker(iters: int) -> float:
    elapsed: list[float] = []

    def run():
        t0 = time.perf_counter()
        for _ in range(iters):
            gather_traceback(python=False, script=False, cpp=True)
        elapsed.append(time.perf_counter() - t0)

    th = threading.Thread(target=run)
    th.start()
    th.join()
    return elapsed[0]

def bench_alloc(iters: int) -> float:
    if not torch.cuda.is_available():
        raise RuntimeError("alloc mode requires CUDA")
    torch.cuda.memory._record_memory_history(enabled="all", max_entries=iters * 2)
    try:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            torch.empty(16, device="cuda")
        torch.cuda.synchronize()
        return time.perf_counter() - t0
    finally:
        torch.cuda.memory._record_memory_history(enabled=None)

def report(label: str, iters: int, seconds: float) -> None:
    us_per_call = seconds * 1e6 / iters
    print(f"{label:<8} {iters:>8} iters  {seconds:8.3f}s  {us_per_call:8.2f} us/call")

def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--mode", choices=["direct", "worker", "alloc", "all"], default="all")
    ap.add_argument("--iters", type=int, default=20000)
    ap.add_argument("--warmup", type=int, default=500)
    args = ap.parse_args()

    print(f"platform        : {platform.machine()} / {platform.system()}")
    print(f"torch           : {torch.__version__}")
    print(f"git             : {torch.version.git_version}")
    print(f"TORCH_SHOW_CPP  : {os.environ.get('TORCH_SHOW_CPP_STACKTRACES', '<unset>')}")
    print(f"cuda available  : {torch.cuda.is_available()}")
    print(f"iters           : {args.iters} (warmup {args.warmup})")
    print()

    for _ in range(args.warmup):
        gather_traceback(python=False, script=False, cpp=True)

    if args.mode in ("direct", "all"):
        report("direct", args.iters, bench_direct(args.iters))
    if args.mode in ("worker", "all"):
        report("worker", args.iters, bench_worker(args.iters))
    if args.mode in ("alloc", "all"):
        if torch.cuda.is_available():
            report("alloc", args.iters, bench_alloc(args.iters))
        else:
            print("alloc    skipped (no CUDA)")

if __name__ == "__main__":
    main()

```

Pull Request resolved: pytorch#181018
Approved by: https://github.com/ezyang
…for disable_ftz (pytorch#180789)"

This reverts commit a39ec69.

Reverted pytorch#180789 on behalf of https://github.com/karthickai due to test failure on gpu runner ([comment](pytorch#180789 (comment)))
…h#178078)

All non-trivial FakeProcessGroup collectives now copy input to output
following single-rank semantics (rank 0 communicating with itself),
instead of being no-ops that leave output buffers uninitialized.

Fixed collectives:
- allgather_coalesced: copy input tensors to all output slots
- gather: copy input to output on root rank
- scatter: copy rank's input slot to output
- reduce_scatter: copy rank's chunk from input to output
- _reduce_scatter_base: copy rank's chunk to output
- reduce_scatter_tensor_coalesced: copy rank's chunk per tensor
- alltoall_base: copy input buffer to output buffer
- alltoall: copy input tensors to output tensors

This enables single-process validation of distributed code (e.g.
FSDP, MoE expert parallelism) without NaN from uninitialized memory.
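
A hedged single-process sketch of the semantics above (rank 0 communicating with itself); `FakeStore` and the `"fake"` backend are the existing test helpers:

```python
# Sketch: with this change, reduce_scatter copies rank 0's chunk instead of
# leaving the output buffer uninitialized.
import torch
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=2)
inp = torch.ones(8)
out = torch.empty(4)                   # numel(inp) / world_size
dist.reduce_scatter_tensor(out, inp)   # copies rank 0's chunk, not a no-op
assert not out.isnan().any()           # buffer is initialized after the fix
dist.destroy_process_group()
```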
Pull Request resolved: pytorch#178078
Approved by: https://github.com/xmfan
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: pytorch#181048
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
…ity for PrivateUse1 (pytorch#180421)

Fixes pytorch#179806

### Summary

This pull request improves support for custom backends in PyTorch by ensuring that when a private use backend is renamed, the profiler activity enum (`ProfilerActivity`) also reflects this change. This allows users to refer to the custom backend by its new name in profiling code, making the API more intuitive and consistent.

### Changes

Updates to profiler activity enum for renamed backends:

* When `rename_privateuse1_backend` is called, a new alias is added to `ProfilerActivity` so users can reference the backend by its new name (e.g., `ProfilerActivity.FOO`), and the alias is added to the enum's members.
* The test `test_external_module_register_with_renamed_backend` is updated to check that `ProfilerActivity` exposes the renamed backend as an alias for `PrivateUse1` and that the alias is present in the enum's members.
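
A hedged sketch of the aliasing behavior described above (`"foo"` is an illustrative backend name):

```python
# Illustrative check: after renaming, the profiler enum exposes the alias.
import torch
from torch.profiler import ProfilerActivity

torch.utils.rename_privateuse1_backend("foo")
assert ProfilerActivity.FOO == ProfilerActivity.PrivateUse1
assert "FOO" in ProfilerActivity.__members__
```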

Pull Request resolved: pytorch#180421
Approved by: https://github.com/fffrog
…ble_ftz (pytorch#180789)

This PR enables `disable_ftz` to flow from `config.eager_numerics.disable_ftz` into combo kernel `triton_meta`, which was previously dropped. Swapping to `**TritonKernel.triton_meta_common()` forwards `disable_ftz` alongside `enable_fp_fusion` / `launch_pdl`, matching standalone kernel behavior.

Pull Request resolved: pytorch#180789
Approved by: https://github.com/mlazos, https://github.com/eellison
)

An out= operator must:
- have the "Tag::out" tag.
- All of the mutable arguments must be write-only buffers (not read
  before write)
- it must return all mutable arguments in order.
- all of the mutable arguments must be kwarg-only arguments.

The last three restrictions are already restrictions that torchgen
places on native pytorch out= operators.
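
A hedged sketch of a custom op satisfying the four rules above; it assumes the renamed `torch.Tag.out` from this PR, and `mylib::add_out` is an illustrative name:

```python
# Sketch: an out= operator per the rules above (out tag, write-only kwarg-only
# mutable arg, mutable args returned in order).
import torch

torch.library.define(
    "mylib::add_out",
    "(Tensor x, Tensor y, *, Tensor(a!) out) -> Tensor(a!)",
    tags=(torch.Tag.out,),
)

@torch.library.impl("mylib::add_out", "CompositeExplicitAutograd")
def add_out(x, y, *, out):
    out.copy_(x + y)  # write-only: out is never read before being written
    return out        # all mutable args returned, in order
```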

This PR:
- renames "Tag::out_variant" to "Tag::out". This is not being used in the wild
  so there are no BC-concerns. We added this API in PyTorch 2.10 but
  didn't have a good user. I prefer not calling it a "variant" because a
  user may just have a custom operator that is an "out operator" and not
  have a functional variant of the operator, so I don't want to call it
  an "out variant".
- makes it so that all native out= operators get the out tag too.
- torch.library.define (the Python api) validates the schema for the
  above three constraints.

Test Plan:
- existing and new tests
- previously we did allow `Tag::out_variant` ops to return None/void.
  Again, this is not actually being used in the wild, so I'm breaking BC
  here.

Authored with Claude.

Pull Request resolved: pytorch#180851
Approved by: https://github.com/angelayi
Allow custom operators tagged with torch.Tag.out to go through
auto_functionalize during torch.compile. Previously these were
rejected because their returns have alias annotations.

- the functional form of e.g. `foo_out(Tensor x, *, Tensor(a!) out) -> Tensor(a!)`
  is auto_functionalized(foo, x, out_shape, out_stride, out_dtype, out_device).
  The semantics of this operator are
  `foo(Tensor x) -> Tensor`. This is because the out= args are
  write-only, and we need to store the metadata to know how to allocate
  the output.
- Re-inplacing this operator is easy. We just use empty_like to allocate
  the output tensor and call foo_out.

Authored with Claude.

Pull Request resolved: pytorch#180852
Approved by: https://github.com/angelayi
ghstack dependencies: pytorch#180851
Operators tagged with Tag.out have well-defined fake kernel semantics:
return the out= arguments in declaration order. This extends the
existing auto-fake-kernel mechanism (which handles mutable ops with no
returns) to also cover Tag.out ops, removing the need for users to
manually write trivial fake/meta kernels for these operators.

Also warns when users manually register a Meta kernel for a non-builtin
Tag.out operator, since the automatic registration is preferred and the
manual version is easy to get wrong.

Authored with Claude.
Pull Request resolved: pytorch#180987
Approved by: https://github.com/angelayi
ghstack dependencies: pytorch#180851, pytorch#180852
This PR adds a comprehensive guide for integrating out-of-tree accelerator backends with PyTorch's Cross-Repository CI Relay (CRCR). Currently supports L1 (Silent) integration that automatically triggers downstream CI on PyTorch PRs.
Pull Request resolved: pytorch#180976
Approved by: https://github.com/fffrog
…for disable_ftz (pytorch#180789)"

This reverts commit 89ed986.

Reverted pytorch#180789 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#180789 (comment)))
…orch#179760)

Enable the num_splits param for FA2; if num_splits=1, then for paged KV align the block size to match the standard kernel. For testing, I used torch.equal() to check numerics in the existing paging test.

Since this involves changes in the upstream flash attention repo, this also bumps the submodule commit.

Pull Request resolved: pytorch#179760
Approved by: https://github.com/drisspg
Adding the following:

- `set_and_normalize_fake_device` to set fake device when creating fake tensor and also does the normalization logic for GPU indices
- FakeTensorMode (stores `shape_env` and `converter` from Python)
- `is_fake` function checks if keyset has Fake key

Modifying the following existing functionality:

- ExtraMeta now has `fake_device_`
- `device_custom` will return `fake_device_` instead of `device_default` if the Fake key is active

Testing:

For testing, I used the make_fx tests higher in the stack:

* pytorch#178536
Pull Request resolved: pytorch#178429
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#178536, pytorch#178428
## Summary
Fix a crash when `torch.library.custom_op` marks an optional argument as mutated and the caller omits that argument so it falls back to its default.

## Root cause
`CustomOpDef`'s `ADInplaceOrView` wrapper computes mutated positional and keyword argument locations from the schema, then indexes directly into the runtime `args` and `kwargs`. The dispatcher strips default-valued arguments before this Python wrapper runs, so an omitted `out: Optional[Tensor] = None` is not present in `args`, which causes `IndexError: tuple index out of range` during version bump bookkeeping.

## Proposed fix
Call `utils.fill_defaults(schema, args, kwargs)` inside the `ADInplaceOrView` wrapper before incrementing versions for mutated arguments. This materializes omitted default values only for the bookkeeping step, while leaving the original `args` and `kwargs` untouched for the actual kernel dispatch.

Also add a regression test covering a custom op with `mutates_args={"out"}` and `out: Optional[Tensor] = None`, verifying that the omitted-argument call succeeds and that the explicit `out` path still bumps the version counter.
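
A hedged sketch matching that regression test (the op body is assumed; only the shape of the repro follows the description):

```python
# Sketch: custom op with a mutated optional argument that defaults to None.
from typing import Optional

import torch

@torch.library.custom_op("mylib::fill_out", mutates_args={"out"})
def fill_out(x: torch.Tensor, out: Optional[torch.Tensor] = None) -> None:
    if out is not None:
        out.copy_(x)

x = torch.randn(3)
fill_out(x)  # omitting `out` used to raise IndexError in version bookkeeping
buf = torch.empty(3)
before = buf._version
fill_out(x, out=buf)
assert buf._version > before  # explicit out path still bumps the version
```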

## Why this is the right long term fix
`fill_defaults` is already the helper PyTorch uses when Python-side custom op wrappers need schema-aligned inputs. Reusing it keeps the `ADInplaceOrView` bookkeeping consistent with dispatcher semantics for default arguments and fixes both positional and keyword-only mutated defaults without changing dispatch behavior.

## Testing
- Reproduced the crash against `torch-2.13.0.dev20260416+cpu` nightly: `IndexError: tuple index out of range`
- Ran `TestCustomOpAPI.test_mutated_optional_arg_default_none` from `test/test_custom_ops.py` against the patched `torch._library.custom_ops` on the same nightly CPU wheel

Fixes pytorch#180618

Drafted via Codex, published after manual review by @bobrenjc93
Pull Request resolved: pytorch#180621
Approved by: https://github.com/zou3519
…rch#172343)"

This reverts commit 938df06.

Reverted pytorch#172343 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See diff D101859905 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#172089 (comment)))
This reverts commit 0e045b5.

Reverted pytorch#172089 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See diff D101859905 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#172089 (comment)))
…ytorch#179766)"

This reverts commit 0309973.

Reverted pytorch#179766 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See diff D101855555 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#179766 (comment)))
…pytorch#178078)"

This reverts commit c6d36a4.

Reverted pytorch#178078 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a regression in trunk ([comment](pytorch#178078 (comment)))
…80892)"

This reverts commit d4b9bd5.

Reverted pytorch#180892 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See diff D101859905 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#180892 (comment)))
…ytorch#179686)

Summary:

When models use `torch.func.jvp` (e.g., via JVP-based forward AD), the
`_make_dual` function was not properly handled during `torch.export` tracing.

Three issues fixed:
1. **FakeTensor assertion**: The predispatch wrapper routed through
   `PreDispatchTorchFunctionMode.__torch_function__`, but `_make_dual` was not
   in the recognized function list, causing it to fall through to the raw C++
   implementation which produced real Tensors instead of FakeTensors.
2. **SpecViolationError**: After fixing (1), `_make_dual` correctly appears as a
   `call_function` node in the exported graph, but the verifier's
   `_allowed_torch_functions` allowlist was missing it.
3. **GradTrackingTensor in symbolic_shapes**: `_free_symbols` did not unwrap
   `GradTrackingTensor` (dual tensors from forward AD), causing assertion
   failures when collecting unbacked symbols during export.

Changes:
- Add `_make_dual` predispatch wrapper in `predispatch.py`
- Update `forward_ad.py` to use the predispatch-wrapped version
- Add `_make_dual` to the `proxy_tensor.py` recognized function list
- Add `_make_dual` to verifier `_allowed_torch_functions` allowlist
- Handle `is_gradtrackingtensor` in `_free_symbols` (symmetric with `is_batchedtensor`)
- Add export test for JVP models (`test_gradient_tracking_tensors`)

Test Plan:
- `buck test fbcode//caffe2/test:test_export -- test_gradient_tracking_tensors`
  (ASAN crash in boost regex teardown — infra issue, not test logic failure)
- IR dry run WF 1058511731 (ads_mtml_adfinder_heavyweight_offsite_cvr_model) SUCCEEDED
- IR dry runs for 11+ other APS models succeeded with these fixes

Differential Revision: D99692201

Pull Request resolved: pytorch#179686
Approved by: https://github.com/tugsbayasgalan
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.github/scripts/build_triton_wheel.py
@rocm-repo-management-api (Bot) commented Apr 22, 2026

Jenkins build for 6ecd5d450647bce39b28e46517aa7fe462aeb44a commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during base docker image building:

```
#53 15.13 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/rocm/bin:/opt/rocm/llvm/bin:/opt/conda/envs/py_3.12/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/opt/rocm/lib: git clone --recursive https://github.com/ROCm/triton triton
#53 15.15 Cloning into 'triton'...
#53 85.40 + cd triton
#53 85.40 + as_jenkins git checkout '<<<<<<<' HEAD ba5c1517e6f5906761cf5783036efb587026208d ======= 88b227e23f0445f3f695bad05bbf1a363b4f50e0 '>>>>>>>' upstream/main
#53 85.40 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/rocm/bin:/opt/rocm/llvm/bin:/opt/conda/envs/py_3.12/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/opt/rocm/lib: git checkout '<<<<<<<' HEAD ba5c1517e6f5906761cf5783036efb587026208d ======= 88b227e23f0445f3f695bad05bbf1a363b4f50e0 '>>>>>>>' upstream/main
#53 85.41 error: pathspec '<<<<<<<' did not match any file(s) known to git
#53 85.41 error: pathspec 'HEAD' did not match any file(s) known to git
#53 85.41 error: pathspec 'ba5c1517e6f5906761cf5783036efb587026208d' did not match any file(s) known to git
#53 85.41 error: pathspec '=======' did not match any file(s) known to git
#53 85.41 error: pathspec '88b227e23f0445f3f695bad05bbf1a363b4f50e0' did not match any file(s) known to git
#53 85.41 error: pathspec '>>>>>>>' did not match any file(s) known to git
```
