
[AUTOGENERATED] develop_IFU_20260422 #3174

Open
pragupta wants to merge 1918 commits into develop from develop_IFU_20260422

Conversation

@pragupta (Collaborator)

rocm_base: 293ee53

shino16 and others added 30 commits April 16, 2026 22:58
…179833)

Fixes pytorch#178871

When a `scatter_reduce`'s output has size 1 in every dimension, `_fixed_indexer` skips all size-1 dims and the store index collapses to a constant integer. The [constant-index path](https://github.com/pytorch/pytorch/blob/cd2590172dbda49308d77a5cc17a30b87c97e42b/torch/_inductor/codegen/triton.py#L3352-L3378) in `TritonKernel.indexing()` skips mask computation for non-fixed-config kernels, so OOB threads accumulate stale values. See `mask_vars = OrderedSet()` below:

https://github.com/pytorch/pytorch/blob/cd2590172dbda49308d77a5cc17a30b87c97e42b/torch/_inductor/codegen/triton.py#L3371-L3378

This PR adds a `force_mask` parameter to `indexing()` that prevents the constant-index optimization from omitting the mask. `store()` passes `force_mask=True` for `atomic_add`.
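
A hedged repro sketch of the failure mode described above (shapes are assumed; the real test is `test_scatter_reduce_fused_broadcast_non_power_of_2` in `CommonTemplate`):

```python
# Hedged repro sketch, not the actual test: scatter_reduce into an output
# whose every dim has size 1, so the store index folds to a constant and,
# before this fix, OOB threads went unmasked on the atomic_add path.
import torch

def fn(src, index):
    out = torch.zeros(1, device=src.device)
    return out.scatter_reduce(0, index, src, reduce="sum")

device = "cuda" if torch.cuda.is_available() else "cpu"
for n in (3, 5, 7, 9, 17, 33, 48):  # non-power-of-2 sizes from the test plan
    src = torch.randn(n, device=device)
    index = torch.zeros(n, dtype=torch.int64, device=device)
    torch.testing.assert_close(torch.compile(fn)(src, index), fn(src, index))
```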

## Test plan

New `test_scatter_reduce_fused_broadcast_non_power_of_2` in `CommonTemplate`, which is essentially a copy of the repro, verifies scatter_reduce + scalar broadcast fusion for several non-power-of-2 input sizes (3, 5, 7, 9, 17, 33, 48).

Pull Request resolved: pytorch#179833
Approved by: https://github.com/jansel
Replace the runtime loop over mutated_inp_runtime_indices with a codegen'd
straight-line function that resolves set_/as_strided_/copy_/detach().copy_
branches at compile time based on each input's mutation metadata.

Updated tests to use requires_grad inputs since inference mutations are
handled inside the graph (keep_input_mutations) and don't use the runtime
epilogue. Removed metadata-only mutation test as transpose_ graph-breaks
through dynamo.

Single data mutation (`x.mul_(2)`):

    def _apply_mutations(orig_inputs, updated_inputs):
        orig_inputs[0].copy_(updated_inputs[0])

Multiple data mutations (`a.mul_(2)`, `c.add_(1)`):

    def _apply_mutations(orig_inputs, updated_inputs):
        orig_inputs[0].copy_(updated_inputs[0])
        orig_inputs[1].copy_(updated_inputs[1])

Leaf mutation under no_grad (`x.detach().mul_(2)`):

    def _apply_mutations(orig_inputs, updated_inputs):
        if orig_inputs[0].requires_grad: orig_inputs[0].detach().copy_(updated_inputs[0])
        else: orig_inputs[0].copy_(updated_inputs[0])

Mutation step in isolation (us/call):

| Case | Before (loop) | After (codegen) | Speedup |
|---|---|---|---|
| 1 mutation / 2 inputs | 11.34 us | 10.96 us | 1.03x |
| 2 mutations / 4 inputs | 22.02 us | 21.95 us | 1.00x |
| 5 mutations / 10 inputs | 54.77 us | 54.12 us | 1.01x |

Performance is dominated by the copy_ calls themselves, so the Python loop
overhead removal is negligible in absolute terms. The primary benefit is
resolving the set_/as_strided_/copy_/detach().copy_ branch at compile time
rather than checking metadata flags per input at runtime.
Pull Request resolved: pytorch#179600
Approved by: https://github.com/Lucaskabela
…d Typing (pytorch#180359)

## Summary
- define `TraceableWrapperSubclass` in `torch/utils/_python_dispatch.py` as the canonical traceable wrapper subclass protocol, including the flatten/unflatten invariants and supported `__tensor_unflatten__` forms
- thread that protocol through the fake, AOT, and proxy helper boundaries so the contract is referenced directly instead of living in casts and comments
- add a focused runtime test that covers both `@staticmethod` and `@classmethod` `__tensor_unflatten__` implementations

## Root cause
The traceable wrapper subclass contract was only implied by repeated `hasattr(..., "__tensor_flatten__")` checks and an internal structural type with inaccurate method signatures. That left the canonical meaning of the protocol split across helpers, comments, and downstream assumptions.

## Proposed fix
Introduce an explicit `TraceableWrapperSubclass` protocol in `torch/utils/_python_dispatch.py`, move the contract documentation there, make `is_traceable_wrapper_subclass()` reference that canonical protocol, and update the affected fake/proxy/AOT helpers to use the protocol type directly.

## Why this is the right long term fix
This keeps the runtime behavior unchanged while giving PT2 wrapper-subclass support one documented source of truth. Future call sites can reference the same protocol instead of re-expressing the contract, which reduces drift in both signatures and invariants.
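
A hedged approximation of the protocol shape (the canonical definition is the one added in `torch/utils/_python_dispatch.py`; the signatures below follow the documented `__tensor_flatten__`/`__tensor_unflatten__` contract):

```python
# Sketch only: approximates the protocol described above, not the exact
# definition from torch/utils/_python_dispatch.py.
from typing import Any, Protocol, runtime_checkable

import torch

@runtime_checkable
class TraceableWrapperSubclass(Protocol):
    # Returns (names of inner-tensor attributes, extra context metadata).
    def __tensor_flatten__(self) -> tuple[list[str], Any]: ...

    # Rebuilds the subclass from inner tensors + context. Real implementations
    # may declare this as either a @staticmethod or a @classmethod.
    @staticmethod
    def __tensor_unflatten__(
        inner_tensors: dict[str, torch.Tensor],
        ctx: Any,
        outer_size: Any,
        outer_stride: Any,
    ) -> torch.Tensor: ...
```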

## Testing
- `python3 -m compileall torch/utils/_python_dispatch.py torch/_subclasses/fake_tensor.py torch/_functorch/_aot_autograd/subclass_utils.py torch/fx/experimental/proxy_tensor.py test/test_python_dispatch.py`
- `PYTHONPATH=/tmp/repos/pytorch/pytorch python3 test/test_python_dispatch.py TestPythonDispatch.test_traceable_wrapper_subclass_protocol_runtime_check TestPythonDispatch.test_make_fx_with_subclass TestPythonDispatch.test_make_wrapper_subclass_propagates_metadata`
- `PYTHONPATH=/tmp/repos/pytorch/pytorch python3 test/dynamo/test_subclasses.py SubclassTests.test_deferred_init_subclass_init_not_traced`

Drafted via Codex, published after manual review by @Lucaskabela
Pull Request resolved: pytorch#180359
Approved by: https://github.com/Skylion007

Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
…ch#180607)

Drop cu128/cu129 and add cu132 so the vLLM wheel build matrix matches the CUDA versions PyTorch nightly publishes (cu130, cu132). cu126 is intentionally omitted since vLLM does not release a cu126 wheel upstream. Also enables aarch64 cu130/cu132, removing the prior TODO.

### Testing

https://github.com/pytorch/pytorch/actions/runs/24533482175
Pull Request resolved: pytorch#180607
Approved by: https://github.com/atalman
)

Fix isinstance(x, OpaqueBase) to cover all opaque types

isinstance(x, OpaqueBase) is unreliable for checking if something is an
opaque object: it misses value-type opaques (e.g. Enum) and reference
types that use metaclass=OpaqueBaseMeta without inheriting OpaqueBase
(which is all that registration requires).

Fix this by making OpaqueBaseMeta.__instancecheck__ delegate to
is_opaque_value() when cls is OpaqueBase, so isinstance(x, OpaqueBase)
and case OpaqueBase() now correctly cover all registered opaque types.
Also update is_opaque_value() to see through FakeScriptObject wrappers,
matching the existing __instancecheck__ behavior for concrete subclasses.
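
A minimal self-contained sketch of the delegation described above; `_OPAQUE_TYPES` and `is_opaque_value` are simplified stand-ins for the real registry helpers:

```python
# Simplified stand-ins: the real is_opaque_value() also sees through
# FakeScriptObject wrappers and handles value-type opaques such as Enums.
_OPAQUE_TYPES: set[type] = set()

def is_opaque_value(obj) -> bool:
    return type(obj) in _OPAQUE_TYPES

class OpaqueBaseMeta(type):
    def __instancecheck__(cls, obj):
        # Delegate only for the base class, so isinstance(x, OpaqueBase)
        # covers every registered opaque type, not just subclasses.
        if cls is OpaqueBase:
            return is_opaque_value(obj)
        return super().__instancecheck__(obj)

class OpaqueBase(metaclass=OpaqueBaseMeta):
    pass

class MyHandle(metaclass=OpaqueBaseMeta):  # registered without inheriting
    pass

_OPAQUE_TYPES.add(MyHandle)
assert isinstance(MyHandle(), OpaqueBase)
```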

Authored with Claude.

Pull Request resolved: pytorch#180530
Approved by: https://github.com/Lucaskabela
# Motivation
Support `torch.xpu.device_count` in the alternative `multiprocessing poison fork` scenario by leveraging the [pyzes](https://pypi.org/project/pyzes/0.1.0/) package.

# Design
- If pyzes is not installed, `torch.xpu.device_count` will not support the `multiprocessing poison fork` scenario.
- Respect `ZE_AFFINITY_MASK`. However, if an L0 `COMPOSITE`-style hierarchy mask is detected (e.g. `ZE_AFFINITY_MASK=0.0, 0.1`), we fall back to `c10::xpu::device_count` (the SYCL implementation). In this case, the `multiprocessing poison fork` scenario will not be supported.
- Align the behavior with `c10::xpu::device_count`: prefer reporting dGPUs, and only report iGPUs when no visible dGPU is found.
- Ensure both iGPU and dGPU handling remains consistent with `c10::xpu::device_count`.

# Tests
Add tests to validate that we correctly handle `ZE_AFFINITY_MASK`.
Pull Request resolved: pytorch#178496
Approved by: https://github.com/gujinghui, https://github.com/EikanWang
…#179600)"

This reverts commit 45af0d6.

Reverted pytorch#179600 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#179600 (comment)))
## Summary

Add gfx1103 to `PYTORCH_ROCM_ARCH` in the manywheel build script.

gfx1103 is the RDNA3 iGPU in AMD Phoenix/HawkPoint APUs (Ryzen 7040/8040 series). These are widely deployed laptop/desktop APUs. The ROCm compiler already supports gfx1103, and runtime support has already been merged:

   - hipBLASLt: pytorch#172180
   - aotriton: pytorch#168351

However, gfx1103 is missing from the wheel build arch list, so no native kernels are included in the published wheels. Users must set `HSA_OVERRIDE_GFX_VERSION=11.0.0` (or `11.0.2`), which causes GPU page faults and hard hangs due to ISA differences:

```
amdgpu: [gfxhub] page fault (src_id:0 ring:169 vmid:0 pasid:0)
amdgpu:   in page starting at address 0x0000000000000000 from client 10
amdgpu:   Faulty UTCL2 client ID: CPC (0x5)
amdgpu:   WALKER_ERROR: 0x1
amdgpu:   MAPPING_ERROR: 0x1
```

This follows the same pattern as pytorch#147761, which added gfx1102.

## Test plan

- [x] Verify gfx1103 wheel builds successfully in ROCm CI
- [x] Verified locally that gfx1103 is accepted by the ROCm compiler (`rocminfo` reports gfx1103, `/dev/kfd` present)
- [x] Confirmed runtime support already merged (hipBLASLt, aotriton)

Pull Request resolved: pytorch#179653
Approved by: https://github.com/jeffdaily
…ytorch#179782)

On ROCm, require MX scaled_mm swizzle inputs to provide one value for both A and B, and enforce that both are NO_SWIZZLE. Update test_passed_swizzle_arrays to use ROCm-specific expectations and add coverage for the explicit NO_SWIZZLE value check.
For nvfp4, the swizzle check is skipped, but the call eventually fails with the error "NVFP4 scaling not supported on ROCM".
Miscellaneous fix: correct the swizzle validation error messages to use the right singular/plural value wording.

Fixes pytorch#180073

Pull Request resolved: pytorch#179782
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
Move the Node class definition from function.h into a new node.h
header. function.h becomes a thin wrapper that includes node.h and
provides the free functions (create_gradient_edge, collect_next_edges,
etc.) that depend on variable.h.

This separation is needed to break the include cycle between
function.h and variable.h: function.h includes variable.h, and
variable.h needs the complete Node type for upcoming intrusive_ptr
conversion. node.h avoids this cycle by not including variable.h or
graph_task.h.

Authored with Claude.
Pull Request resolved: pytorch#179765
Approved by: https://github.com/albanD, https://github.com/soulitzer
ghstack dependencies: pytorch#179764
…piler arguments with spaces and add perf flag for xpu sycl-tla. (pytorch#178130)

Pull Request resolved: pytorch#178130
Approved by: https://github.com/Skylion007
Refactors `BuiltinVariable.call_getattr` into a dedicated `GetAttrBuiltinVariable` variable tracker.

Here's a concise summary of the full `GetAttrBuiltinVariable` change:

## Methods

**`call_function`**
The entry point. It does two things before delegating to `_call_getattr`:

1. **`LazyVariableTracker` realization**: If `obj` is a lazy wrapper, `_call_getattr`'s early `has_pending_mutation_of_attr(obj, name)` check would use the wrapper's identity as the lookup key, not the realized VT's. Since `side_effects.store_attr_mutations` is keyed by Python object identity (`id()`), the lookup would miss — even if a mutation was recorded — causing the wrong value to be returned (concretely, a grad assignment would be lost, returning `None`). Realizing the lazy VT first gives the correct identity. This mirrors the old behavior in `BuiltinVariable._make_handler`, which realized lazy args before dispatching.

2. **Constant-fold try/except**: If `_call_getattr` raises `Unsupported` and all arguments are Python constants, we evaluate `getattr()` directly and return a constant. This avoids an unnecessary graph break for things like `getattr(SomeClass, "__name__")`. This replicates what `_make_handler`'s `call_self_handler` used to do in `BuiltinVariable`.
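
An illustrative sketch of the entry-point shape described in (1) and (2); this is not the real dynamo code, and `_call_getattr`'s exact signature is assumed:

```python
# Sketch only: realize-then-constant-fold shape of call_function.
from torch._dynamo.exc import Unsupported
from torch._dynamo.variables import ConstantVariable

def call_function(self, tx, args, kwargs):
    # (1) Realize lazy VTs so side_effects lookups key on the realized identity.
    args = [arg.realize() for arg in args]
    try:
        return self._call_getattr(tx, *args)
    except Unsupported:
        # (2) If every arg is a Python constant, fold getattr() directly
        # instead of graph-breaking, e.g. getattr(SomeClass, "__name__").
        if all(arg.is_python_constant() for arg in args):
            return ConstantVariable.create(
                getattr(*(arg.as_python_constant() for arg in args))
            )
        raise
```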

**`_call_getattr`**
Contains the actual logic, moved from `BuiltinVariable.call_getattr`. There are some changes as well:

**1. `hasattr` call changed from `self.call_hasattr` to `obj.call_obj_hasattr`**

The `default` argument handling previously called `self.call_hasattr(tx, obj, name_var)` (a method on `BuiltinVariable`). In the new code it calls `obj.call_obj_hasattr(tx, name)`. This is functionally equivalent — `call_hasattr` just dispatched to `obj.call_obj_hasattr` anyway — but `call_hasattr` only existed on `BuiltinVariable`, which `GetAttrBuiltinVariable` no longer inherits from.

**2. `NamedTupleVariable` added to the dispatch list**

The old code's `isinstance` check for the "known types" branch included:
```
TensorVariable, ConstantVariable, DefaultDictVariable, DistributedVariable,
UserDefinedClassVariable, UserDefinedObjectVariable
```
The new code adds `NamedTupleVariable` to that list. This means named tuple attribute access now goes through `var_getattr` explicitly rather than falling through to the `else` catch-all. In practice named tuples have a source and `var_getattr` handles them correctly, so this is a correctness improvement.

**3. `cmp_name_to_op_mapping` branch removed for `TorchInGraphFunctionVariable`**

The old code had a special case for `TorchInGraphFunctionVariable` where, if `name in cmp_name_to_op_mapping`, it returned a `GetAttrVariable`. The new code collapses this into the single `else: return GetAttrVariable(obj, name, source=source)` fallback. Looking at the git history, this branch was already dead code by the time of the rebase (commit `84be734` removed it upstream), so the new code is simply consistent with that.

Pull Request resolved: pytorch#179033
Approved by: https://github.com/anijain2305
Includes the following commits:

- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
Pull Request resolved: pytorch#180606
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007
…ch#180277)

Fixes pytorch#180011
Fixes pytorch#180012
Fixes pytorch#180013
Fixes pytorch#180014
Fixes pytorch#180015
Fixes pytorch#180016
Fixes pytorch#180021
Fixes pytorch#179952
Fixes pytorch#180022
Fixes pytorch#180025
Fixes pytorch#180027
Fixes pytorch#180028
Fixes pytorch#180029
Fixes pytorch#180549

This appears to have regressed in pytorch#177715, where combo-kernel autotuning started materializing alternate per-subkernel configs. This change keeps Triton HIP compile options such as `waves_per_eu`, `matrix_instr_nonkdim`, and `kpack` kernel-wide when combo kernels rewrite per-subkernel configs, avoiding invalid names like `waves_per_eu_0`. It also routes both baseline combo configs and sequential combo autotune trials through the same kwarg rewrite helper and adds a regression test covering combo-kernel kwarg rewriting for AMD special config args.

Made with [Cursor](https://cursor.com)

Pull Request resolved: pytorch#180277
Approved by: https://github.com/karthickai, https://github.com/jeffdaily
…rch#180497)

Fixes pytorch#180396

Issue: When torch.compile(fn, mode="reduce-overhead") captures a CUDA graph, custom stream ops (event record/wait) resolve the "current stream" from the external object registry. But the registry was populated during the dynamo bytecode prologue with the trace-time default stream, not the cudagraph capture stream. This caused cudaErrorStreamCaptureUnsupported.

Fix: Register the current stream at index 0 in the external object registry at trace time. The inductor wrapper emits set_external_object_by_index(0, torch.cuda.current_stream()) at runtime, so during cudagraph capture, index 0 resolves to the actual capture stream instead of the stale default stream.

Real code changes (~30 lines across 4 files):

1. graph_bytecode_inputs.py (+9): Added a CURRENT_STREAM_INDEX = 0 constant and set_external_object_by_index(), which updates an entry at runtime and keeps the object alive via the existing keep_alive list.
2. variables/streams.py (+~20): SymbolicStreamState.__init__ now registers the current stream at index 0 when the registry is fresh. Simplified _get_stream_arg to just return user_object_index (no conditional logic). Added back cur_stream_id(), which output_graph.py needs.
3. variables/builder.py (+4/-4): When wrapping a stream with CurrentStreamSource, use CURRENT_STREAM_INDEX instead of allocating a new index.
4. codegen/wrapper.py (+7): Emit set_external_object_by_index(0, torch.cuda.current_stream()) at the top of the wrapper so custom ops see the actual runtime stream (the capture stream during cudagraph recording).
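
A hedged sketch of the runtime contract in changes (1) and (4); the dict and keep-alive details are simplified stand-ins for graph_bytecode_inputs.py:

```python
# Simplified stand-ins for the registry helpers described above.
import torch

CURRENT_STREAM_INDEX = 0
_external_objects: dict[int, object] = {}
_keep_alive: list[object] = []

def set_external_object_by_index(index: int, obj: object) -> None:
    _external_objects[index] = obj
    _keep_alive.append(obj)  # mirror the real helper's lifetime guarantee

def wrapper_prologue() -> None:
    # What the inductor wrapper emits first: refresh index 0 so that, during
    # cudagraph capture, stream lookups resolve the capture stream rather
    # than the stale trace-time default stream.
    set_external_object_by_index(CURRENT_STREAM_INDEX, torch.cuda.current_stream())
```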

Pull Request resolved: pytorch#180497
Approved by: https://github.com/Lucaskabela, https://github.com/eellison
# Motivation
In my understanding, PyTorch XPU doesn't support the oneDNN block format, so this code path should never be reached.

Pull Request resolved: pytorch#166861
Approved by: https://github.com/EikanWang
This reverts commit 26ab646.

Reverted pytorch#179653 on behalf of https://github.com/huydhn due to Sorry for reverting your change, I need to revert this temporarily to avoid building new Docker images ([comment](pytorch#179653 (comment)))
…)"

This reverts commit 24c273c.

Reverted pytorch#180575 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#180575 (comment)))
Reland of pytorch#179600. The commit message is identical to the first landing of this change above (codegen'd straight-line mutation epilogue replacing the runtime loop over mutated_inp_runtime_indices).
Pull Request resolved: pytorch#179600
Approved by: https://github.com/Lucaskabela
…#179600)"

This reverts commit be5f4f9.

Reverted pytorch#179600 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#179600 (comment)))
…am (pytorch#180497)"

This reverts commit 9f34b01.

Reverted pytorch#180497 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#180497 (comment)))
This change was separated from pytorch#170609, and adds a context flag to select the backend kernel for depthwise convolution. This allows easy benchmarking of cuDNN performance relative to fallback performance for heuristic tuning (and could also be used for debugging), while maintaining default behavior for standard use.

There are three options: `["auto", "cudnn", "native"]`. The `"auto"` option chooses a kernel based on the existing heuristic function. The `"cudnn"` and `"native"` flags skip the heuristic, instead always dispatching to a cuDNN kernel or a fallback kernel, respectively.
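
A hypothetical usage sketch: `depthwise_backend` is a stand-in attribute name, since the real knob is defined in pytorch#176500 and may be spelled differently.

```python
# Hypothetical benchmarking loop; the flag name below is a stand-in, not
# the actual API from pytorch#176500.
import torch

conv = torch.nn.Conv2d(64, 64, 3, padding=1, groups=64, device="cuda")  # depthwise
x = torch.randn(8, 64, 56, 56, device="cuda")

for backend in ("auto", "cudnn", "native"):
    torch.backends.cudnn.depthwise_backend = backend  # stand-in attribute
    torch.cuda.synchronize()
    # time conv(x) here to compare cuDNN vs. the native fallback
    y = conv(x)
```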

Pull Request resolved: pytorch#176500
Approved by: https://github.com/malfet, https://github.com/eqy
…per (pytorch#180586)

Mainly to avoid potentially surprising `NoValidChoicesError` on Blackwell.

When a program uses `torch.compile` with `F.scaled_mm` and/or `F.scaled_grouped_mm` with `use_fast_accum=True` and has been running on Hopper GPUs, running that same program on Blackwell would unexpectedly raise `NoValidChoicesError`.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

Pull Request resolved: pytorch#180586
Approved by: https://github.com/mlazos
Second reland of pytorch#179600; the commit message is identical to the first landing above.
Pull Request resolved: pytorch#179600
Approved by: https://github.com/Lucaskabela
This PR aims to fix the issue from pytorch#132395 by implementing a new `MKLGeneratorImpl` that stores a consistent, global `vslStream` for use in random numbers generation. This path was previously disabled due to a problem of repeating variates, caused by repeated reseeding of the MKL generator with variates from the `CPUGenerator`. This new implementation only seeds the `MKLGenerator` once using the `CPUGenerator`, and then keeps reusing the same `vslStream`, providing the full period of the RNG.

For the sake of reproducibility, the saving and restoring of the `MKLGenerator` has been linked to `CPUGenerator` state changes, and the former does not provide its own `get_state()` and `set_state()` functionality. The point was to keep the user experience identical to before -- they do not need to handle a separate `MKLGenerator` explicitly.

There already exists a test to check for repetition, based on the script from pytorch#132395. It can be found in `test_distribution.py` as `test_multinomial_sequential_draw()`. For the old (reseeded) implementation of the MKL `vslStream`, this test showed 21 repetitions. With this new implementation, the test gives 0 repetitions, as expected.
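
A hedged sketch of that repetition check (sizes and counts are illustrative, not those of the real test):

```python
# Illustrative only: compare sequential multinomial draws for exact repeats.
import torch

torch.manual_seed(0)
probs = torch.rand(10000)
draws = torch.stack([torch.multinomial(probs, 16) for _ in range(2000)])
# With the full-period MKL stream, sequential draws should never repeat.
unique = torch.unique(draws, dim=0)
assert unique.shape[0] == draws.shape[0], "repeated variates detected"
```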

Pull Request resolved: pytorch#151218
Approved by: https://github.com/malfet

Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com>
drisspg and others added 25 commits April 22, 2026 05:04
Before this PR
```Shell
 python /home/drisspg/meta/pytorch/agent_space/unwind_repro/bench_unwind.py
platform        : aarch64 / Linux
torch           : 2.12.0.dev20260414+cu130
git             : 1b7d09a
TORCH_SHOW_CPP  : 0
cuda available  : True
iters           : 20000 (warmup 500)

direct      20000 iters     3.740s    187.02 us/call
worker      20000 iters     0.018s      0.88 us/call
alloc       20000 iters     9.568s    478.39 us/call

```

After:

```Shell
 python /home/drisspg/meta/pytorch/agent_space/unwind_repro/bench_unwind.py
platform        : aarch64 / Linux
torch           : 2.13.0a0+git446033e
git             : 446033e
TORCH_SHOW_CPP  : 0
cuda available  : True
iters           : 20000 (warmup 500)

direct      20000 iters     0.008s      0.39 us/call
worker      20000 iters     0.007s      0.36 us/call
alloc       20000 iters     0.082s      4.08 us/call
```

Script I worked on with Claude as a proxy:
```Py
"""
Microbenchmark for the aarch64 C++ unwinder regression.

Background: on aarch64, torch::unwind::unwind() now does real frame-pointer
walking. Its get_stack_bounds() calls pthread_getattr_np() per invocation,
which on the *main thread* parses /proc/self/maps every time. Non-main
threads cache stack bounds in TLS at pthread_create, so they should be fast.

Three modes:
  direct  -- gather_traceback() in a tight loop on the main thread
  worker  -- same loop on a spawned thread (expected: much faster on aarch64)
  alloc   -- allocate many small CUDA tensors with _record_memory_history on,
             exercising the full CachingAllocator -> CapturedTraceback path

Usage:
  python bench_unwind.py --mode direct --iters 20000
  python bench_unwind.py --mode worker --iters 20000
  python bench_unwind.py --mode alloc  --iters 20000
  python bench_unwind.py --mode all    --iters 20000
"""

import argparse
import os
import platform
import threading
import time

import torch
from torch._C._profiler import gather_traceback

def bench_direct(iters: int) -> float:
    t0 = time.perf_counter()
    for _ in range(iters):
        gather_traceback(python=False, script=False, cpp=True)
    return time.perf_counter() - t0

def bench_worker(iters: int) -> float:
    elapsed: list[float] = []

    def run():
        t0 = time.perf_counter()
        for _ in range(iters):
            gather_traceback(python=False, script=False, cpp=True)
        elapsed.append(time.perf_counter() - t0)

    th = threading.Thread(target=run)
    th.start()
    th.join()
    return elapsed[0]

def bench_alloc(iters: int) -> float:
    if not torch.cuda.is_available():
        raise RuntimeError("alloc mode requires CUDA")
    torch.cuda.memory._record_memory_history(enabled="all", max_entries=iters * 2)
    try:
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            torch.empty(16, device="cuda")
        torch.cuda.synchronize()
        return time.perf_counter() - t0
    finally:
        torch.cuda.memory._record_memory_history(enabled=None)

def report(label: str, iters: int, seconds: float) -> None:
    us_per_call = seconds * 1e6 / iters
    print(f"{label:<8} {iters:>8} iters  {seconds:8.3f}s  {us_per_call:8.2f} us/call")

def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--mode", choices=["direct", "worker", "alloc", "all"], default="all")
    ap.add_argument("--iters", type=int, default=20000)
    ap.add_argument("--warmup", type=int, default=500)
    args = ap.parse_args()

    print(f"platform        : {platform.machine()} / {platform.system()}")
    print(f"torch           : {torch.__version__}")
    print(f"git             : {torch.version.git_version}")
    print(f"TORCH_SHOW_CPP  : {os.environ.get('TORCH_SHOW_CPP_STACKTRACES', '<unset>')}")
    print(f"cuda available  : {torch.cuda.is_available()}")
    print(f"iters           : {args.iters} (warmup {args.warmup})")
    print()

    for _ in range(args.warmup):
        gather_traceback(python=False, script=False, cpp=True)

    if args.mode in ("direct", "all"):
        report("direct", args.iters, bench_direct(args.iters))
    if args.mode in ("worker", "all"):
        report("worker", args.iters, bench_worker(args.iters))
    if args.mode in ("alloc", "all"):
        if torch.cuda.is_available():
            report("alloc", args.iters, bench_alloc(args.iters))
        else:
            print("alloc    skipped (no CUDA)")

if __name__ == "__main__":
    main()

```

Pull Request resolved: pytorch#181018
Approved by: https://github.com/ezyang
…for disable_ftz (pytorch#180789)"

This reverts commit a39ec69.

Reverted pytorch#180789 on behalf of https://github.com/karthickai due to test failure on gpu runner ([comment](pytorch#180789 (comment)))
…h#178078)

All non-trivial FakeProcessGroup collectives now copy input to output
following single-rank semantics (rank 0 communicating with itself),
instead of being no-ops that leave output buffers uninitialized.

Fixed collectives:
- allgather_coalesced: copy input tensors to all output slots
- gather: copy input to output on root rank
- scatter: copy rank's input slot to output
- reduce_scatter: copy rank's chunk from input to output
- _reduce_scatter_base: copy rank's chunk to output
- reduce_scatter_tensor_coalesced: copy rank's chunk per tensor
- alltoall_base: copy input buffer to output buffer
- alltoall: copy input tensors to output tensors

This enables single-process validation of distributed code (e.g.
FSDP, MoE expert parallelism) without NaN from uninitialized memory.
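
A hedged single-process sketch of the semantics above (rank 0 communicating with itself); `FakeStore` and the `"fake"` backend are the existing test helpers:

```python
# Sketch: with this change, reduce_scatter copies rank 0's chunk instead of
# leaving the output buffer uninitialized.
import torch
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore

dist.init_process_group("fake", store=FakeStore(), rank=0, world_size=2)
inp = torch.ones(8)
out = torch.empty(4)                   # numel(inp) / world_size
dist.reduce_scatter_tensor(out, inp)   # copies rank 0's chunk, not a no-op
assert not out.isnan().any()           # buffer is initialized after the fix
dist.destroy_process_group()
```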
Pull Request resolved: pytorch#178078
Approved by: https://github.com/xmfan
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: pytorch#181048
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
…ity for PrivateUse1 (pytorch#180421)

Fixes pytorch#179806

### Summary

This pull request improves support for custom backends in PyTorch by ensuring that when a private use backend is renamed, the profiler activity enum (`ProfilerActivity`) also reflects this change. This allows users to refer to the custom backend by its new name in profiling code, making the API more intuitive and consistent.

### Changes

Updates to profiler activity enum for renamed backends:

* When `rename_privateuse1_backend` is called, a new alias is added to `ProfilerActivity` so users can reference the backend by its new name (e.g., `ProfilerActivity.FOO`), and the alias is added to the enum's members.
* The test `test_external_module_register_with_renamed_backend` is updated to check that `ProfilerActivity` exposes the renamed backend as an alias for `PrivateUse1` and that the alias is present in the enum's members.
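
A hedged sketch of the aliasing behavior described above (`"foo"` is an illustrative backend name):

```python
# Illustrative check: after renaming, the profiler enum exposes the alias.
import torch
from torch.profiler import ProfilerActivity

torch.utils.rename_privateuse1_backend("foo")
assert ProfilerActivity.FOO == ProfilerActivity.PrivateUse1
assert "FOO" in ProfilerActivity.__members__
```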

Pull Request resolved: pytorch#180421
Approved by: https://github.com/fffrog
…ble_ftz (pytorch#180789)

This PR enables `disable_ftz` to flow from `config.eager_numerics.disable_ftz` into combo kernel `triton_meta`, which was previously dropped. Swapping to `**TritonKernel.triton_meta_common()` forwards `disable_ftz` alongside `enable_fp_fusion` / `launch_pdl`, matching standalone kernel behavior.

Pull Request resolved: pytorch#180789
Approved by: https://github.com/mlazos, https://github.com/eellison
)

An out= operator must:
- have the "Tag::out" tag.
- All of the mutable arguments must be write-only buffers (not read
  before write)
- it must return all mutable arguments in order.
- all of the mutable arguments must be kwarg-only arguments.

The last three restrictions are already restrictions that torchgen
places on native pytorch out= operators.
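
A hedged sketch of a custom op satisfying the four rules above; it assumes the renamed `torch.Tag.out` from this PR, and `mylib::add_out` is an illustrative name:

```python
# Sketch: an out= operator per the rules above (out tag, write-only kwarg-only
# mutable arg, mutable args returned in order).
import torch

torch.library.define(
    "mylib::add_out",
    "(Tensor x, Tensor y, *, Tensor(a!) out) -> Tensor(a!)",
    tags=(torch.Tag.out,),
)

@torch.library.impl("mylib::add_out", "CompositeExplicitAutograd")
def add_out(x, y, *, out):
    out.copy_(x + y)  # write-only: out is never read before being written
    return out        # all mutable args returned, in order
```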

This PR:
- renames "Tag::out_variant" to "Tag::out". This is not being used in the wild
  so there are no BC-concerns. We added this API in PyTorch 2.10 but
  didn't have a good user. I prefer not calling it a "variant" because a
  user may just have a custom operator that is an "out operator" and not
  have a functional variant of the operator, so I don't want to call it
  an "out variant".
- makes it so that all native out= operators get the out tag too.
- torch.library.define (the Python api) validates the schema for the
  above three constraints.

Test Plan:
- existing and new tests
- previously we did allow `Tag::out_variant` ops to return None/void.
  Again, this is not actually being used in the wild, so I'm breaking BC
  here.

Authored with Claude.

Pull Request resolved: pytorch#180851
Approved by: https://github.com/angelayi
Allow custom operators tagged with torch.Tag.out to go through
auto_functionalize during torch.compile. Previously these were
rejected because their returns have alias annotations.

- the functional form of e.g. `foo_out(Tensor x, *, Tensor(a!) out) -> Tensor(a!)`
  is auto_functionalized(foo, x, out_shape, out_stride, out_dtype, out_device).
  The semantics of this operator are
  `foo(Tensor x) -> Tensor`. This is because the out= args are
  write-only, and we need to store the metadata to know how to allocate
  the output.
- Re-inplacing this operator is easy. We just use empty_like to allocate
  the output tensor and call foo_out.

Authored with Claude.

Pull Request resolved: pytorch#180852
Approved by: https://github.com/angelayi
ghstack dependencies: pytorch#180851
Operators tagged with Tag.out have well-defined fake kernel semantics:
return the out= arguments in declaration order. This extends the
existing auto-fake-kernel mechanism (which handles mutable ops with no
returns) to also cover Tag.out ops, removing the need for users to
manually write trivial fake/meta kernels for these operators.

Also warns when users manually register a Meta kernel for a non-builtin
Tag.out operator, since the automatic registration is preferred and the
manual version is easy to get wrong.

Authored with Claude.
Pull Request resolved: pytorch#180987
Approved by: https://github.com/angelayi
ghstack dependencies: pytorch#180851, pytorch#180852
This PR adds a comprehensive guide for integrating out-of-tree accelerator backends with PyTorch's Cross-Repository CI Relay (CRCR). Currently supports L1 (Silent) integration that automatically triggers downstream CI on PyTorch PRs.
Pull Request resolved: pytorch#180976
Approved by: https://github.com/fffrog
…for disable_ftz (pytorch#180789)"

This reverts commit 89ed986.

Reverted pytorch#180789 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#180789 (comment)))
…orch#179760)

Enable the num_splits param for FA2; if num_splits=1, then for paged KV align the block size to match the standard kernel. For testing, I used torch.equal() to check numerics in the existing paging test.

Since this involves changes in the upstream flash attention repo, this also bumps the submodule commit.

Pull Request resolved: pytorch#179760
Approved by: https://github.com/drisspg
Adding the following:

- `set_and_normalize_fake_device` to set fake device when creating fake tensor and also does the normalization logic for GPU indices
- FakeTensorMode (stores `shape_env` and `converter` from Python)
- `is_fake` function checks if keyset has Fake key

Modifying the following existing functionality:

- ExtraMeta now has `fake_device_`
- `device_custom` will return `fake_device_` instead of `device_default` if the Fake key is active

Testing:

For testing, I used the make_fx tests higher in the stack:

* pytorch#178536
Pull Request resolved: pytorch#178429
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#178536, pytorch#178428
## Summary
Fix a crash when `torch.library.custom_op` marks an optional argument as mutated and the caller omits that argument so it falls back to its default.

## Root cause
`CustomOpDef`'s `ADInplaceOrView` wrapper computes mutated positional and keyword argument locations from the schema, then indexes directly into the runtime `args` and `kwargs`. The dispatcher strips default-valued arguments before this Python wrapper runs, so an omitted `out: Optional[Tensor] = None` is not present in `args`, which causes `IndexError: tuple index out of range` during version bump bookkeeping.

## Proposed fix
Call `utils.fill_defaults(schema, args, kwargs)` inside the `ADInplaceOrView` wrapper before incrementing versions for mutated arguments. This materializes omitted default values only for the bookkeeping step, while leaving the original `args` and `kwargs` untouched for the actual kernel dispatch.

Also add a regression test covering a custom op with `mutates_args={"out"}` and `out: Optional[Tensor] = None`, verifying that the omitted-argument call succeeds and that the explicit `out` path still bumps the version counter.
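
A hedged sketch matching that regression test (the op body is assumed; only the shape of the repro follows the description):

```python
# Sketch: custom op with a mutated optional argument that defaults to None.
from typing import Optional

import torch

@torch.library.custom_op("mylib::fill_out", mutates_args={"out"})
def fill_out(x: torch.Tensor, out: Optional[torch.Tensor] = None) -> None:
    if out is not None:
        out.copy_(x)

x = torch.randn(3)
fill_out(x)  # omitting `out` used to raise IndexError in version bookkeeping
buf = torch.empty(3)
before = buf._version
fill_out(x, out=buf)
assert buf._version > before  # explicit out path still bumps the version
```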

## Why this is the right long term fix
`fill_defaults` is already the helper PyTorch uses when Python-side custom op wrappers need schema-aligned inputs. Reusing it keeps the `ADInplaceOrView` bookkeeping consistent with dispatcher semantics for default arguments and fixes both positional and keyword-only mutated defaults without changing dispatch behavior.

## Testing
- Reproduced the crash against `torch-2.13.0.dev20260416+cpu` nightly: `IndexError: tuple index out of range`
- Ran `TestCustomOpAPI.test_mutated_optional_arg_default_none` from `test/test_custom_ops.py` against the patched `torch._library.custom_ops` on the same nightly CPU wheel

Fixes pytorch#180618

Drafted via Codex, published after manual review by @bobrenjc93
Pull Request resolved: pytorch#180621
Approved by: https://github.com/zou3519
…rch#172343)"

This reverts commit 938df06.

Reverted pytorch#172343 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See diff D101859905 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#172089 (comment)))
This reverts commit 0e045b5.

Reverted pytorch#172089 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See diff D101859905 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#172089 (comment)))
…ytorch#179766)"

This reverts commit 0309973.

Reverted pytorch#179766 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See diff D101855555 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#179766 (comment)))
…pytorch#178078)"

This reverts commit c6d36a4.

Reverted pytorch#178078 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a regression in trunk ([comment](pytorch#178078 (comment)))
…80892)"

This reverts commit d4b9bd5.

Reverted pytorch#180892 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See diff D101859905 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](pytorch#180892 (comment)))
…ytorch#179686)

Summary:

When models use `torch.func.jvp` (e.g., via JVP-based forward AD), the
`_make_dual` function was not properly handled during `torch.export` tracing.

Three issues fixed:
1. **FakeTensor assertion**: The predispatch wrapper routed through
   `PreDispatchTorchFunctionMode.__torch_function__`, but `_make_dual` was not
   in the recognized function list, causing it to fall through to the raw C++
   implementation which produced real Tensors instead of FakeTensors.
2. **SpecViolationError**: After fixing (1), `_make_dual` correctly appears as a
   `call_function` node in the exported graph, but the verifier's
   `_allowed_torch_functions` allowlist was missing it.
3. **GradTrackingTensor in symbolic_shapes**: `_free_symbols` did not unwrap
   `GradTrackingTensor` (dual tensors from forward AD), causing assertion
   failures when collecting unbacked symbols during export.

Changes:
- Add `_make_dual` predispatch wrapper in `predispatch.py`
- Update `forward_ad.py` to use the predispatch-wrapped version
- Add `_make_dual` to the `proxy_tensor.py` recognized function list
- Add `_make_dual` to verifier `_allowed_torch_functions` allowlist
- Handle `is_gradtrackingtensor` in `_free_symbols` (symmetric with `is_batchedtensor`)
- Add export test for JVP models (`test_gradient_tracking_tensors`)

Test Plan:
- `buck test fbcode//caffe2/test:test_export -- test_gradient_tracking_tensors`
  (ASAN crash in boost regex teardown — infra issue, not test logic failure)
- IR dry run WF 1058511731 (ads_mtml_adfinder_heavyweight_offsite_cvr_model) SUCCEEDED
- IR dry runs for 11+ other APS models succeeded with these fixes

Differential Revision: D99692201

Pull Request resolved: pytorch#179686
Approved by: https://github.com/tugsbayasgalan
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.github/scripts/build_triton_wheel.py
@rocm-repo-management-api (Bot) commented Apr 22, 2026

Jenkins build for 6ecd5d450647bce39b28e46517aa7fe462aeb44a commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during base docker image building:

```
#53 15.13 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/rocm/bin:/opt/rocm/llvm/bin:/opt/conda/envs/py_3.12/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/opt/rocm/lib: git clone --recursive https://github.com/ROCm/triton triton
#53 15.15 Cloning into 'triton'...
#53 85.40 + cd triton
#53 85.40 + as_jenkins git checkout '<<<<<<<' HEAD ba5c1517e6f5906761cf5783036efb587026208d ======= 88b227e23f0445f3f695bad05bbf1a363b4f50e0 '>>>>>>>' upstream/main
#53 85.40 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/opt/rocm/bin:/opt/rocm/llvm/bin:/opt/conda/envs/py_3.12/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/opt/rocm/lib: git checkout '<<<<<<<' HEAD ba5c1517e6f5906761cf5783036efb587026208d ======= 88b227e23f0445f3f695bad05bbf1a363b4f50e0 '>>>>>>>' upstream/main
#53 85.41 error: pathspec '<<<<<<<' did not match any file(s) known to git
#53 85.41 error: pathspec 'HEAD' did not match any file(s) known to git
#53 85.41 error: pathspec 'ba5c1517e6f5906761cf5783036efb587026208d' did not match any file(s) known to git
#53 85.41 error: pathspec '=======' did not match any file(s) known to git
#53 85.41 error: pathspec '88b227e23f0445f3f695bad05bbf1a363b4f50e0' did not match any file(s) known to git
#53 85.41 error: pathspec '>>>>>>>' did not match any file(s) known to git
```
