[AUTOGENERATED] rocm7.1_internal_testing_IFU_2025-09-24#2678
pragupta merged 694 commits into rocm7.1_internal_testing from rocm7.1_internal_testing_IFU_2025-09-24
Conversation
… C++ (pytorch#161695) I initially didn't see good results porting this, but it was apparently because of pybind11 function calling overhead. (pybind11's object-handling primitives seem fine enough.) I'm interested in setting up nanobind, but this demonstrates it's not blocking. Differential Revision: [D81530102](https://our.internmc.facebook.com/intern/diff/D81530102) Pull Request resolved: pytorch#161695 Approved by: https://github.com/ezyang
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: pytorch#163304 Approved by: https://github.com/pytorchbot
Pull Request resolved: pytorch#162310 Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi ghstack dependencies: pytorch#162862
) Benchmark script:
```python
import time

import numpy as np
import torch


def main() -> None:
    for i in range(10):
        block_indices = np.arange(16384, dtype=np.int32)
        block_indices = block_indices.reshape(-1).clip(max=255)
        batch_indices = np.zeros(16384, dtype=np.int64)
        virtual_batches = 32
        block_table = torch.randn(32, 256)
        start = time.perf_counter()
        block_table[batch_indices, block_indices].view(virtual_batches, -1)
        end = time.perf_counter()
        time_elapsed_ms = (end - start) * 1000
        print(f"Function execution time: {time_elapsed_ms:.1f}ms")


if __name__ == "__main__":
    main()
```
Before:
```
(a) [ezyang@devvm006.dkl0 ~/local/b/pytorch] python ben.py
Function execution time: 28.5ms
Function execution time: 12.9ms
Function execution time: 12.6ms
Function execution time: 13.5ms
Function execution time: 12.0ms
Function execution time: 13.4ms
Function execution time: 12.9ms
Function execution time: 12.9ms
Function execution time: 13.1ms
Function execution time: 13.0ms
```
After:
```
Function execution time: 17.8ms
Function execution time: 2.5ms
Function execution time: 1.3ms
Function execution time: 2.5ms
Function execution time: 2.3ms
Function execution time: 1.3ms
Function execution time: 2.4ms
Function execution time: 2.5ms
Function execution time: 2.5ms
Function execution time: 2.4ms
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: pytorch#163280 Approved by: https://github.com/SherlockNoMad, https://github.com/cyyever
Fixes pytorch#163035 Pull Request resolved: pytorch#163036 Approved by: https://github.com/kulinseth, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
This reverts commit 3016616. Reverted pytorch#162310 on behalf of https://github.com/malfet due to Breaks some windows tests ([comment](pytorch#162862 (comment)))
This reverts commit 2dcd153. Reverted pytorch#162862 on behalf of https://github.com/malfet due to Breaks some windows tests ([comment](pytorch#162862 (comment)))
…k) (pytorch#161571) Summary: dispatch MTIA to function foreach_tensor_maximum_scalar_kernel_mtia_ Test Plan: CI Rollback Plan: Differential Revision: D81086607 Pull Request resolved: pytorch#161571 Approved by: https://github.com/malfet
… LAMBDA_GUARD (pytorch#162525)" This reverts commit 5f630d2. Reverted pytorch#162525 on behalf of https://github.com/anijain2305 due to internal tests fail ([comment](pytorch#162525 (comment)))
…rsion (pytorch#162695)" This reverts commit a8432bc. Reverted pytorch#162695 on behalf of https://github.com/anijain2305 due to internal failure at https://fburl.com/workplace/qiitdlp6 ([comment](pytorch#162695 (comment)))
Summary: This PR is extracted from pytorch#162542, to make the original PR easier to review. This PR only contains cosmetic changes. Pull Request resolved: pytorch#163115 Approved by: https://github.com/tianyu-l ghstack dependencies: pytorch#162539, pytorch#162540, pytorch#162541
Summary: This issue proposes implementing an XPU kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU and CUDA. Motivation: Same as pytorch#159325. Pull Request resolved: pytorch#160938 Approved by: https://github.com/EikanWang, https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/jerryzh168
… /.ci/docker/ci_commit_pins (pytorch#162063) * [Dependabot] Update(deps): Bump transformers Bumps [transformers](https://github.com/huggingface/transformers) from 4.54.0 to 4.56.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](huggingface/transformers@v4.54.0...v4.56.0) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.56.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Refresh results Signed-off-by: Huy Do <huydhn@gmail.com> * Another round of updates Signed-off-by: Huy Do <huydhn@gmail.com> * Another round of update Signed-off-by: Huy Do <huydhn@gmail.com> * Hopefully the last round of update Signed-off-by: Huy Do <huydhn@gmail.com> * Plz Signed-off-by: Huy Do <huydhn@gmail.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Huy Do <huydhn@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Huy Do <huydhn@gmail.com>
…torch#163205) It seems `TEST_CUDA` is set to true even for ROCm (MI200) jobs. This changes the `if TEST_CUDA` check to an `else` condition to avoid running symmetric memory UTs on MI200. For other non-ROCm architectures, it should return true, and the tests can be skipped using other skip decorators. Pull Request resolved: pytorch#163205 Approved by: https://github.com/ezyang Co-authored-by: Jeff Daily <jeff.daily@amd.com>
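A minimal sketch of the gating described above, assuming the `TEST_CUDA` / `TEST_WITH_ROCM` constants from `torch.testing._internal.common_utils`; `rocm_arch_supports_symm_mem` is a hypothetical helper, not the actual patch:
```python
from torch.testing._internal.common_utils import TEST_CUDA, TEST_WITH_ROCM


def symm_mem_available() -> bool:
    # Sketch only: check ROCm first so MI200 jobs (where TEST_CUDA is also
    # true) do not fall through to the CUDA branch.
    if TEST_WITH_ROCM:
        return rocm_arch_supports_symm_mem()  # hypothetical arch check
    elif TEST_CUDA:
        return True
    return False
```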
…ch#163127) PR pytorch#151360 added mx fp8 and fp4 support on ROCm. 1. However, on recent upstream, the scaling function in Blas.cpp along with test_matmul_cuda changes triggered failures. This patch corrects the is_blockwise_1x32_scaling function code. 2. Fixes the m, n, k dimensions for the ROCm mx case. 3. Modifies FP4E2M1FN_LARGEST_POW2 (largest power of 2 representable in `torch.float4_e2m1fn_x2`) to 2. This resulted in a higher SQNR value for the mx fp4 test. Testing result on gfx950 w/ ROCm7.0: PYTORCH_TEST_WITH_ROCM=1 python test/test_matmul_cuda.py -k test_blockwise -v Ran 452 tests in 22.698s OK passed 111 This is the same as before (when PR 151360 was merged). Pull Request resolved: pytorch#163127 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
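For reference, a small sketch of how an SQNR value like the one mentioned above is typically computed when comparing a low-precision (mx fp4/fp8) result against a reference; this is the standard formula, not the test suite's exact helper:
```python
import torch


def sqnr(ref: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
    # Signal-to-quantization-noise ratio in dB: higher means the quantized
    # result is closer to the reference.
    signal = ref.float().pow(2).mean()
    noise = (ref.float() - test.float()).pow(2).mean()
    return 10 * torch.log10(signal / noise)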
…n H100 (pytorch#162022) only cuBLAS supports float32 output and cuBLAS only supports rowwise for SM 9.0 Intended to land after pytorch#161305 Pull Request resolved: pytorch#162022 Approved by: https://github.com/ngimel
…onfig (pytorch#163318)
```Shell
Up to 4x perf boost

🔝 Top 5 Performance Differences (by absolute %): shape: (5, 7)
attn_type | dtype          | shape(B,Hq,M,Hkv,N,D)          | TFlops BWD (base) | TFlops BWD (better_configs) | speedup_over_base | pct_delta
noop      | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 124.775035        | 532.580435                  | 4.268325          | 326.832527
noop      | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 124.494557        | 519.798488                  | 4.175271          | 317.527078
causal    | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 123.984189        | 512.877391                  | 4.136635          | 313.663544
noop      | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128)   | 122.827725        | 496.195958                  | 4.039772          | 303.977164
causal    | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 123.826738        | 484.244647                  | 3.910663          | 291.066303

🔺 Top 5 Cases Where better_configs (change) is Faster than base (baseline): same five rows as the table above.

🔻 Top 5 Cases Where better_configs (change) is Slower than base (baseline): shape: (5, 7)
attn_type     | dtype          | shape(B,Hq,M,Hkv,N,D)          | TFlops BWD (base) | TFlops BWD (better_configs) | speedup_over_base | pct_delta
document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128)   | 267.502004        | 250.728732                  | 0.937297          | -6.270335
document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128)    | 248.510516        | 235.210874                  | 0.946483          | -5.351742
document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128)  | 282.856295        | 271.806926                  | 0.960936          | -3.906354
document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64)    | 282.212695        | 280.519092                  | 0.993999          | -0.600116
document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128)  | 295.864073        | 294.477894                  | 0.995315          | -0.468519

📊 Performance Summary:
============================================================
Baseline: base
Change: better_configs
Geometric Mean Speedup (change over baseline): 1.9954x
Geometric Mean % Change: +99.54%
Median Speedup (change over baseline): 2.1590x
Speedup Std Dev: 0.9800
Valid Comparisons: 60/60
```
Pull Request resolved: pytorch#163318 Approved by: https://github.com/BoyuanFeng
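For readers checking the summary numbers, a small sketch of how such aggregate statistics are typically derived from the per-case speedups (the sample values below are taken from the tables above; this is not the benchmark harness itself):
```python
import math
import statistics

# speedup = TFlops(better_configs) / TFlops(base), one value per benchmarked case
speedups = [4.268325, 4.175271, 4.136635, 4.039772, 3.910663, 0.937297, 0.946483]

geo_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
median = statistics.median(speedups)
print(f"geomean={geo_mean:.4f}x  median={median:.4f}x")
```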
For a custom op with multiple outputs, we will see the following generated code:
```
buf1 = op1(arg0)
buf3 = buf1[0]
buf4 = buf1[1]
del buf1 # <--- if buf1 is not accessed in the future
```
If `buf1` is not accessed in the future, it's good to deallocate it early, so we don't delay the `del` until both `buf3` and `buf4` are no longer used. Note that `buf3` and `buf4` hold references to the data, so `del buf1` does not prevent their usage.
However, when there are mutating args, we don't see `del buf1` immediately.
```python
@torch.library.custom_op(
"mylib::op1",
mutates_args=["x"],
schema="(Tensor(a!)? x) -> (Tensor, Tensor)",
device_types="cuda",
)
def op1(x) -> tuple[torch.Tensor, torch.Tensor]:
x = x + 1
return (x + 1, x + 2)
```
<img width="661" height="821" alt="image" src="https://github.com/user-attachments/assets/3d1d1f5a-9749-4652-bb02-da593c78702d" />
Why? Because `buf3` is a MultiOutput with `buf1` as input and believes `buf1` (an output of FallbackKernel op1) has inputs that alias its output.
https://github.com/pytorch/pytorch/blob/72fedf05752069c9e8b97c64397aedf6ee2bf5ec/torch/_inductor/ir.py#L7976-L7982
According to `[NOTE: FallbackKernel supported operators]`, as a mutating op that is auto-functionalizable, buf1's outputs should NOT alias any of its inputs. This PR improves `get_inputs_that_alias_output` of FallbackKernel accordingly.
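A minimal sketch of the intended behavior (helper and attribute names below are assumptions for illustration, not the actual patch):
```python
# Sketch only: if the fallback op is auto-functionalizable, its outputs do not
# alias its inputs, so the kernel can report an empty alias set and `del buf1`
# can be emitted as soon as the MultiOutput buffers have been extracted.
def get_inputs_that_alias_output(kernel):
    if op_is_auto_functionalizable(kernel.op_overload):  # hypothetical predicate
        return []
    # conservative default: assume every input may alias an output
    return [inp.get_name() for inp in kernel.inputs]
```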
Use case: [moe custom op in vllm](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L2057-L2064)
Pull Request resolved: pytorch#163227
Approved by: https://github.com/zou3519
…TMA template for GEMMs (pytorch#163147) Summary: X-link: meta-pytorch/tritonbench#432 Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template. This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`. Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which apply scaling prior to the GEMM computation. In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff. Test Plan: Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel. Benchmarking command: ``` TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log ``` Rollback Plan: Differential Revision: D82597111 Pull Request resolved: pytorch#163147 Approved by: https://github.com/njriasan
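As a rough illustration of the scaling-epilogue semantics described above, an eager-mode reference sketch (not the Inductor/Triton template), assuming rowwise scales with `scale_a` of shape `(M, 1)` and `scale_b` of shape `(N, 1)`:
```python
import torch


def scaled_gemm_reference(a_fp8, b_fp8, scale_a, scale_b):
    # GEMM in high precision, then scaling applied as an epilogue, which is
    # the per-tensor / per-row case the template targets.
    acc = a_fp8.to(torch.float32) @ b_fp8.to(torch.float32)  # (M, N)
    return acc * scale_a * scale_b.t()  # broadcast: (M, 1) * (1, N)
```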
As in title. The auto pin update was merged without running the vllm workflow. Pull Request resolved: pytorch#163353 Approved by: https://github.com/malfet, https://github.com/wdvr
…ytorch#162772)" This reverts commit 49d30f9. Reverted pytorch#162772 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#162772 (comment)))
This reverts commit c9b80c4. Reverted pytorch#162590 on behalf of https://github.com/malfet due to This breaks CUDA 13 builds ([comment](pytorch#162590 (comment)))
…ght and kernel_width that overflows to be exactly 0 (pytorch#155989) Fixes pytorch#155981 Pull Request resolved: pytorch#155989 Approved by: https://github.com/malfet
Undo changes introduced in pytorch#160956 as driver has been updated to 580 for both fleets Fixes pytorch#163342 Pull Request resolved: pytorch#163349 Approved by: https://github.com/seemethere
This code is delicious spaghetti: sometimes the Python version is defined in a jinja template (see pytorch#162297), sometimes in a shell script (see pytorch#162877), but this time around it's in a python file (and there is another one, `generate_binary_build_matrix.py`, that defines `FULL_PYTHON_VERSIONS`). Pull Request resolved: pytorch#163339 Approved by: https://github.com/clee2000
Fixes pytorch#156740 Adds explicit `Any` typing to `*args` and `**kwargs` in `nn.Module.__init__()` to fix type checker errors in strict mode. Pull Request resolved: pytorch#157389 Approved by: https://github.com/Skylion007, https://github.com/Raman-RH
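For illustration, a minimal sketch of the annotation pattern described (the real signature lives in `torch/nn/modules/module.py`; the class below is just a stand-in):
```python
from typing import Any


class Module:
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        # Accepts arbitrary positional/keyword arguments; explicit `Any`
        # keeps strict-mode type checkers from flagging subclasses that
        # forward extra arguments to super().__init__().
        super().__init__()
```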
Improves error message reported on pytorch#163321 Pull Request resolved: pytorch#163350 Approved by: https://github.com/Skylion007, https://github.com/xmfan
…e_format in compile (pytorch#163017) Fixes pytorch#161010 by making `clone_meta` match the semantics of strides for eager mode. This is: * Case 1: Tensor is_non_overlapping_and_dense; in this case, stride should match input tensor stride * Case 2: Otherwise, stride should be contiguous computed from input tensor using `compute_elementwise_output_strides` Pull Request resolved: pytorch#163017 Approved by: https://github.com/williamwen42, https://github.com/xmfan Co-authored-by: morrison-turnansky <mturnans@redhat.com>
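A minimal sketch of the stride rule described above, using the helpers from `torch._prims_common` mentioned in the description (the function name here is illustrative, not the actual `clone_meta` patch):
```python
import torch
from torch._prims_common import (
    compute_elementwise_output_strides,
    is_non_overlapping_and_dense,
)


def clone_output_strides(t: torch.Tensor):
    if is_non_overlapping_and_dense(t):
        # Case 1: preserve the input tensor's strides
        return t.stride()
    # Case 2: contiguous-style strides computed from the input tensor
    return compute_elementwise_output_strides(t)
```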
Which equals `%CONDA_PARENT_DIR%/Miniconda3`; replace this pattern with `%CONDA_ROOT_DIR%` throughout the codebase. Pull Request resolved: pytorch#163341 Approved by: https://github.com/clee2000 ghstack dependencies: pytorch#163339
This change may also resolve pytorch#161789, though verification is still needed. PR pytorch#130472 introduced the problem of freeing the same address without cleaning up its metadata; per the discussion below, it is reverted here. Pull Request resolved: pytorch#162950 Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/syed-ahmed
This PR optimizes the `extract_file` functions: 1. `normalize_path_separator` the dest path for Windows. 2. Add verbose error messages: a. On Linux, add the mz_zip error string. b. On Windows, add the mz_zip error string and the Windows error code. For the UT `test_package_user_managed_weight`: <img width="1910" height="442" alt="image" src="https://github.com/user-attachments/assets/6a63eda1-70ce-40fb-9681-adc955463884" /> It still has an issue with error code `32`; checking https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499- shows the verbose name is `ERROR_SHARING_VIOLATION`. It is a little complex to debug, so I will continue working on it in a further PR. Pull Request resolved: pytorch#163718 Approved by: https://github.com/desertfire
…63712) Fixes pytorch#163483 Pull Request resolved: pytorch#163712 Approved by: https://github.com/ezyang, https://github.com/kwen2501
…torch#163783) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#163783 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…rch#163619) Fixes pytorch#162923 ## Test Result ### Before <img width="985" height="889" alt="image" src="https://github.com/user-attachments/assets/41de5cfa-7b25-4ba4-ade8-a6df745dcb30" /> ### After <img width="913" height="977" alt="image" src="https://github.com/user-attachments/assets/b6c06860-8db3-4b5d-9d46-31ece01fb04d" /> Pull Request resolved: pytorch#163619 Approved by: https://github.com/jbschlosser
Related to pytorch#161167 Pull Request resolved: pytorch#163778 Approved by: https://github.com/malfet
…sting_IFU_2025-09-24
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/common/install_rocm.sh
#	.ci/docker/requirements-ci.txt
#	CMakeLists.txt
#	aten/src/ATen/native/Normalization.cpp
#	aten/src/ATen/native/miopen/BatchNorm_miopen.cpp
#	requirements-build.txt
#	test/nn/test_convolution.py
#	test/test_binary_ufuncs.py
#	test/test_nn.py
#	torch/_inductor/runtime/triton_heuristics.py
#	torch/testing/_internal/common_utils.py
Jenkins build for f3e8213081d7506307ac2224c497d69833b4f534 commit finished as NOT_BUILT
Jenkins build for f3e8213081d7506307ac2224c497d69833b4f534 commit finished as FAILURE
Jenkins build for 723c4d90538fa73cd96d2a90c3124a0843888efd commit finished as NOT_BUILT
Jenkins build for 723c4d90538fa73cd96d2a90c3124a0843888efd commit finished as FAILURE
.ci/docker/common/install_rocm.sh (outdated diff)
```
<<<<<<< HEAD
# temporary hipblasLT dependency install
apt install libmsgpackc2
```
@pragupta This change was supposed to be temporary as per f1ad49a (cc @pruthvistony)
Can we please ascertain if this is really needed for ROCm 7.1 mainline?
cc @jeffdaily to comment on whether this is needed for the ROCm7.0 CI upstream enablement
ROCm 7 CI upgrade doesn't have this line. What was this fixing?
```
pass
<<<<<<< HEAD
...
if not max_autotune_enabled:  # Don't filter if tuning enabled
```
@jataylo to double-check this conflict resolution in case not already consulted
Spoke to @naromero77amd; he mentioned that these changes went into rocm7.1_internal_testing but the upstream PR is still open, so we want to keep the rocm7.1_internal_testing changes in place. He pointed me to his upstream PR here: pytorch#163908
I tried to keep the local changes, but some of them were not trivial since Nick's upstream PR is based on a newer upstream. @naromero77amd / @jataylo can you please confirm that the latest commit I pushed corrects the merge of this file?
```
gfx_arch = prop.gcnArchName.split(":")[0]
if gfx_arch in arch_list:
    return True
return False
```
@pragupta We should track the upstreaming of this patch in one of our stories. cc @iupaikov-amd
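For reference while upstreaming, a self-contained sketch of the quoted arch check (only the four quoted lines come from the patch; the surrounding function, device index, and the fact that `gcnArchName` is ROCm-only are assumptions here):
```python
import torch


def gfx_arch_in_list(arch_list) -> bool:
    if not torch.cuda.is_available():
        return False
    prop = torch.cuda.get_device_properties(0)
    # e.g. "gfx90a" from "gfx90a:sramecc+:xnack-" (ROCm builds only)
    gfx_arch = prop.gcnArchName.split(":")[0]
    if gfx_arch in arch_list:
        return True
    return False
```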
723c4d9 to 0ad8381
Jenkins build for 77f4534c70a0ef9b961ea42762f1a3f2a13df6e4 commit finished as NOT_BUILT
```
elif reduction_hint == ReductionHint.OUTER:
    configs = configs[-1:]
elif reduction_hint == ReductionHint.OUTER_TINY:
    configs = [
    tiny_configs = [
        triton_config_reduction(
            size_hints,
            2 * (256 // rnumel) if rnumel <= 256 else 1,
```
Here are some corrections (a sketch of the resolved block follows below):
- `tiny_configs` should be defined before the if-clause, i.e. before the line that starts with:
```python
# defer to more autotuning, initially
if "y" in size_hints:
```
- The two `elif` comments on lines 2984 and 2986 should be indented one level in. In other words, they are inside the `elif not max_autotune_enabled` branch.
- For `elif reduction_hint == ReductionHint.OUTER_TINY:` it should just be:
```python
configs = tiny_configs
```
- For the outermost if/elif clause, there is also the `else` part:
```python
else:
    # If autotune is enabled append tiny configs
    for conf in tiny_configs:
        if conf not in configs:
            configs.append(conf)
```
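Putting the corrections together, a hedged sketch of how the resolved block could look (`max_autotune_enabled`, `reduction_hint`, `size_hints`, `rnumel`, and `triton_config_reduction` are taken from `triton_heuristics.py`; the remaining arguments and earlier branches are elided, so this is a shape of the fix, not the committed code):
```python
tiny_configs = [
    triton_config_reduction(
        size_hints,
        2 * (256 // rnumel) if rnumel <= 256 else 1,
        # ... remaining positional args unchanged ...
    )
]
# defer to more autotuning, initially
if "y" in size_hints:
    pass
elif not max_autotune_enabled:  # Don't filter if tuning enabled
    # ... earlier reduction_hint branches unchanged ...
    if reduction_hint == ReductionHint.OUTER:
        configs = configs[-1:]
    elif reduction_hint == ReductionHint.OUTER_TINY:
        configs = tiny_configs
else:
    # If autotune is enabled append tiny configs
    for conf in tiny_configs:
        if conf not in configs:
            configs.append(conf)
```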
Thank you for catching these! Addressed them with the new commit. Please verify.
naromero77amd left a comment:
@pragupta and I worked together to resolve conflicts in triton_heuristics.py.
Jenkins build for 77f4534c70a0ef9b961ea42762f1a3f2a13df6e4 commit finished as FAILURE
rocm_base: 48cac8f
Tested using: registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16643_ubuntu22.04_py3.10_pytorch_rocm7.1_internal_testing_681e60e1
"core" default UTs: export TESTS_TO_INCLUDE="test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor"
default_ut_09_24.log
"core" distributed UTs: distributed/test_c10d_common distributed/test_c10d_nccl distributed/test_distributed_spawn
distributed_ut_09_24.log
Wheels build job: http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/385/