Add Windows wheel release job to nightly-wheels CI #64

Merged: erwei-xilinx merged 9 commits into main from add-windows-wheels-release on May 8, 2026

@erwei-xilinx erwei-xilinx commented May 8, 2026

Collaborator

Summary

  • Adds a parallel build-wheels-windows job to nightly-wheels.yml that builds and releases triton-xdna Windows x64 wheels alongside the existing Linux job. Both publish to the same latest-wheels release tag.
  • Python matrix is 3.10 / 3.11 / 3.12 — narrower than the Linux matrix (3.10–3.14) because Xilinx publishes mlir-air Windows wheels only for those three Pythons today.
  • XRT Windows SDK is downloaded from the pinned Xilinx/XRT 2.21.75 release (xrt_windows_sdk.zip) and extracted to C:\Program Files\AMD\xrt, where the existing Windows build infrastructure (utils/env_setup.ps1, setup.py) expects it.
  • Fixes three stale README items uncovered while writing the workflow:
    • Quick Start referenced .\utils\build_windows.ps1 which was removed during the windows-build-minimal PR cleanup; replaced with the env_setup.ps1 path that does exist.
    • Manual Build claimed mlir-air must be built from source; replaced with the mlir_air[aie] pip-install command (matches what env_setup.ps1 actually does).
    • Python version requirement updated from "3.12+" to the accurate "3.10, 3.11, or 3.12" range.

Caveats reviewers should know

  • Untested in CI. I can't run windows-latest from a Linux dev box. The first nightly is likely to need at least one fix — common things to watch for: MSVC version mismatch with triton-windows's expected toolchain, delvewheel repair failing on missing DLLs, or paths in setup.py:_build_triton_windows needing tweaks under cibuildwheel's environment.
  • No repair-wheel-command set under [tool.cibuildwheel.windows] in pyproject.toml. cibuildwheel runs delvewheel repair by default; if our wheel bundles XRT-linked binaries we may need exclusions analogous to the Linux auditwheel step. Worth watching the first run's logs.
  • setup_xrt_dev.ps1 is intentionally not invoked — that script is for stripping headers + generating .lib from a runtime-only xrt_coreutil.dll (the Ryzen AI SDK case). The full xrt_windows_sdk.zip already includes both, so it's not needed here.

Test plan

  • CI: green Linux build job (regression check — should be unaffected)
  • CI: green Windows build job for cp310/cp311/cp312
  • Verify Windows wheels appear on the latest-wheels release alongside Linux wheels
  • On a Windows machine, install the released wheel and run an example kernel

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 8, 2026 06:04

Copilot AI left a comment

Contributor

Pull request overview

This PR extends the nightly wheels CI workflow to build and publish Windows x64 wheels in parallel with the existing Linux wheel builds, and updates the README’s Windows instructions to match the intended Windows wheel availability and setup flow.

Changes:

  • Adds a build-wheels-windows matrix job (cp310/cp311/cp312) to .github/workflows/nightly-wheels.yml, including XRT Windows SDK download/extract and wheel publishing to the shared latest-wheels release tag.
  • Updates README Windows requirements and setup steps (Python version range, Quick Start script path, and mlir-air install instructions).
  • Updates Windows “Known Limitations” documentation to reflect Python version constraints for Xilinx Windows wheels.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| README.md | Updates Windows requirements and build/setup instructions to align with Windows wheel support. |
| .github/workflows/nightly-wheels.yml | Adds a Windows wheel build + release job alongside the existing Linux workflow. |

Comment thread README.md
Comment thread README.md
Comment thread .github/workflows/nightly-wheels.yml
erwei-xilinx added a commit that referenced this pull request May 8, 2026
Three Copilot review comments on PR #64:

1. env_setup.ps1 referenced non-existent hash files
   (mlir-aie-hash-windows.txt / llvm-aie-hash-windows.txt /
   mlir-air-hash-windows.txt). Rewrote to mirror env_setup.sh's
   pattern: use the existing mlir-aie-hash.txt and mlir-air-hash.txt,
   and pull llvm-aie latest nightly. Factored hash-field parsing
   into a Read-HashField helper. The script that the README's Quick
   Start instructs users to run now actually works.

2. env_setup.ps1 header said "Python 3.12 (required)" while the
   README now says "3.10, 3.11, or 3.12". Aligned the script header
   with the README.

3. Five Linux + three Windows matrix jobs all racing to update the
   same release tag risked partial uploads / overwritten artifacts.
   Removed the per-matrix-job Release wheels step and added a single
   publish-release job that:
   - needs: [build-wheels, build-wheels-windows]
   - downloads all triton-wheel-* artifacts via download-artifact
     with merge-multiple
   - calls ncipollo/release-action exactly once with the combined
     wheelhouse
   - tolerates partial build failures (publishes if at least one
     matrix succeeded)
   The release body is also unified into a single description with
   both Linux and Windows install snippets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
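
The single-publisher idea in item 3 can be modeled outside the workflow as a small sketch. The real job uses download-artifact with merge-multiple and ncipollo/release-action; the function names `collect_wheels` and `should_publish` below are hypothetical and only illustrate "gather everything once, publish if at least one wheel exists".

```python
from __future__ import annotations

import shutil
from pathlib import Path


def collect_wheels(artifact_root: Path, wheelhouse: Path) -> list[Path]:
    """Copy every .whl found under artifact_root into one flat wheelhouse.

    Assumes wheelhouse is not located inside artifact_root.
    """
    wheelhouse.mkdir(parents=True, exist_ok=True)
    collected = []
    # sorted() materializes the search before any copy happens.
    for wheel in sorted(artifact_root.rglob("*.whl")):
        target = wheelhouse / wheel.name
        shutil.copy2(wheel, target)
        collected.append(target)
    return collected


def should_publish(wheels: list[Path]) -> bool:
    """Publish when at least one matrix job produced a wheel."""
    return len(wheels) > 0
```

Because only one caller ever touches the release, a failed matrix entry reduces the set of wheels rather than leaving a half-updated release.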
@erwei-xilinx erwei-xilinx force-pushed the add-windows-wheels-release branch from 14e7a42 to 0363c97 on May 8, 2026 16:18
erwei-xilinx and others added 9 commits May 8, 2026 09:18
Adds a parallel build-wheels-windows job that builds and releases
triton-xdna Windows x64 wheels alongside the existing Linux job.
Wheels land on the same latest-wheels release tag.

The Python matrix is capped at 3.10/3.11/3.12 because Xilinx publishes
mlir-air Windows wheels only for those versions (the Linux matrix runs
3.10-3.14). The XRT Windows SDK is downloaded from the pinned 2.21.75
release and extracted to C:\Program Files\AMD\xrt where the existing
Windows build infrastructure (utils/env_setup.ps1, setup.py) expects it.

Also fixes three stale README items uncovered while writing the workflow:
- Quick Start referenced .\utils\build_windows.ps1 which no longer
  exists; replaced with the env_setup.ps1 path that does
- Manual Build claimed mlir-air must be built from source; replaced
  with the mlir_air[aie] pip install command
- Python version requirement updated from "3.12+" to the accurate
  "3.10, 3.11, or 3.12" range

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The xrt_windows_sdk.zip top-level is xrt_sdk/xrt/ (not xrt/), so
extracting directly to C:\Program Files\AMD\ produced
C:\Program Files\AMD\xrt_sdk\xrt\... and the build couldn't find
the headers at C:\Program Files\AMD\xrt\include\xrt\xrt_bo.h.

Extract to RUNNER_TEMP and move the inner xrt_sdk/xrt/ folder to
the expected destination. Also corrects the README's manual-install
instructions which had the same wrong assumption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
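
The extract-then-move fix described above can be sketched in Python. The workflow itself does this in its own steps; the helper name `install_nested_sdk` is hypothetical, and only the `xrt_sdk/xrt` inner path comes from the commit message.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_nested_sdk(archive: Path, dest: Path, inner: str = "xrt_sdk/xrt") -> Path:
    """Extract archive to a scratch dir, then move its inner folder to dest."""
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(scratch)
        source = Path(scratch) / inner
        if not source.is_dir():
            raise FileNotFoundError(f"{inner!r} not found inside {archive}")
        # Replace any previous install so stale files do not linger.
        if dest.exists():
            shutil.rmtree(dest)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(dest))
    return dest
```

Checking for the inner folder explicitly turns a layout change in a future SDK zip into an immediate error instead of a missing-header failure later in the build.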
Windows can't replace the running pip.exe wrapper while it holds the
file open. The previous "pip install --upgrade pip" line failed with:
  ERROR: To modify pip, please run the following command:
  ...python.exe -m pip install --upgrade pip

Use "python -m pip install --upgrade pip" so the upgrade runs through
the Python interpreter rather than the locked wrapper. Linux is
unaffected and keeps its existing invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
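
For scripts that upgrade pip programmatically, the same rule applies: go through the interpreter rather than the pip wrapper. A minimal sketch, with hypothetical helper names:

```python
import subprocess
import sys


def pip_upgrade_command(python: str = sys.executable) -> list:
    """Build the interpreter-based upgrade command (no pip.exe wrapper involved)."""
    return [python, "-m", "pip", "install", "--upgrade", "pip"]


def upgrade_pip() -> None:
    """Run the upgrade with the interpreter that is executing this script."""
    subprocess.run(pip_upgrade_command(), check=True)
```

Using `sys.executable` also guarantees that the pip being upgraded belongs to the environment the script is running in.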
cibuildwheel does not call vcvars64.bat on Windows; it relies on the
host having MSVC on PATH already. github-hosted windows-latest images
ship MSVC but leave the developer command prompt environment inactive,
so cmake fails to find cl.exe / INCLUDE / LIB during configuration:

  CMake Error: Could not find compiler set in environment variable CXX:
  cl.exe.
  CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage

Add ilammy/msvc-dev-cmd@v1 (the standard third-party action that wraps
vcvars64.bat) before the cibuildwheel step. It exports the resulting
PATH/INCLUDE/LIB to GITHUB_ENV so cibuildwheel's spawned subprocess
inherits a working MSVC toolchain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
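
A quick way to confirm the toolchain environment is active before a long build is a preflight check along these lines. This is a hypothetical diagnostic, not part of the PR; it only inspects the three things the commit message names (cl.exe on PATH, INCLUDE, LIB).

```python
import os
import shutil


def missing_msvc_pieces(env=None):
    """Return a list of MSVC environment pieces that appear to be missing.

    env defaults to os.environ; a plain dict is case-sensitive, unlike
    os.environ on Windows, so pass upper-case keys when supplying one.
    """
    env = os.environ if env is None else env
    missing = []
    if shutil.which("cl", path=env.get("PATH", "")) is None:
        missing.append("cl.exe on PATH")
    for var in ("INCLUDE", "LIB"):
        if not env.get(var):
            missing.append(var)
    return missing
```

An empty list means the developer-prompt variables are visible to the current process, which is what a subprocess spawned from it would inherit.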
setup.py's download_llvm_for_triton_windows() unconditionally passed
filter="data" to TarFile.extractall(). That kwarg was added in Python
3.12 (PEP 706) and backported to 3.10.12 / 3.11.4, but cibuildwheel's
bundled nuget-cpython for cp310/cp311 isn't always a backport-bearing
patch level. The Windows wheel build failed during LLVM extraction:

  TypeError: TarFile.extractall() got an unexpected keyword argument
  'filter'

Guard the kwarg behind sys.version_info >= (3, 12). Keeps the PEP 706
security filter on 3.12+, falls back to the plain call on older
Pythons regardless of patch level.

Linux is unaffected because this code path is Windows-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
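
The guard described above looks roughly like the following sketch. The function name `extract_tarball` is hypothetical; the actual code lives in setup.py's download_llvm_for_triton_windows().

```python
import os
import sys
import tarfile


def extract_tarball(archive_path, dest_dir):
    """Extract a tarball, using the PEP 706 'data' filter where it is guaranteed."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        if sys.version_info >= (3, 12):
            # The filter keyword exists on every 3.12+ release.
            tar.extractall(dest_dir, filter="data")
        else:
            # Older interpreters may lack the backport, so use the plain call.
            tar.extractall(dest_dir)
```

An alternative, not used in this PR, is the feature test `hasattr(tarfile, "data_filter")` suggested by the tarfile documentation, which would also keep the filter on 3.10 and 3.11 patch releases that carry the backport.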
On Windows runners (and local Windows clones), git checkout converts
text files to CRLF by default, but our patches in third_party/ were
generated with LF endings. The mismatch in context lines makes
git apply --check report "patch failed: X: patch does not apply"
even when the actual change is fine.

apply_patches.py was bailing on first conflict — meaning when the
Sanitizer/SanitizerAttributes/CMakeLists.txt hunk in
triton_shared.patch failed CRLF matching, the PtrAnalysis.cpp hunk
(which contains the size_t = ~0ULL fix that prevents an MSVC
narrowing-conversion error) never got applied either. The Windows
wheel build then died at compile time:

  PtrAnalysis.cpp(980): error C2397: conversion from 'int' to 'size_t'
  requires a narrowing conversion

Add --ignore-whitespace to all three git apply invocations
(check, reverse-check, real apply). git apply documents this flag
as making context-line matching tolerant of whitespace differences;
in practice CRLF vs LF falls under that tolerance. Linux and macOS
behavior is unchanged because their checkouts already match the
patch's LF endings.

Note: gitattributes can't fix this because they don't cross
submodule boundaries — the affected files live inside
third_party/triton_shared/, which is its own git repo with its own
attribute scope. A workflow-only fix (core.autocrlf=false) would
help CI but not local Windows developers; this fix helps both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
windows-latest runners default to core.autocrlf=true, which rewrites
text files to CRLF on checkout. third_party/triton_shared.patch's
context lines are LF, so the rewrite makes git apply --check fail
and apply_patches.py aborts. The build then dies at compile time:

  PtrAnalysis.cpp(980): error C2397: conversion from 'int' to 'size_t'
  requires a narrowing conversion

— ironically, the exact error the patch was supposed to fix.

The earlier --ignore-whitespace addition to apply_patches.py didn't
help because git's whitespace-tolerance flags only ignore spaces
and tabs in context-line matching, not CR characters.

Set core.autocrlf=false and core.eol=lf as the very first step
(before checkout) so the runner skips the conversion entirely.
This is a CI-only fix; local Windows developers still need to
configure their global git appropriately, but apply_patches.py's
--ignore-whitespace remains as a partial mitigation for them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
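
To check whether a checkout was converted, it can help to count line endings in the files a patch touches. The helper below is a hypothetical diagnostic, not part of apply_patches.py; it works on raw bytes so that Python's own newline translation does not hide the CR characters.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF and bare-LF line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # every CRLF also contains one LF
    return {"crlf": crlf, "lf": lf}


def was_converted(data: bytes) -> bool:
    """True if any line ends in CRLF, i.e. an LF-only patch context will not match."""
    return count_line_endings(data)["crlf"] > 0
```

Reading a file with `open(path, "rb")` and passing the bytes to `was_converted` is enough to tell a converted checkout from an LF one.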
The triton-xdna wheel inherits METADATA from the upstream triton /
triton-windows wheel, so before this change the published wheels
identified themselves with upstream values:

  Author: Philippe Tillet, Dian Wu
  Author-email: phil@openai.com, woctordho@outlook.com
  Home-page: https://github.com/woct0rdho/triton-windows

setup.py already rewrote Name and Version in the same loop; extend
it to also rewrite Author, Author-email, and Home-page so the wheel
self-identifies as the AMD project. Affects both the Linux and
Windows wheel pipelines (same setup.py code path runs for both).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
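
A header rewrite of this kind can be sketched as follows. This is not the loop from setup.py; the function name and the replacement values in the usage note are placeholders, and the sketch assumes the usual METADATA layout where a blank line separates headers from the long description.

```python
def rewrite_metadata(text: str, overrides: dict) -> str:
    """Replace selected header fields in a wheel METADATA file.

    Only lines before the first blank line are treated as headers, so a
    matching "Key:" inside the long description is left alone.
    """
    out = []
    in_headers = True
    for line in text.splitlines():
        if in_headers and line == "":
            in_headers = False
        if in_headers and ":" in line:
            key = line.split(":", 1)[0]
            if key in overrides:
                line = f"{key}: {overrides[key]}"
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, `rewrite_metadata(text, {"Author": "Example Author", "Home-page": "https://example.invalid"})` would rewrite those two fields and pass every other line through unchanged.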
@erwei-xilinx erwei-xilinx merged commit 93c8cc6 into main May 8, 2026
12 of 13 checks passed
@erwei-xilinx erwei-xilinx deleted the add-windows-wheels-release branch May 8, 2026 23:41
@astrelsky

Contributor

@erwei-xilinx fwiw I installed the released wheel and did not encounter any problems with the MSVC version (I only have the Visual Studio 2026 Build Tools installed now). I did encounter one unexpected failure, but when I reran that specific test it passed, so it's probably just a bit flaky and not much of a concern.

(venv) D:\winxdna>scripts\run_tests -v --device aie2p
Starting example test run...
Examples dir: D:\winxdna\examples
Target device: aie2p
Transform file: transform_aie2p.mlir
--------------------------------------------------
📁 Example: autotune-matmul
   ⏭️  SKIP: transform_aie2p.mlir not found for device aie2p

📁 Example: average_pool
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: average_pool.py
Command: D:\winxdna\venv\Scripts\python.exe average_pool.py
   ✅ PASS: average_pool.py

📁 Example: axpy
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: axpy.py
Command: D:\winxdna\venv\Scripts\python.exe axpy.py
   ✅ PASS: axpy.py

📁 Example: gelu
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: gelu.py
Command: D:\winxdna\venv\Scripts\python.exe gelu.py
   ✅ PASS: gelu.py

📁 Example: leaky_relu
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: leaky_relu.py
Command: D:\winxdna\venv\Scripts\python.exe leaky_relu.py
   ✅ PASS: leaky_relu.py

📁 Example: matmul_bf16_m64_n64_k64
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: matmul_bf16_m64_n64_k64.py
Command: D:\winxdna\venv\Scripts\python.exe matmul_bf16_m64_n64_k64.py
   ✅ PASS: matmul_bf16_m64_n64_k64.py

📁 Example: matmul_f32_m64_n32_k16_padded_atransposed
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: matmul_f32_m64_n32_k16_padded_atransposed.py
Command: D:\winxdna\venv\Scripts\python.exe matmul_f32_m64_n32_k16_padded_atransposed.py
stdout:
Mismatch at (470, 464): actual=0.0, expected=3947.583984375
Mismatch at (281, 388): actual=0.0, expected=4153.3486328125
Mismatch at (162, 92): actual=0.0, expected=4219.048828125
Mismatch at (375, 272): actual=0.0, expected=3944.739013671875
Mismatch at (47, 178): actual=0.0, expected=4247.2666015625
FAIL: 109/109 samples mismatched

   ❌ FAIL: matmul_f32_m64_n32_k16_padded_atransposed.py (exit code 1)

📁 Example: matmul_i8_m128_n64_k64
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: matmul_i8_m128_n64_k64.py
Command: D:\winxdna\venv\Scripts\python.exe matmul_i8_m128_n64_k64.py
   ✅ PASS: matmul_i8_m128_n64_k64.py

📁 Example: matmul_i8_m64_n64_k64
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: matmul_i8_m64_n64_k64.py
Command: D:\winxdna\venv\Scripts\python.exe matmul_i8_m64_n64_k64.py
   ✅ PASS: matmul_i8_m64_n64_k64.py

📁 Example: relu
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: relu.py
Command: D:\winxdna\venv\Scripts\python.exe relu.py
   ✅ PASS: relu.py

📁 Example: rms_norm
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: rms_norm.py
Command: D:\winxdna\venv\Scripts\python.exe rms_norm.py
stderr:
loc("-":83:11): error: application of transform.air.copy_to_dma expected to produce 1 results (actually produced 0).
loc("-":83:11): error: application of transform.air.copy_to_dma expected to produce 1 results (actually produced 0).

   ✅ PASS: rms_norm.py

📁 Example: sigmoid
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: sigmoid.py
Command: D:\winxdna\venv\Scripts\python.exe sigmoid.py
   ✅ PASS: sigmoid.py

📁 Example: silu
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: silu.py
Command: D:\winxdna\venv\Scripts\python.exe silu.py
   ✅ PASS: silu.py

📁 Example: swiglu
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: swiglu.py
Command: D:\winxdna\venv\Scripts\python.exe swiglu.py
   ✅ PASS: swiglu.py

📁 Example: test_layernorm
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: test_layernorm.py
Command: D:\winxdna\venv\Scripts\python.exe test_layernorm.py
   ✅ PASS: test_layernorm.py

📁 Example: test_softmax
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: test_softmax.py
Command: D:\winxdna\venv\Scripts\python.exe test_softmax.py
   ✅ PASS: test_softmax.py

📁 Example: vec-add
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: vec-add.py
Command: D:\winxdna\venv\Scripts\python.exe vec-add.py
   ✅ PASS: vec-add.py

📁 Example: weighted_rms_norm
   transform_aie2p.mlir detected; will set AIR_TRANSFORM_TILING_SCRIPT
   🔄 Running: weighted_rms_norm.py
Command: D:\winxdna\venv\Scripts\python.exe weighted_rms_norm.py
stderr:
Traceback (most recent call last):
  File "D:\winxdna\examples\weighted_rms_norm\weighted_rms_norm.py", line 95, in <module>
    bench_weighted_rms_norm(M, N, "test")
  File "D:\winxdna\examples\weighted_rms_norm\weighted_rms_norm.py", line 88, in bench_weighted_rms_norm
    torch.testing.assert_close(y, y_ref, atol=5e-1, rtol=1e-1)
  File "D:\winxdna\venv\Lib\site-packages\torch\testing\_comparison.py", line 1631, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 16384 (0.0%)
Greatest absolute difference: 1.5625 at index (30, 128) (up to 0.5 allowed)
Greatest relative difference: 0.158203125 at index (30, 128) (up to 0.1 allowed)

   ❌ FAIL: weighted_rms_norm.py (exit code 1)

--------------------------------------------------
Test Results:
  ✅ Passed:  15
  ❌ Failed:  2
  ⏰ Timeouts: 0
  ⏭️  Skipped: 1
  📊 Total:   17
💔 2 failed, 0 timed out
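
The weighted_rms_norm numbers in the log can be checked by hand. The sketch below assumes the documented torch.testing.assert_close criterion, |actual - expected| <= atol + rtol * |expected|, and that the reported relative difference is the absolute difference divided by |expected|; it does not import torch.

```python
def within_tolerance(abs_diff: float, expected_mag: float, atol: float, rtol: float) -> bool:
    """Closeness criterion assumed for assert_close (see lead-in)."""
    return abs_diff <= atol + rtol * expected_mag


# Values copied from the failing element at index (30, 128).
abs_diff = 1.5625
rel_diff = 0.158203125

# Recover |expected| from the two reported differences (roughly 9.88).
expected_mag = abs_diff / rel_diff

# Allowed deviation for that element with atol=5e-1, rtol=1e-1 (about 1.49).
allowed = 0.5 + 0.1 * expected_mag

print(within_tolerance(abs_diff, expected_mag, atol=0.5, rtol=0.1))
```

Under those assumptions the element misses the combined tolerance by a small margin, which is why a single element out of 16384 is enough to fail the test.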

erwei-xilinx added a commit that referenced this pull request May 11, 2026
…type

Replace post-bufferize linalg_promote (which leaks self-copies that crash
transform.air.copy_to_dma) with pre-bufferize bufferize_to_allocation +
promote_tensor for L1 staging, mirroring mlir-air xrt 43_triton_layernorm.
Eliminates "expected to produce 1 results (actually produced 0)" stderr
on aie2p reported in #64.
@erwei-xilinx

Collaborator Author

Thanks for the test run, @astrelsky! Two real bugs surfaced from this, plus one expected failure:

weighted_rms_norm (1/16384 elements): I investigated, and it turned out to be a real systematic bug, not BF16 drift. Fixed by #66, which refactors the script to follow the mlir-air xrt 43_triton_layernorm prototype and adds a hybrid linalg_promote for the W operand.

rms_norm stderr: Visible in your output, but the test passed, so it wasn't on your bug list. It was a real latent bug, though: linalg_promote emits self-copies that crash transform.air.copy_to_dma's 1-result contract. Fixed by #65 with the same refactor pattern.

matmul_f32_padded_atransposed (109/109 zeros): Expected on Windows for now; this is a known limitation while full-ELF support on Windows is still WIP.

3 participants