Add native Windows support for Triton-XDNA #35
Pull request overview
This PR adds native Windows support for the Triton-XDNA toolchain by updating build/package logic, compiler/driver path handling, and providing Windows-focused scripts and documentation.
Changes:
- Add PowerShell scripts for Windows environment setup and end-to-end builds (MSVC + Ninja).
- Extend packaging/build logic to recognize Windows (exe/pyd naming, disabling clang+lld flags on Windows).
- Update the NPU backend driver/compiler to handle Windows paths, extensions, and XRT SDK detection; document Windows usage in README.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| utils/env_setup.ps1 | Installs pinned mlir-aie/llvm-aie/mlir-air and sets PATH/PYTHONPATH; adds basic XRT dev dir auto-detection. |
| utils/build_windows.ps1 | Automated Windows build script (wheel installs, MLIR distro download, mlir-air build with MSVC, editable install). |
| setup.py | Allows Windows platform, adjusts binary naming, and disables clang+lld env flags on Windows. |
| README.md | Adds Windows Support section with requirements, build steps, env vars, and limitations. |
| pyproject.toml | Adds Windows classifier. |
| CMakeLists.txt | Avoids setting clang+lld build env var on Windows. |
| amd_triton_npu/backend/driver.py | Windows-aware tool/binary discovery, XRT detection, JIT compilation changes (MSVC), and .pyd dispatch module caching. |
| amd_triton_npu/backend/compiler.py | Adds .exe suffix handling for Windows binaries. |
| .gitignore | Ignores common Windows build artifacts (.obj, .lib, .dll, .pyd, etc.). |
| compile_flags = [ | ||
| "cl.exe", | ||
| "/std:c++latest", | ||
| "/Zc:__cplusplus", | ||
| "/EHsc", | ||
| "/LD", | ||
| f"/Fe:{so_path}", |
The driver invokes cl.exe directly for JIT compilation. On most Windows setups, cl.exe is not on PATH unless the process is launched from a VS Developer Command Prompt (vcvars). Consider detecting cl.exe (e.g., via vswhere) and invoking it through vcvars, or raising a clear error that explains how to set up the MSVC environment; otherwise native Windows runs will fail at runtime.
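The detection idea in this comment could look roughly like the sketch below. It is only an illustration, not the code in this PR: the function name `find_msvc_cl` and its parameters are hypothetical, and it assumes vswhere.exe sits in its usual Visual Studio Installer directory and that a 64-bit host/target toolchain is wanted.

```python
import os
import shutil
import subprocess
from pathlib import Path


def find_msvc_cl(which=shutil.which, vswhere=None):
    """Return a path to cl.exe, or raise RuntimeError with setup hints.

    Hypothetical helper: `which` and `vswhere` are injectable so the lookup
    can be exercised without a Visual Studio installation.
    """
    # 1. cl.exe is already on PATH (e.g. inside a VS Developer Command Prompt).
    found = which("cl.exe")
    if found:
        return found

    # 2. Ask vswhere for the newest installation that ships the C++ tools.
    if vswhere is None:
        pf86 = os.environ.get("ProgramFiles(x86)", r"C:\Program Files (x86)")
        vswhere = os.path.join(
            pf86, "Microsoft Visual Studio", "Installer", "vswhere.exe"
        )
    if os.path.isfile(vswhere):
        result = subprocess.run(
            [
                vswhere, "-latest", "-products", "*",
                "-requires", "Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
                "-property", "installationPath",
            ],
            capture_output=True, text=True, check=False,
        )
        lines = result.stdout.strip().splitlines()
        if lines:
            # Toolset directories are versioned; a lexicographic sort is a
            # simplification that is good enough for a sketch.
            pattern = "VC/Tools/MSVC/*/bin/Hostx64/x64/cl.exe"
            candidates = sorted(Path(lines[0]).glob(pattern))
            if candidates:
                return str(candidates[-1])

    # 3. Nothing found: fail early with an actionable message.
    raise RuntimeError(
        "cl.exe not found. Install the Visual Studio C++ build tools, or run "
        "from a VS Developer Command Prompt so that cl.exe is on PATH."
    )
```

Failing here, before the compile step, gives the user a setup hint instead of a bare "file not found" from the subprocess call.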
| for p in default_paths: | ||
| # Prefer directories that contain the development files | ||
| # (include/xrt/ and lib/) needed for JIT compilation | ||
| if os.path.isdir(os.path.join(p, "include", "xrt")): | ||
| return p | ||
|
|
||
| # Fallback: accept any existing path (runtime-only SDK) | ||
| for p in default_paths: | ||
| if os.path.isdir(p): | ||
| return p |
_get_xrt_path() is documented as returning the XRT development directory (headers + import lib), but the Windows fallback explicitly returns a runtime-only directory even when it lacks include/xrt. Since later compilation unconditionally uses ${xrt_dir}\include and ${xrt_dir}\lib, returning a runtime-only path will lead to a harder-to-diagnose compiler error. Consider keeping the fallback strict (require include/xrt and lib) and raising a targeted exception if only a runtime installation is found.
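A strict variant of the lookup, as suggested here, might be sketched as follows. The names `find_xrt_dev_dir` and `XrtSdkNotFoundError` are hypothetical and the candidate list is supplied by the caller; the only facts taken from the discussion are the `include/xrt` and `lib` directories that compilation needs.

```python
import os


class XrtSdkNotFoundError(RuntimeError):
    """Raised when no XRT directory with development files is available."""


def find_xrt_dev_dir(candidates):
    """Return the first candidate that has both include/xrt and lib."""
    runtime_only = []
    for path in candidates:
        if not os.path.isdir(path):
            continue
        has_headers = os.path.isdir(os.path.join(path, "include", "xrt"))
        has_libs = os.path.isdir(os.path.join(path, "lib"))
        if has_headers and has_libs:
            return path
        # The directory exists but cannot be used for JIT compilation.
        runtime_only.append(path)

    if runtime_only:
        raise XrtSdkNotFoundError(
            f"Found a runtime-only XRT installation at {runtime_only[0]}; "
            "JIT compilation needs include/xrt and lib from the XRT SDK."
        )
    raise XrtSdkNotFoundError(
        "No XRT installation found in: " + ", ".join(candidates)
    )
```

Separating the two failure cases lets the message say whether XRT is missing entirely or only its development files are.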
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…path fixes
- _get_xrt_path(): Validate SDK has include/xrt and lib/ directories; raise targeted error if only runtime-only installation found
- _find_msvc_cl(): Auto-detect cl.exe via vswhere when not on PATH; clear error with setup instructions if MSVC not found
- _get_msvc_env(): Derive INCLUDE/LIB/PATH from cl.exe location and Windows SDK when not running from VS Developer Command Prompt
- Strip Windows \\?\ long-path prefix from cache paths before passing to JIT-compiled module
- Simplify XRT path detection: auto-detect C:\Program Files\AMD\xrt
- Clean up debug prints in ELF launcher to use verbosity guard
- Update README.md, env_setup.ps1, build_windows.ps1 for simplified XRT setup (no XILINX_XRT env var required)
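For the long-path item in this commit, a minimal sketch of stripping the prefix is shown below. The function name is hypothetical and this is not the PR's implementation; it only illustrates the string handling, including the UNC form of the prefix.

```python
def strip_long_path_prefix(path: str) -> str:
    """Remove the Windows extended-length prefix from a path string."""
    unc_prefix = "\\\\?\\UNC\\"   # \\?\UNC\server\share\...
    plain_prefix = "\\\\?\\"      # \\?\C:\...
    if path.startswith(unc_prefix):
        # Extended UNC paths map back to the ordinary \\server\share form.
        return "\\\\" + path[len(unc_prefix):]
    if path.startswith(plain_prefix):
        return path[len(plain_prefix):]
    return path
```

Paths without the prefix are returned unchanged, so the helper can be applied to every cache path before it is handed to the compiled module.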
| $mlirWhl = Get-ChildItem "$mlirWheelDir\mlir-*.whl" | Select-Object -First 1 | ||
| if (-not $mlirWhl) { | ||
| Write-Error "No MLIR wheel matching 'mlir-*.whl' was found in $mlirWheelDir." | ||
| Write-Host "Contents of $mlirWheelDir:" -ForegroundColor Yellow |
At D:\triton-xdna\utils\build_windows.ps1:235 char:37
+ Write-Host "Contents of $mlirWheelDir:" -ForegroundColor ...
+ ~~~~~~~~~~~~~~
Variable reference is not valid. ':' was not followed by a valid variable name character. Consider using ${} to
delimit the name.
+ CategoryInfo : ParserError: (:) [], ParseException
+ FullyQualifiedErrorId : InvalidVariableReferenceWithDrive
|
Attempting to build locally failed because it could not find |
|
Yeah, I kinda gave up after finding out that driver support wasn't even ready for windows yet. |
| exit 1 | ||
| } else { | ||
| Write-Host " Extracting $($mlirWhl.Name)..." | ||
| Expand-Archive $mlirWhl.FullName -DestinationPath $BuildDir -Force |
| Expand-Archive $mlirWhl.FullName -DestinationPath $BuildDir -Force | |
| python -m zipfile --extract $mlirWhl.FullName $BuildDir |
Ok, well I encountered a lot of problems trying to use |
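The suggested change swaps Expand-Archive for Python's zipfile module. A wheel is an ordinary zip archive, and Expand-Archive typically rejects files that do not end in .zip, which may be the motivation. The same step can be sketched in Python as below; the function name `extract_wheel` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_wheel(wheel_path, dest_dir):
    """Extract a .whl (a zip archive) into dest_dir and return its members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(wheel_path) as whl:
        whl.extractall(dest)
        return whl.namelist()
```

This does what `python -m zipfile --extract <wheel> <dir>` does from the command line, without depending on the archive's file extension.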
|
Ah, sorry. I might revisit it in a bit, just don't have much time at the moment. |
Hi @rwfsmith, thanks for working on this! Have you tried installing the Windows driver from https://ryzenai.docs.amd.com/en/latest/inst.html? |
|
This XRT SDK might be helpful for elf format: https://github.com/Xilinx/XRT/releases/tag/2.21.75 |
|
WiP for Windows on mlir-air: we are working towards releasing Windows wheels from mlir-air Xilinx/mlir-air#1491 which we'd like to use in triton-xdna on Windows. We need to proceed to verify the Windows wheel generated from mlir-air, but I'm posting it here for information. |
This is the one that I was trying to use. It seems the issue I was hitting is related to an overload that is missing from xrt_core.dll, which comes from the Windows NPU driver package; the overload is available in the Linux XDNA driver.
|
The Windows build is working for me in PowerShell 7, but when running a test I get this error:
(venv) (base) PS C:\projects\Triton-XDNA\examples\vec-add> python vec-add.py
/std:c++latest is provided as a preview of language features from the latest C++
main.cxx
/dll
****** Bootgen v2023.2
[INFO] : Bootimage generated successfully
****** Bootgen v2023.2
[INFO] : Bootimage generated successfully
****** Bootgen v2023.2
[INFO] : Bootimage generated successfully
Compilation completed successfully
This completes successfully when I set this, though:
$env:AMD_TRITON_NPU_OUTPUT_FORMAT="xclbin"
(venv) (base) PS C:\projects\Triton-XDNA\examples\vec-add> python vec-add.py
In our compiler pipeline, we currently support two output binary formats: "full-elf" (default) and "xclbin". The "full-elf" mode reconfigures the device faster than "xclbin" because it uses the hardware microcontroller rather than the XRT software mechanism. We will continue to investigate the Windows XRT SDK with "full-elf".
This switches back to the "xclbin" mode. The printout indeed looks promising, as it indicates that the kernel is running on hardware and passing the correctness check against the torch CPU reference. Thank you for your interest in our project and for working on the Windows support! Really appreciate it.
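How a backend might read that setting is sketched below. The environment variable name and the two format names come from this thread; the helper `resolve_output_format` and its validation are assumptions and do not describe the actual Triton-XDNA code.

```python
import os

# Format names as described in the discussion; "full-elf" is the default.
VALID_FORMATS = ("full-elf", "xclbin")


def resolve_output_format(env=None):
    """Return the requested output format, defaulting to "full-elf"."""
    env = os.environ if env is None else env
    fmt = env.get("AMD_TRITON_NPU_OUTPUT_FORMAT", "full-elf")
    if fmt not in VALID_FORMATS:
        raise ValueError(
            f"Unsupported AMD_TRITON_NPU_OUTPUT_FORMAT {fmt!r}; "
            f"expected one of {VALID_FORMATS}"
        )
    return fmt
```

From Python, the workaround amounts to setting `os.environ["AMD_TRITON_NPU_OUTPUT_FORMAT"] = "xclbin"` before the kernel is compiled.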
|
The build process will be much smoother after all of the dependencies have pre-built wheels available, but the Triton-XDNA project seems promising, and I look forward to full Windows support :). |
Yesterday we released the first mlir-air Windows wheels, which are expected to work with the latest mlir-aie Windows wheels. I haven't had a chance to verify that wheel combination yet, but I thought I'd mention them here in case you'd like to try them out.
It looks like the Windows wheels all contain .pyd files built for Python 3.14 instead of the version matching the wheel. This does not appear to be the case for the Linux wheels. The 3.12 package version is installed with the below results: There doesn't appear to even be a 3.14 Windows wheel to use as a workaround.
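A quick way to spot this kind of mismatch is to compare the tag embedded in each .pyd file name with the Python tag the wheel claims to target. The sketch below assumes the usual CPython naming on Windows (`name.cp312-win_amd64.pyd`); the function name is hypothetical.

```python
import zipfile


def mismatched_extension_modules(wheel_path, python_tag):
    """List .pyd members whose interpreter tag differs from python_tag.

    python_tag is e.g. "cp312". Untagged files such as "foo.pyd" are skipped
    because they are not tied to one interpreter version by name.
    """
    mismatched = []
    with zipfile.ZipFile(wheel_path) as whl:
        for member in whl.namelist():
            if not member.endswith(".pyd"):
                continue
            # "_air.cp314-win_amd64.pyd" -> ["_air", "cp314-win_amd64", "pyd"]
            parts = member.rsplit("/", 1)[-1].split(".")
            if len(parts) < 3:
                continue
            tag = parts[-2].split("-")[0]
            if tag != python_tag:
                mismatched.append(member)
    return mismatched
```

An empty result for a cp312 wheel would suggest the extension modules at least carry the expected suffix.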
|
Was the py_limited_api intended to be used? I've seen this before if that's the case and it was due to the
I'm going to try building again tomorrow. When triton-xdna is built, does it get tied to the version of pytorch it was built with, or can I build it with torch 2.10 and then use 2.12 (nightly), for example? I want to use it with rocm torch, so I'm just curious if I'll need to force it to build with a rocm torch or not. (If you don't know, it's not really important and I'm sure what I'm asking isn't exactly clear either.)
It doesn't seem to pin a specific version of torch when built. I built with torch 2.10+ROCm from TheRock nightlies, removed and installed plain torch 2.11 and the xclbin test worked fine. |
|
Hi @rwfsmith, good catch! The issue was that our wheel build's CMake configuration set Fixed in Xilinx/mlir-air#1497. New wheels with correct suffixes are now available at https://github.com/Xilinx/mlir-air/releases/tag/latest-air-wheels-no-rtti. |
…n_shared backend copy in env_setup.ps1
- Add utils/mlir-aie-hash-windows.txt, mlir-air-hash-windows.txt, llvm-aie-hash-windows.txt with pinned wheel versions for Windows (cp312 wheels from their respective --find-links pages)
- setup.py: use platform-specific hash files on Windows; add triton-windows to install_requires; add _is_triton_installed() and _copy_backend_to_triton() helpers
- env_setup.ps1: after pip install -e . --no-build-isolation, explicitly copy third_party/triton_shared/backend into the installed triton's backends/ directory, since setuptools build_meta does not invoke the custom TritonXdnaDevelop cmdclass (fixes: ModuleNotFoundError: No module named 'triton.backends.triton_shared')
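The backend copy described in this commit could be sketched in Python as follows. This is an illustration under assumptions, not the `_copy_backend_to_triton()` helper from the PR: the function name, its parameters, and the way the triton package directory is located are all hypothetical.

```python
import importlib.util
import shutil
from pathlib import Path


def copy_backend_to_triton(src_backend, triton_root=None, name="triton_shared"):
    """Copy a backend directory into <triton>/backends/<name>."""
    if triton_root is None:
        # Locate the installed triton package without importing it.
        spec = importlib.util.find_spec("triton")
        if spec is None or not spec.submodule_search_locations:
            raise RuntimeError("triton is not installed in this environment")
        triton_root = list(spec.submodule_search_locations)[0]

    dest = Path(triton_root) / "backends" / name
    # dirs_exist_ok makes repeated runs overwrite files instead of failing.
    shutil.copytree(src_backend, dest, dirs_exist_ok=True)
    return dest
```

Passing `triton_root` explicitly makes it possible to try the copy against a scratch directory before touching a real installation.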
|
Thanks for reporting this issue, @astrelsky. It seems to indicate that AIR dialect was not properly registered in |
Oh that's interesting. I'll give it another go tomorrow, or whenever that lands, and report back. |
Ok we have landed the fix in mlir-air, and also updated the mlir-air hash in #37. I hope that this patch helps fix the issue you encountered. |
It did fix that problem. There are some problems with jit compilation though. I only updated the mlir hash and wheel and nothing else yet due to local changes to use triton-windows instead of triton as necessary. I'll pull in the latest updates in an hour or two and try again. I should be able to work through the jit problems and determine the cause if it's not fixed in changes I haven't pulled in yet. |
|
Ok @erwei-xilinx, so the jit error was an easy fix and could be related to changes in this pull request or might not be (I'm not sure if I should have this or abandon them). The jit compilation error was due to I was getting other errors, but on a hunch I decided to copy over the other The mention of |
Thanks for the debug info, @astrelsky. This is very useful. I think you are now hitting the same issue that @rwfsmith reported above: the Windows XRT SDK returns an error with the "full-elf" binary format. @rwfsmith mentioned that they were able to get around that issue with:
Which sets the Triton-XDNA compiler to generate the "xclbin" output format instead.
The lib folder was not included in the
Isn't this fun? 😂 |
|
Thanks @astrelsky. Re: boost_program_options.lib missing. We investigated and confirmed that neither mlir-aie, mlir-air, nor Triton-XDNA actually depend on boost. The -lboost_program_options -lboost_filesystem link flags in our JIT code need to be cleaned up. So this error shouldn't block you, and I think it could work by just removing those link flags. |
It is still missing Also I confirmed the jit compilation doesn't require being in the developer command prompt, so you can ignore that. |
| compile_flags = [ | ||
| cl_path, | ||
| "/std:c++latest", | ||
| "/Zc:__cplusplus", |
| "/Zc:__cplusplus", | |
| "/MD", | |
| "/Zc:__cplusplus", |
This is needed to link with the provided test_utils.lib in mlir-aie during jit compilation.
| "boost_program_options.lib", | ||
| "boost_filesystem.lib", |
| "boost_program_options.lib", | |
| "boost_filesystem.lib", |
These have been removed.
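Taken together, the two suggestions above (add /MD, drop the boost libraries) lead to a command line along the lines of the sketch below. The function `build_cl_command` and its parameters are hypothetical; the flags are standard MSVC options, and the real driver may order or extend them differently.

```python
def build_cl_command(cl_path, source, out_path, include_dirs, lib_dirs, libs):
    """Assemble a cl.exe command that builds a DLL from one source file."""
    cmd = [
        cl_path,
        "/std:c++latest",
        "/Zc:__cplusplus",   # report the real C++ version in __cplusplus
        "/EHsc",             # standard C++ exception handling
        "/MD",               # dynamic CRT, per the review suggestion
        "/LD",               # build a DLL
        source,
        f"/Fe:{out_path}",   # output file name
    ]
    cmd += [f"/I{d}" for d in include_dirs]
    # Import libraries go on the command line; no boost libraries are listed.
    cmd += list(libs)
    # Everything after /link is passed to the linker.
    cmd += ["/link"] + [f"/LIBPATH:{d}" for d in lib_dirs]
    return cmd
```

The returned list can be passed to `subprocess.run` directly, which avoids quoting problems with paths such as `C:\Program Files\AMD\xrt`.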
|
@erwei-xilinx I took the I'm attaching the full test logs below, it's large. I saw some memory allocation error in there about not having enough available, which is complete nonsense since I have 128GB with 63.6GB available to the NPU. Memory allocation on the NPU is currently beyond my knowledge; the logs indicate it is more complicated than I believed when I initially made that comment.
Thanks for reporting your progress! Failure Let me take a closer look at the timeout one. |
|
@astrelsky The timeout seems to be due to the previous failed matmul run leaving the NPU hardware in a bad state (hw context). My repeated local tests with Btw I also updated the mlir-aie hash to point to the fixed wheel now #39 |
I will run it again by itself this afternoon. |
I ran the tests for big log |
|
fwiw, this test does take a long time. The cpu is churning the whole time (low % usage of less than 10% but high boost frequency around 4.50GHz). When running all the tests yesterday, I think the highest % usage of the NPU was around 4%. It wasn't 0 and was definitely seeing some usage. Now I'm questioning if this is the expected behavior or not and if maybe the cpu is doing work the npu should be doing? It's hard to rationalize since I think it's supposed to be a comparison of output results between the cpu and npu. |
Thanks for the detailed information. The warning messages Looking at the log, I can see the smaller GEMM shapes finish end-to-end on NPUs, but the big ones, once triggering the warning message, seem to fail. The extra CPU usage is definitely not the expected behavior. Let me debug this, and investigate if this is specific to Windows/Linux XRT driver difference. |
I had noticed the same thing when I was testing it. I had a second branch with a lot of performance changes to it, which managed to get the NPU usage up around 80% or so and had a full LLM inference benchmark, but it was still only getting around 4 tokens per second. |
|
Thanks for the details on the i8 matmul example issue, @astrelsky and @rwfsmith. Let's move the discussion around that error to this issue: #50 |
|
Hi @rwfsmith and @astrelsky, shall we proceed to rebase the PR to main and work towards getting it landed? Happy to provide any help I can. |
Ah, sorry, I've been in the middle of a move and my machines aren't set up yet. I should have it all set up later this week and can work on this some more. |
You'll want to have a look at this here. I'm not sure how you initially proceeded with getting it to build without using triton-windows, but I think you may have problems using normal triton on windows. Using triton-windows will ensure that the rest of triton works too, so it can also be used with rocm. The only other thing I found that isn't in the patches above is that the mlir files aren't added to the wheel. I'll put a patch here for that tomorrow night and then you should be able to rebase it and add those changes and it should work. I'll need to figure out how they get added in the Linux wheels though since they didn't seem to be added on windows using |
If you need/want me to take over let me know. I'll open another pull request with your commits and put necessary changes on top so it will still show as your contribution. I won't be as available to discuss during the day because I'm being forced back into the office but when I get home from work I can do whatever is necessary. |
That would probably be a good idea. It's taking me a bit longer than I expected to get set up in the new place. I don't get forced back to the office until July. Was supposed to be February, but we don't have enough office space so we need to get moved to a new building. |
That was my exact situation for 1.5 years 😂. I'll open a new pull request tonight. |
I have to push it off until this weekend because I hit unexpected problems and am now very cranky by the time I get home 😂. I should be able to figure it out Saturday morning though. |

Platform changes (driver.py, setup.py, CMakeLists.txt):
- MSVC JIT compilation (cl.exe) replacing GCC for NPU dispatch shims
- Windows path handling throughout compiler and driver (.exe suffixes, path separators)
- Dynamic library extension detection (.pyd on Windows, .so on Linux)
- XRT SDK auto-detection via XILINX_XRT env var with Program Files fallback
- Removed Linux-only platform gate; added Windows to supported OS list
Build infrastructure (new files):
- build_windows.ps1: End-to-end automated build script
- env_setup.ps1: Windows environment setup with pinned wheel versions
Documentation:
- Added Windows Support section to README.md (requirements, build steps, environment variables, known limitations)
Tested on:
Known issues:
- Seems like the Windows NPU driver is missing an overload to work with ELF, but it works with xclbinutil