
Minimal modifications so SageAttention compiles with MS Visual Studio on Windows #323

Open
mengqin wants to merge 8 commits into thu-ml:main from mengqin:main

Conversation

@mengqin mengqin commented Dec 8, 2025

I made minimal modifications for the Windows build of SageAttention: only the top-level setup.py, the setup.py of SageAttention 3, and one header file in the SageAttention 3 sources. With these changes it compiles successfully in a Windows + VS2022 environment.

I successfully tested the build with torch 2.6.0-2.9.1 and CUDA 12.4-13.0.

@woct0rdho

Hi, did you successfully run sageattn3 on Windows with Blackwell GPU? Currently a lot of people are seeing errors like CUDA error: misaligned address, see the discussion at woct0rdho#42 .

Once this is fixed, I think it's straightforward to add Python ABI3 and libtorch stable ABI to sageattn3.

@mengqin mengqin commented Dec 14, 2025

> Hi, did you successfully run sageattn3 on Windows with Blackwell GPU? Currently a lot of people are seeing errors like CUDA error: misaligned address, see the discussion at woct0rdho#42 .
>
> Once this is fixed, I think it's straightforward to add Python ABI3 and libtorch stable ABI to sageattn3.

Yes. I've realized that although my branch's source compiles SageAttention3 under CUDA 12.8 + PyTorch 2.8, the result still crashes at runtime.
I fixed this error locally last week and ran some preliminary tests.

The problem is quite complex and triggers a series of issues:

  1. First, the missing /Zc:__cplusplus flag caused some macros in the CUTLASS library not to be enabled correctly. This is a classic problem: although MSVC was given the C++17 standard (/std:c++17), some C++17 features were incorrectly disabled by __cplusplus version checks, leading to data-alignment issues and ultimately the crash. This is the root cause.
  2. However, after enabling /Zc:__cplusplus, the problem became more complex. While this solved the issue on CUDA 12.8, it caused a build break on CUDA 13.0 and SageAttention 2.2.0. The main culprit is CUTE_GRID_CONSTANT: when this macro is correctly enabled, some kernel parameters require 128-byte alignment, which conflicts with MSVC, which only supports 16-byte alignment for function parameters.
  3. To fix this, I need to change how some kernels receive their parameters. This is not a simple modification: the kernels must take their parameters by pointer instead of as a packed by-value struct. That means using cudaMalloc, which might severely impact performance, but I may have no choice.

After making these modifications locally, the performance of SageAttention3 seems to have decreased, while SageAttention 2.2.0 seems faster than before; I haven't figured out why yet. On my 5090 their speeds are now similar, and I'm not sure whether that is expected.

If I can't do better, I might upload my code first.

In short, I have fixed it locally, but the changes are larger than the minimal modification I was aiming for, and I'm not sure whether they will cost too much performance. I can submit the code first so everyone can take a look.

@woct0rdho

That's great news! I guess there should be a way to do 128-byte alignment in MSVC (and when MSVC is called in nvcc). What if you add something like alignas(128) or __declspec(align(128)) to the data structure that needs the alignment?

@mengqin mengqin commented Dec 14, 2025

> That's great news! I guess there should be a way to do 128-byte alignment in MSVC (and when MSVC is called in nvcc). What if you add something like alignas(128) or __declspec(align(128)) to the data structure that needs the alignment?

The problem isn't data-structure alignment, but kernel function parameter alignment. Under CUDA 13.0, some CUTLASS macros apply alignas(128) specifiers, which MSVC does not allow on function parameters; MSVC permits at most 16-byte alignment there. Honestly, I don't think there's a direct solution. I can only modify the relevant kernel interfaces, changing them from passing by value to passing by pointer.
I have already submitted the code; you can refer to it.

@tlennon-ie

Thank you for your PR, I was able to build the following:

  • SageAttention: 3 (Blackwell)
  • Python: 3.13.x (cp313)
  • PyTorch: 2.9.0+cu130
  • CUDA Toolkit: 13.0
  • Platform: Windows x86_64

If anyone else wants it, here's the uploaded wheel if it suits your versions:
https://huggingface.co/tlennon-ie/sageattn3-1.0.0-py3.13-torch2.9.0cu130-cuda130-win_amd64.whl
