Conversation
This is great! A few high-level comments before a deeper review can happen (note that I'm not the maintainer of this repo, so feel free to ignore):
@jammm I'm surprised to see an AMD dev here. Hope @eliotwang considers your points so we AMD users can benefit from SageAttention V2 in the future.
Can this be made to work with RDNA2 too? SageAttention 1 previously worked on RDNA2, but no longer does with ROCm 7.x.
rocWMMA is specific to GPUs that support WMMA or MFMA, so probably not.
Tried it on my gfx1200 with ROCm 7, but as far as I can see, it doesn't include rocWMMA. To be clear: this can't possibly run on Windows, right?
TheRock does bundle rocWMMA now (though not for all archs; see https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake for the targets rocWMMA is listed under).
Oh. I couldn't find a folder or file whose name contains "rocwmma" anywhere on my PC using a search tool like WizFile. How am I supposed to locate it?
It was added a while ago (ROCm/TheRock#1938). rocWMMA has since been moved to rocm-libraries.
I see, thanks for the explanation. I suppose gfx1200 should be supported, so I'll take another look. So basically, there isn't yet a …
Add --pre to your pip arguments? |
Nevermind: now I actually have it.
Once again got a build error, caused by SageAttention/csrc/qattn/rocm/sgattn.hip (line 36, commit bed8991): the file included there is no longer shipped with rocWMMA 2 for ROCm 7. I assume this PR was created against rocWMMA 1.7 for ROCm 6.4 or earlier. Additionally, there are several other compatibility issues stemming from AMD's changes in rocWMMA 2.
|
There's a guide here on migrating to 2.0, if that helps: https://rocm.docs.amd.com/projects/rocWMMA/en/latest/conceptual/migration-guide.html
In fact, I tried implementing SageAttention with rocWMMA 2.0. However, frustratingly, using CogVideoX-2B as an example: on the 9070, the end-to-end performance dropped from being ~30% faster than SageAttention v1 to essentially on par with SageAttention v1; on MI300X, the end-to-end performance slightly regressed. The reason for the slowdown is not clear yet. From a performance perspective, I’d still recommend basing SageAttention on rocWMMA 1.7. |
I see. Sadly, I'm Windows-only, and we don't have ROCm 6.4 from TheRock there. The earliest available version is 7, which comes with rocWMMA 2.0. I'm kind of curious what an AMD dev would say about this performance regression, because theoretically it shouldn't happen. |
Maybe @jammm can help out with that :)
Okay, this is SageAttention implemented using rocWMMA 2.0 (gfx12-only, and the code isn't clean). Hope it helps you: https://github.com/eliotwang/sgattn_rocwmma2.0
Wow, huge thanks for taking the time to do this!
I guess this won't work on RDNA3 yet.
I was able to build Sage 2 with rocWMMA 2 on Windows (ROCm 7) using #332 (comment), but cosine similarity is too low. |
Yeah, according to the documentation, only gfx12 (RDNA4) supports FP8: https://rocm.docs.amd.com/projects/rocWMMA/en/latest/api-reference/api-reference-guide.html
I guess it could work like Ampere does, with INT8 and INT4, as in SageAttention2. If it can utilize INT4, there would be quite a bit of improvement as well.
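For anyone unfamiliar with what "INT8 like Ampere" means in practice here: SageAttention quantizes Q/K tiles with one scale per block before the integer matmul. A plain-C++ sketch of per-block symmetric INT8 quantization (illustrative only; the struct and function names are mine, not the actual kernel code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Per-block symmetric INT8 quantization: one float scale per block,
// values mapped into [-127, 127]. The kernels do this per tile so the
// expensive matmul can run on integer (WMMA/MFMA) units.
struct QuantBlock {
    std::vector<int8_t> q;  // quantized values
    float scale;            // dequantization scale for the whole block
};

QuantBlock quantize_block(const std::vector<float>& x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    // Guard against an all-zero block.
    float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    QuantBlock out{std::vector<int8_t>(x.size()), scale};
    for (size_t i = 0; i < x.size(); ++i)
        out.q[i] = static_cast<int8_t>(std::lround(x[i] / scale));
    return out;
}

float dequantize(int8_t q, float scale) { return q * scale; }
```

INT4 would follow the same pattern with a [-7, 7] range, which is why the quantization error (and the cosine-similarity concern above) grows accordingly.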
You can try this: thu-ml/SpargeAttn#108 |
Thank you for your suggestions! I've made changes based on your feedback as far as possible and have partially unified the CUDA/ROCm code. Regarding your suggestion to keep the kernel launch style consistent with CUDA: cudaFuncSetAttribute(kernel, ...) has special handling for kernel symbols on CUDA, while HIP does not provide an equivalent implicit conversion, so I'm keeping the original approach.
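For context on that launch-style point: CUDA's runtime headers provide a templated overload of cudaFuncSetAttribute that accepts the kernel symbol directly, while HIP's hipFuncSetAttribute takes a const void*, so the symbol needs an explicit cast. A sketch of the difference (the kernel name is hypothetical; this won't compile outside a CUDA/HIP toolchain):

```cpp
__global__ void my_kernel(float* out);  // hypothetical kernel

// CUDA: the templated overload accepts the kernel symbol directly.
cudaFuncSetAttribute(my_kernel,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);

// HIP: hipFuncSetAttribute takes const void*, so an explicit cast is needed.
hipFuncSetAttribute(reinterpret_cast<const void*>(&my_kernel),
                    hipFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
```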
Hello, I used the code you provided with a 9070 XT card in a Linux ROCm 7.11 environment and successfully compiled Sage 2, which is very exciting. But I always get memory errors when using Sage 2 attention in ComfyUI. Do you know what might cause this? The error message is below. Thank you very much. loaded partially; 8125.29 MB usable, 8042.93 MB loaded, 8496.66 MB offloaded, 75.01 MB buffer reserved, lowvram patches: 0
This is a rocWMMA-based implementation of SageAttention. Its interface uses the rocwmma::fragment API, rather than a PTX-based implementation like the CUDA version.
Performance: taking CogVideoX-2B as an example, it delivers ~30% better end-to-end performance on the 9070 than SageAttention V1.
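For readers unfamiliar with the fragment API named above, a minimal rocWMMA tile multiply-accumulate looks roughly like this (a simplified sketch using the standard rocwmma::fragment interface, not the PR's actual kernel; requires the ROCm toolchain to build):

```cpp
#include <hip/hip_runtime.h>
#include <rocwmma/rocwmma.hpp>

// One 16x16x16 tile multiply-accumulate per wavefront, C = A * B,
// using the portable fragment API instead of inline assembly/PTX.
__global__ void tile_mma(const rocwmma::float16_t* a,
                         const rocwmma::float16_t* b,
                         rocwmma::float32_t* c,
                         int lda, int ldb, int ldc) {
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16,
                      rocwmma::float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16,
                      rocwmma::float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16,
                      rocwmma::float32_t> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);          // zero the accumulator
    rocwmma::load_matrix_sync(fragA, a, lda);       // load A tile
    rocwmma::load_matrix_sync(fragB, b, ldb);       // load B tile
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc);
    rocwmma::store_matrix_sync(c, fragAcc, ldc, rocwmma::mem_row_major);
}
```

Because this API mirrors nvcuda::wmma, the same kernel structure maps onto both RDNA (WMMA) and CDNA (MFMA) hardware, which is what makes the port portable across gfx11/gfx12 and MI300X.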