Added support and fixed parallel scan in CSR kernels for Blackwell (SM_120) architecture. Added extra CUDA matrix tests. #2012
I did a quick check, and this does indeed fix all the tests that were failing in #1981 on the NVIDIA GB10.
Hi @spiralbit, thanks for your contribution and welcome to Ginkgo. When an AI assistant is used, at least in my opinion, the contributor still needs to understand the code (how and why it works) and verify the reference.
For background, here is the last part of Claude's reasoning for this fix. Prior to this section there were a lot of trials on various other code paths. This section covers a number of test runs to check whether the proposed parallel fix would work, which Claude found it would not on Blackwell:
Good catch! I've pushed a fix for that.
For me, Claude is a great help in understanding the code and finding the source of test failures. I know a little CUDA and linear algebra, but I'm still learning. I would like a clean run of the tests so I can get properly started on running Ginkgo and analysing it. I have another PR open here; could you please take a look: #2009
I've pasted in Claude's background cogitations. Hopefully that is helpful. I don't think we have uncovered the real underlying bug yet. Why should this parallel code fail on Blackwell? That's quite a deep question, but it would be good to debug it fully, because other parts of the code might be failing in the same way without it being immediately obvious. I've raised this PR to highlight the issue and provoke discussion. We can accept this fix, or use it to look deeper.
Actually good news, in a way, I've done some more testing and it seems that the bug is triggered by the |
I've filed an official bug report with NVIDIA here: https://developer.nvidia.com/bugs/6155374. If they acknowledge and fix this, then the workaround in this PR may become superfluous.
These changes fix the issues reported in #1981. Although the architecture in that report is different from mine (Blackwell), the failure mode is the same. It would be good to know whether the fixes here also resolve the DGX Spark issues.
This is my first real contribution to the code base. I'm still learning my way around, so more experienced people here may well prefer different changes, but it was fun to investigate and fix the matrix issues thrown up by NVIDIA's Blackwell SM_120 architecture.
Claude Code was used extensively for the CSR kernel code analysis and did a lot of the heavy lifting in iterating over the possible sources of the error. Claude's suggested fix of avoiding a parallel scan in the block_segment_scan_reverse function is, I hope, acceptable. Claude said:
block_segment_scan_reverse parallel scan broken on sm_120 (line ~105): The Hillis-Steele parallel prefix scan with shared memory read-modify-write inside a __forceinline__ function produces wrong values on Blackwell/NVCC 13.2. Replaced with a serial scan by thread 0: correct on all architectures, with negligible performance impact (SpMV is memory-bound; the scan is a tiny fraction).
I don't share Claude's conviction that it is "correct on all architectures", because I don't have all architectures here on my desktop!
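For anyone following along who hasn't looked at the two scan strategies side by side, here is a minimal host-side C++ sketch of what is being swapped. It is not the Ginkgo kernel: the function names serial_segment_scan and hillis_steele_segment_scan are hypothetical, the scan runs forward rather than in reverse, and it assumes the row indices are sorted so that each segment is a contiguous run. The separate read and write phases in the second function stand in for the two block-level synchronisations a shared-memory version would need.

```cpp
#include <cstddef>
#include <vector>

// Serial segmented inclusive scan (the "thread 0 does everything" variant).
// Assumption: ind is non-decreasing, so equal neighbours share a segment.
std::vector<double> serial_segment_scan(const std::vector<int>& ind,
                                        std::vector<double> val)
{
    for (std::size_t i = 1; i < val.size(); ++i) {
        if (ind[i] == ind[i - 1]) {
            val[i] += val[i - 1];
        }
    }
    return val;
}

// Hillis-Steele segmented inclusive scan, modelled on the host.
// Each round doubles the stride; every element first reads its partner
// into temp (read phase), then all elements update val (write phase).
// On a GPU the two phases would be separated by block synchronisation.
std::vector<double> hillis_steele_segment_scan(const std::vector<int>& ind,
                                               std::vector<double> val)
{
    const std::size_t n = val.size();
    std::vector<double> temp(n);
    for (std::size_t stride = 1; stride < n; stride <<= 1) {
        // Read phase: pick up the partner's value if it is in the same segment.
        for (std::size_t i = 0; i < n; ++i) {
            const bool same_segment =
                i >= stride && ind[i] == ind[i - stride];
            temp[i] = same_segment ? val[i - stride] : 0.0;
        }
        // Write phase: only now is val modified.
        for (std::size_t i = 0; i < n; ++i) {
            val[i] += temp[i];
        }
    }
    return val;
}
```

On a correct implementation both functions should return identical results for any sorted index array, which is the property the new tests are meant to check on the device. If the read and write phases of a round were allowed to overlap, some elements would pick up already-updated partners and the sums would come out wrong, which is one way a parallel version can break while the serial one stays correct.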
Anyway, this is a good starting point for a discussion about potential fixes, even if they don't take this exact form; at least Claude has found the source of the issue. Both the matrix_cuda and csr_kernels2_cuda tests now pass on Blackwell with these fixes.
I also prompted Claude to create a suite of tests to expose the problem, and they are included in this PR. I suppose they could be made hardware-agnostic, and I would appreciate some guidance on how to do that. They might be more meaningful that way, or perhaps it's better to have CUDA-specific ones. Either way, I will be happy to move, remove or rewrite them.