
Added support and fixed parallel scan in CSR kernels for Blackwell (SM_120) architecture. Added extra CUDA matrix tests. #2012

Open

spiralbit wants to merge 4 commits into ginkgo-project:develop from spiralbit:fix/cuda-matrix-blackwell

Conversation


@spiralbit commented May 3, 2026

These changes fix the issues reported in #1981. Although the architecture in that report differs from mine (Blackwell), the failure mode is the same. It would be good to know whether the fixes here also resolve the DGX Spark issues.

This is my first real contribution to the code base. I'm still learning my way around, so more experienced contributors may well prefer different changes, but it was fun to investigate and fix the matrix issues thrown up by NVIDIA's Blackwell SM_120 architecture.

Claude Code was used extensively for the CSR kernel code analysis and did a lot of the heavy lifting in iterating over the possible sources of the error. Claude's suggested fix, avoiding a parallel scan in the block_segment_scan_reverse function, is, I hope, acceptable. Claude said:

block_segment_scan_reverse parallel scan broken on sm_120 (line ~105): The Hillis-Steele parallel prefix scan with shared memory read-modify-write inside a __forceinline__ function produces wrong values on Blackwell/NVCC 13.2. Replaced with a serial scan by thread 0 — correct on all architectures, with negligible performance impact (SpMV is memory-bound; the scan is a tiny fraction).

I don't share Claude's conviction that it is "correct on all architectures" because I don't have all architectures here on my desktop!

Anyway, this is a good starting point for a discussion about potential fixes, even if they don't take this exact form; at least Claude has found the source of the issue. Both the matrix_cuda and csr_kernels2_cuda tests now pass on Blackwell with these fixes.
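For reference, here is a simplified sketch of the two patterns. Both are unsegmented, forward scans, unlike the real block_segment_scan_reverse, which also tracks segment boundaries via row indices; the function names, the double element type, and the shared-memory setup are illustrative assumptions, not the PR diff:

```cuda
// Hillis-Steele scan: cross-thread read-modify-write on shared memory.
// This is the pattern that produced wrong values on sm_120. The caller
// must __syncthreads() after initializing sdata (blockDim.x doubles).
__device__ void hillis_steele_scan(double* sdata)
{
    for (int i = 1; i < blockDim.x; i <<= 1) {
        // Phase 1: every participating thread reads its partner's value.
        double lower = (threadIdx.x >= i) ? sdata[threadIdx.x - i] : 0.0;
        __syncthreads();  // all reads finish before any writes
        // Phase 2: accumulate.
        if (threadIdx.x >= i) {
            sdata[threadIdx.x] += lower;
        }
        __syncthreads();  // all writes finish before the next round
    }
}

// Serial scan by thread 0: trivially correct on any architecture, and
// cheap next to a memory-bound SpMV.
__device__ void serial_scan(double* sdata)
{
    if (threadIdx.x == 0) {
        for (int i = 1; i < blockDim.x; ++i) {
            sdata[i] += sdata[i - 1];
        }
    }
    __syncthreads();
}
```

The trade-off is deliberate: the serial loop serializes blockDim.x additions, but since SpMV is memory-bound, the scan is a negligible fraction of the kernel's runtime.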

I also prompted Claude to create a suite of tests to expose the problem, and they are included in this PR. They could be made hardware-agnostic, I suppose; I would appreciate some guidance on how to do that. They would perhaps be more meaningful that way, or perhaps it's better to keep them CUDA-specific. Either way, I'm happy to move, remove, or rewrite them.

spiralbit marked this pull request as ready for review May 3, 2026 10:58
@Slaedr (Contributor) commented May 4, 2026

I did a quick check and indeed, this fixes all the tests that were failing in #1981 on the NVIDIA GB10.

@yhmtsai (Member) commented May 4, 2026

Hi @spiralbit, thanks for your contribution and welcome to Ginkgo.
(reg_ind != ind[threadIdx.x + 1]) in the code will give an out-of-bounds access.
Because this usage pattern on shared memory is common, we need to figure out why it is wrong and whether we are indeed fixing the right place. Otherwise, shared-memory access will cause trouble in the rest of the code.
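A bounds-guarded version of the check would look roughly like this (a sketch only; the helper name and the assumption that ind holds blockDim.x entries in shared memory are illustrative, not the exact Ginkgo code):

```cuda
__device__ bool segment_ends_here(int reg_ind, const int* ind)
{
    // Short-circuit: the last thread in the block never dereferences
    // ind[threadIdx.x + 1], which would be one element past the end.
    return threadIdx.x == blockDim.x - 1 || reg_ind != ind[threadIdx.x + 1];
}
```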

Regarding AI assistants: in my opinion at least, the contributor needs to understand the code (how and why) and verify the references.
For example, did you find any reference showing that code generation causes the issue, or a minimal reproducer? That would be helpful for the PR and for future reference.

@spiralbit (Author) commented:

For background, here is the last part of Claude's reasoning for this fix. Before this section there were many trials of various other code paths. In this section there were a number of test runs to check whether the proposed parallel fixes would work, and Claude found they would not on Blackwell:

The volatile approach doesn't work because ValueType might not support assignment from volatile ValueType&. This is a common issue with complex types. For float and double it should work, but there might be some type-related issue.

Actually, the error is "no operator = matches these operands" for temp = vval[threadIdx.x - i]. This is because vval is volatile ValueType*, so vval[...] returns a volatile ValueType&. For primitive types like float/double, this should work... but maybe not with the zero() template function or other Ginkgo types.

I could try casting away the volatile qualifier or using CUDA's built-in volatile load functions, but the most portable approach would be to use inline assembly for explicit volatile reads from shared memory.

Actually, I'm overcomplicating this—the serial scan approach is simpler and works fine for memory-bound operations like SpMV. Let me just go back to that.

Wait, let me reconsider the volatile semantics more carefully. When I declare temp as a non-volatile ValueType and assign from a volatile pointer, the compiler should handle the volatile read correctly and then store the result into the non-volatile variable. That should work without any special tricks.

The real problem is the += operation on the volatile location—the compiler can't find an overload that accepts a volatile reference on the left side, especially if ValueType is a custom Ginkgo type that doesn't define operator+= for volatile operands.

The cleanest approach is to avoid volatile altogether and use explicit loads and stores instead, leveraging CUDA's cache control instructions for the memory semantics we need. I'm sketching out a version that casts away the volatile qualifier to perform the read, then does a separate volatile write-modify operation.

Still writing the fix...

Actually, I'm realizing the real issue might not be the volatile syntax at all. Since __syncthreads() acts as a full memory fence and prevents the compiler from reordering shared memory accesses across it, the reads after the sync should already see the latest writes. The serial scan works fine, which confirms the sync itself is functioning correctly, so something else must be causing the discrepancy. Let me trace through what the serial scan is actually doing—it's a straightforward sequential loop where each thread compares consecutive indices and accumulates values.

But in the parallel scan, different threads are accessing shared memory across iterations, which requires that synchronization to work properly. If the parallel scan is failing while the serial scan succeeds, and both use the same __syncthreads(), then either there's a Blackwell-specific bug with the synchronization primitive itself, or the issue lies elsewhere in how the parallel algorithm is structured.

I've tested multiple approaches and the parallel scan consistently fails on Blackwell, so I'm going to stick with the serial implementation and revert the volatile changes.

Actually, let me think about this differently—what if I use warp shuffles to avoid the problematic cross-thread shared memory access pattern? I can compute local prefix sums within each warp using register-to-register communication via __shfl_up_sync, then carefully handle the cross-warp communication through shared memory.

Building the warp-shuffle based scan: compute prefix sums within each warp, scan the 4 warp totals (serially or with 4 threads), then add the appropriate prefix to each thread's value.

But this requires handling segment boundaries correctly, which is more complex because segments can cross warp boundaries.

This is getting too complex. Let me just go with the serial scan.

Actually, let me try one more parallel approach: warp-shuffle based scan. The key insight is that within a warp, __shfl_up_sync with a barrier mask (0xffffffff) provides correct cross-thread values without shared memory.

For the segment scan, I'm using shuffle operations to compute prefix sums within each warp, checking segment boundaries as I go. The inter-warp pass would use shared memory to handle values across warp boundaries, but there's a problem: the segment boundary check doesn't work for cross-warp accesses since I can't directly compare indices across warps with shuffle operations.

So I'm splitting this into two stages—within-warp shuffles for the first five iterations, then shared memory for the final two iterations to handle cross-warp segment boundaries. I'm realizing the cross-warp stage still has the same problematic pattern even with larger stride values, so the issue isn't specific to certain i values—the whole parallel scan approach is fundamentally broken on this architecture. I'll just stick with the serial scan implementation since it works reliably and the performance cost should be minimal for typical workloads.

The volatile approach has type-operator issues with generic ValueType. The parallel scan is fundamentally broken on Blackwell sm_120 — not the read-back, but the scan itself. All fix attempts (no unroll, __syncthreads, extra sync after) fail. The serial scan works and is correct. Let me revert to the serial scan as the definitive fix:
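To make the warp-shuffle idea in the transcript concrete, here is a minimal within-warp inclusive scan of the kind Claude describes. This is my illustration, not Claude's code and not what is in this PR; the helper name and double element type are assumptions:

```cuda
// Within-warp inclusive prefix sum via __shfl_up_sync; values move
// register-to-register, so no shared-memory read-modify-write is involved.
__device__ double warp_inclusive_scan(double val)
{
    const unsigned full_mask = 0xffffffffu;
    const int lane = threadIdx.x & 31;  // lane id within the warp
    for (int offset = 1; offset < 32; offset <<= 1) {
        // Every lane participates in the shuffle; only lanes with a valid
        // source (lane >= offset) fold the incoming value in.
        double lower = __shfl_up_sync(full_mask, val, offset);
        if (lane >= offset) {
            val += lower;
        }
    }
    return val;  // lane k now holds the sum of lanes 0..k
}
```

The appeal is that the shuffles sidestep shared memory entirely; the hard part, as the transcript notes, is handling segments that cross warp boundaries.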

@spiralbit (Author) commented:

Hi @spiralbit, thanks for your contribution and welcome to Ginkgo. (reg_ind != ind[threadIdx.x + 1]) in the code will give an out-of-bounds access. Because this usage pattern on shared memory is common, we need to figure out why it is wrong and whether we are indeed fixing the right place. Otherwise, shared-memory access will cause trouble in the rest of the code.

Good catch! I've pushed a fix for that.

Regarding AI assistants: in my opinion at least, the contributor needs to understand the code (how and why) and verify the references.

For me, Claude is a great help in understanding the code and finding the sources of test failures. I know a little CUDA and linear algebra, but I'm still learning. I would like a clean run of the tests so I can get properly started on running Ginkgo and analysing it. I have another PR open here; could you please take a look: #2009

For example, did you find any reference showing that code generation causes the issue, or a minimal reproducer? That would be helpful for the PR and for future reference.

I've pasted in Claude's background cogitations above; hopefully that is helpful. I don't think we have really uncovered the underlying bug yet. Why should this parallel code fail on Blackwell? That's quite a deep question, but it would be good to debug it fully, because there may be other parts of the code where it also fails in ways that are not immediately obvious. I've raised this PR to highlight the issue and provoke discussion. We can accept this fix, or use it as a starting point to look deeper.

@spiralbit (Author) commented:

Actually, good news, in a way: I've done some more testing and it seems the bug is triggered by the __restrict__ qualifier on val. For the moment I've reverted the code change in the block_segment_scan_reverse function and just removed the __restrict__. This is the simplest fix for this architecture. I will report this to the NVIDIA folks and see what they think.
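Roughly, the change has this shape; the kernel below is an illustrative stand-in, not the exact Ginkgo CSR kernel signature:

```cuda
// Before: __restrict__ promises NVCC that val does not alias any other
// pointer argument, which appears to trigger the bad codegen on sm_120.
__global__ void kernel_before(const double* __restrict__ val, double* out)
{
    out[threadIdx.x] = 2.0 * val[threadIdx.x];
}

// After: the identical kernel with the qualifier dropped from val.
__global__ void kernel_after(const double* val, double* out)
{
    out[threadIdx.x] = 2.0 * val[threadIdx.x];
}
```

Removing __restrict__ only relaxes the compiler's aliasing assumptions, so it should be a safe, if slightly pessimizing, workaround.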

@spiralbit (Author) commented:

I've raised an official bug report with NVIDIA here: https://developer.nvidia.com/bugs/6155374. If they acknowledge and fix this, the workaround in this PR may become superfluous.
