Skip to content

Commit 51817a0

Browse files
authored
Check for kernel errors (#619)
1 parent 6b74ace commit 51817a0

43 files changed

Lines changed: 592 additions & 249 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.coderabbit.yaml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,12 +30,16 @@ reviews:
3030
- Memory access patterns and coalescing
3131
- Correct use of atomicAdd and synchronization
3232
- Template parameter correctness (float vs double)
33+
- MANDATORY: Every kernel launch (<<<grid, block, shared, stream>>>) MUST be followed by cudaGetLastError() to catch launch failures. Flag any kernel launch missing this check.
34+
- MANDATORY: No magic numbers. All block sizes, tile sizes, grid calculations, and thresholds must use named constants (constexpr int BLOCK_SIZE = 256). Flag any raw numeric literal in dim3, grid, or shared memory calculations.
3335
- path: "src/rapids_singlecell/**/_kernels/**"
3436
instructions: |
3537
These are CuPy RawKernel definitions. Review for:
3638
- Correct CUDA kernel launch configurations
3739
- Shared memory bounds
3840
- Type safety (float32 vs float64 mismatches)
41+
- No magic numbers in kernel launch configurations or kernel code. Block sizes, tile sizes, and thresholds must use named constants.
42+
- After RawKernel calls, check for cp.cuda.runtime.getLastError() to catch silent launch failures.
3943
- path: "tests/**"
4044
instructions: |
4145
Do not suggest changing test tolerances without strong justification.

cuda_agents.md

Lines changed: 41 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@
1919
### GPU/CUDA Errors
2020
- Race conditions in GPU kernels (shared memory, atomics)
2121
- Invalid memory access (out-of-bounds, host/device confusion)
22-
- Missing CUDA error checking after kernel launches
22+
- **Missing `cudaGetLastError()` after kernel launches**: Every kernel launch (`<<<grid, block, shared, stream>>>`) MUST be followed by `cudaGetLastError()` to detect launch failures (invalid config, shared memory overflow, etc.). Without this, errors are silently deferred and may corrupt later operations or produce garbage results.
2323
- Kernel launch with zero blocks/threads or invalid grid/block dimensions
2424
- **Template type mismatches**: kernel templated on `float` but receiving `double` data from Python
2525
- **Shared memory overflow**: exceeding device shared memory limit (varies by GPU, e.g. T4 = 64KB)
@@ -73,7 +73,7 @@
7373
### Kernel Configuration
7474
- Hard-coded shared memory sizes that may exceed device limits
7575
- Fixed tile sizes that don't adapt to device capabilities
76-
- **Magic numbers** in grid/block calculations without descriptive constants
76+
- **Magic numbers**: all numeric literals for block sizes, tile dimensions, shared memory sizes, and heuristic thresholds MUST use named constants. `dim3 block(256)` is not acceptable — use `constexpr int BLOCK_SIZE = 256; dim3 block(BLOCK_SIZE);`
7777

7878
### Test Quality
7979
- Missing validation of numerical correctness against CPU reference
@@ -141,6 +141,43 @@ int max_shared = device.attributes["MaxSharedMemoryPerBlock"];
141141
int tile = select_tile(max_shared, dtype_size);
142142
```
143143

144+
**CRITICAL** (missing cudaGetLastError):
145+
```text
146+
CRITICAL: Missing cudaGetLastError() after kernel launch
147+
148+
Issue: Kernel launched without error checking — launch failures are silently deferred
149+
Why: Invalid grid/block config, shared memory overflow, or other launch errors go undetected
150+
Impact: Garbage results that look like algorithm bugs, not CUDA errors
151+
152+
Bad:
153+
my_kernel<<<grid, block, shared_mem, stream>>>(...);
154+
155+
Good:
156+
my_kernel<<<grid, block, shared_mem, stream>>>(...);
157+
cudaError_t err = cudaGetLastError();
158+
if (err != cudaSuccess) {
159+
throw std::runtime_error(std::string("Kernel launch failed: ") + cudaGetErrorString(err));
160+
}
161+
```
162+
163+
**HIGH** (magic numbers):
164+
```text
165+
HIGH: Magic numbers in kernel configuration
166+
167+
Issue: `dim3 block(64)` and `dim3 grid((n + 63) / 64)` use raw numeric literals
168+
Why: Obscures intent, error-prone when changing, harder to review
169+
Impact: Maintainability and correctness risk
170+
171+
Bad:
172+
dim3 block(64);
173+
dim3 grid((n + 63) / 64);
174+
175+
Good:
176+
constexpr int BLOCK_SIZE = 64;
177+
dim3 block(BLOCK_SIZE);
178+
dim3 grid((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
179+
```
180+
144181
**CRITICAL** (missing syncthreads):
145182
```text
146183
CRITICAL: Missing __syncthreads() between shared memory write and read
@@ -266,8 +303,9 @@ module_name/
266303
### When Reviewing Nanobind Bindings (.cu files)
267304
- [ ] Is the template type `T` dispatched correctly based on array dtype?
268305
- [ ] Are array dimensions validated before kernel launch?
269-
- [ ] Is error checking done after CUDA calls?
306+
- [ ] Is `cudaGetLastError()` called after every kernel launch to catch launch failures?
270307
- [ ] Are DLPack/array interface conversions correct?
308+
- [ ] Are all numeric literals for block sizes, tile sizes, and thresholds defined as named constants?
271309

272310
### When Reviewing CuPy RawKernels (_kernels/*.py)
273311
- [ ] Is the kernel string syntactically correct CUDA C?

python_agents.md

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -93,12 +93,21 @@ At millions of cells, numerical edge cases that "never happen" on small data bec
9393
- Unsafe deserialization of data files
9494
- Missing bounds checking allowing resource exhaustion
9595

96+
### Magic Numbers
97+
- Hard-coded numeric literals (128, 256, 512, 1024, etc.) in kernel configurations, thresholds, or tile sizes without named constants
98+
- Use descriptive constants: `BLOCK_SIZE = 256`, `SHARED_MEM_THRESHOLD = 48 * 1024`
99+
- Tile sizes, block dimensions, and heuristic thresholds must all be named
100+
101+
### Missing Kernel Error Checking
102+
- After calling nanobind CUDA kernel wrappers from Python, the next CuPy operation may silently consume a pending CUDA error
103+
- After RawKernel launches, call `cp.cuda.runtime.getLastError()` to surface launch failures immediately (e.g., shared memory overflow, invalid grid dimensions)
104+
- This is especially important in development and testing — a kernel that silently fails produces garbage results that look like algorithm bugs
105+
96106
## MEDIUM Issues (Comment Selectively)
97107

98108
- Edge cases not handled (empty AnnData, single observation)
99109
- Deprecated API usage
100110
- Minor inefficiencies in non-critical code paths
101-
- Magic numbers without descriptive constant names
102111

103112
## Review Protocol
104113

src/rapids_singlecell/_cuda/aggr/aggr.cu

Lines changed: 15 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,16 +5,20 @@
55

66
using namespace nb::literals;
77

8+
constexpr int BLOCK_SIZE_SPARSE = 64;
9+
constexpr int BLOCK_SIZE_DENSE = 256;
10+
811
template <typename T>
912
static inline void launch_csr_aggr(const int* indptr, const int* index,
1013
const T* data, double* out, const int* cats,
1114
const bool* mask, size_t n_cells,
1215
size_t n_genes, size_t n_groups,
1316
cudaStream_t stream) {
1417
dim3 grid((unsigned)n_cells);
15-
dim3 block(64);
18+
dim3 block(BLOCK_SIZE_SPARSE);
1619
csr_aggr_kernel<T><<<grid, block, 0, stream>>>(
1720
indptr, index, data, out, cats, mask, n_cells, n_genes, n_groups);
21+
CUDA_CHECK_LAST_ERROR(csr_aggr_kernel);
1822
}
1923

2024
template <typename T>
@@ -24,31 +28,34 @@ static inline void launch_csc_aggr(const int* indptr, const int* index,
2428
size_t n_genes, size_t n_groups,
2529
cudaStream_t stream) {
2630
dim3 grid((unsigned)n_genes);
27-
dim3 block(64);
31+
dim3 block(BLOCK_SIZE_SPARSE);
2832
csc_aggr_kernel<T><<<grid, block, 0, stream>>>(
2933
indptr, index, data, out, cats, mask, n_cells, n_genes, n_groups);
34+
CUDA_CHECK_LAST_ERROR(csc_aggr_kernel);
3035
}
3136

3237
template <typename T>
3338
static inline void launch_dense_aggr_C(const T* data, double* out,
3439
const int* cats, const bool* mask,
3540
size_t n_cells, size_t n_genes,
3641
size_t n_groups, cudaStream_t stream) {
37-
dim3 block(256);
42+
dim3 block(BLOCK_SIZE_DENSE);
3843
dim3 grid((unsigned)((n_cells * n_genes + block.x - 1) / block.x));
3944
dense_aggr_kernel_C<T><<<grid, block, 0, stream>>>(
4045
data, out, cats, mask, n_cells, n_genes, n_groups);
46+
CUDA_CHECK_LAST_ERROR(dense_aggr_kernel_C);
4147
}
4248

4349
template <typename T>
4450
static inline void launch_dense_aggr_F(const T* data, double* out,
4551
const int* cats, const bool* mask,
4652
size_t n_cells, size_t n_genes,
4753
size_t n_groups, cudaStream_t stream) {
48-
dim3 block(256);
54+
dim3 block(BLOCK_SIZE_DENSE);
4955
dim3 grid((unsigned)((n_cells * n_genes + block.x - 1) / block.x));
5056
dense_aggr_kernel_F<T><<<grid, block, 0, stream>>>(
5157
data, out, cats, mask, n_cells, n_genes, n_groups);
58+
CUDA_CHECK_LAST_ERROR(dense_aggr_kernel_F);
5259
}
5360

5461
template <typename T>
@@ -58,19 +65,21 @@ static inline void launch_csr_to_coo(const int* indptr, const int* index,
5865
const bool* mask, int n_cells,
5966
cudaStream_t stream) {
6067
dim3 grid((unsigned)n_cells);
61-
dim3 block(64);
68+
dim3 block(BLOCK_SIZE_SPARSE);
6269
csr_to_coo_kernel<T><<<grid, block, 0, stream>>>(
6370
indptr, index, data, row, col, ndata, cats, mask, n_cells);
71+
CUDA_CHECK_LAST_ERROR(csr_to_coo_kernel);
6472
}
6573

6674
static inline void launch_sparse_var(const int* indptr, const int* index,
6775
double* data, const double* mean_data,
6876
double* n_cells, int dof, int n_groups,
6977
cudaStream_t stream) {
7078
dim3 grid((unsigned)n_groups);
71-
dim3 block(64);
79+
dim3 block(BLOCK_SIZE_SPARSE);
7280
sparse_var_kernel<<<grid, block, 0, stream>>>(
7381
indptr, index, data, mean_data, n_cells, dof, n_groups);
82+
CUDA_CHECK_LAST_ERROR(sparse_var_kernel);
7483
}
7584

7685
template <typename T, typename Device>

src/rapids_singlecell/_cuda/aucell/aucell.cu

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -36,10 +36,12 @@ static inline void launch_auc(const int* ranks, int R, int C, const int* cnct,
3636
const int* starts, const int* lens, int n_sets,
3737
int n_up, const float* max_aucs, float* es,
3838
cudaStream_t stream) {
39-
dim3 block(32);
40-
dim3 grid((unsigned)n_sets, (unsigned)((R + block.x - 1) / block.x));
39+
constexpr int BLOCK_SIZE = 32;
40+
dim3 block(BLOCK_SIZE);
41+
dim3 grid((unsigned)n_sets, (unsigned)((R + BLOCK_SIZE - 1) / BLOCK_SIZE));
4142
auc_kernel<<<grid, block, 0, stream>>>(ranks, R, C, cnct, starts, lens,
4243
n_sets, n_up, max_aucs, es);
44+
CUDA_CHECK_LAST_ERROR(auc_kernel);
4345
}
4446

4547
template <typename Device>

src/rapids_singlecell/_cuda/autocorr/autocorr.cu

Lines changed: 19 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -5,18 +5,23 @@
55

66
using namespace nb::literals;
77

8+
constexpr int DENSE_BLOCK_DIM = 8;
9+
constexpr int SPARSE_BLOCK_SIZE = 1024;
10+
constexpr int ELEMENTWISE_BLOCK_SIZE = 32;
11+
812
template <typename T>
913
static inline void launch_morans_dense(const T* data_centered,
1014
const int* adj_row_ptr,
1115
const int* adj_col_ind,
1216
const T* adj_data, T* num, int n_samples,
1317
int n_features, cudaStream_t stream) {
14-
dim3 block(8, 8);
15-
dim3 grid((n_features + block.x - 1) / block.x,
16-
(n_samples + block.y - 1) / block.y);
18+
dim3 block(DENSE_BLOCK_DIM, DENSE_BLOCK_DIM);
19+
dim3 grid((n_features + DENSE_BLOCK_DIM - 1) / DENSE_BLOCK_DIM,
20+
(n_samples + DENSE_BLOCK_DIM - 1) / DENSE_BLOCK_DIM);
1721
morans_I_num_dense_kernel<<<grid, block, 0, stream>>>(
1822
data_centered, adj_row_ptr, adj_col_ind, adj_data, num, n_samples,
1923
n_features);
24+
CUDA_CHECK_LAST_ERROR(morans_I_num_dense_kernel);
2025
}
2126

2227
template <typename T>
@@ -25,46 +30,50 @@ static inline void launch_morans_sparse(
2530
const int* data_row_ptr, const int* data_col_ind, const T* data_values,
2631
int n_samples, int n_features, const T* mean_array, T* num,
2732
cudaStream_t stream) {
28-
dim3 block(1024);
33+
dim3 block(SPARSE_BLOCK_SIZE);
2934
dim3 grid(n_samples);
3035
morans_I_num_sparse_kernel<<<grid, block, 0, stream>>>(
3136
adj_row_ptr, adj_col_ind, adj_data, data_row_ptr, data_col_ind,
3237
data_values, n_samples, n_features, mean_array, num);
38+
CUDA_CHECK_LAST_ERROR(morans_I_num_sparse_kernel);
3339
}
3440

3541
template <typename T>
3642
static inline void launch_gearys_dense(const T* data, const int* adj_row_ptr,
3743
const int* adj_col_ind,
3844
const T* adj_data, T* num, int n_samples,
3945
int n_features, cudaStream_t stream) {
40-
dim3 block(8, 8);
41-
dim3 grid((n_features + block.x - 1) / block.x,
42-
(n_samples + block.y - 1) / block.y);
46+
dim3 block(DENSE_BLOCK_DIM, DENSE_BLOCK_DIM);
47+
dim3 grid((n_features + DENSE_BLOCK_DIM - 1) / DENSE_BLOCK_DIM,
48+
(n_samples + DENSE_BLOCK_DIM - 1) / DENSE_BLOCK_DIM);
4349
gearys_C_num_dense_kernel<<<grid, block, 0, stream>>>(
4450
data, adj_row_ptr, adj_col_ind, adj_data, num, n_samples, n_features);
51+
CUDA_CHECK_LAST_ERROR(gearys_C_num_dense_kernel);
4552
}
4653

4754
template <typename T>
4855
static inline void launch_gearys_sparse(
4956
const int* adj_row_ptr, const int* adj_col_ind, const T* adj_data,
5057
const int* data_row_ptr, const int* data_col_ind, const T* data_values,
5158
int n_samples, int n_features, T* num, cudaStream_t stream) {
52-
dim3 block(1024);
59+
dim3 block(SPARSE_BLOCK_SIZE);
5360
dim3 grid(n_samples);
5461
gearys_C_num_sparse_kernel<<<grid, block, 0, stream>>>(
5562
adj_row_ptr, adj_col_ind, adj_data, data_row_ptr, data_col_ind,
5663
data_values, n_samples, n_features, num);
64+
CUDA_CHECK_LAST_ERROR(gearys_C_num_sparse_kernel);
5765
}
5866

5967
template <typename T>
6068
static inline void launch_pre_den_sparse(const int* data_col_ind,
6169
const T* data_values, int nnz,
6270
const T* mean_array, T* den,
6371
int* counter, cudaStream_t stream) {
64-
dim3 block(32);
65-
dim3 grid((nnz + block.x - 1) / block.x);
72+
dim3 block(ELEMENTWISE_BLOCK_SIZE);
73+
dim3 grid((nnz + ELEMENTWISE_BLOCK_SIZE - 1) / ELEMENTWISE_BLOCK_SIZE);
6674
pre_den_sparse_kernel<<<grid, block, 0, stream>>>(
6775
data_col_ind, data_values, nnz, mean_array, den, counter);
76+
CUDA_CHECK_LAST_ERROR(pre_den_sparse_kernel);
6877
}
6978

7079
template <typename Device>

src/rapids_singlecell/_cuda/bbknn/bbknn.cu

Lines changed: 9 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -5,25 +5,29 @@
55

66
using namespace nb::literals;
77

8+
constexpr int BLOCK_SIZE = 64;
9+
810
static inline void launch_find_top_k_per_row(const float* data,
911
const int* indptr, int n_rows,
1012
int trim, float* vals,
1113
cudaStream_t stream) {
12-
dim3 block(64);
13-
dim3 grid((n_rows + 64 - 1) / 64);
14-
size_t shared_mem_size =
15-
static_cast<size_t>(64) * static_cast<size_t>(trim) * sizeof(float);
14+
dim3 block(BLOCK_SIZE);
15+
dim3 grid((n_rows + BLOCK_SIZE - 1) / BLOCK_SIZE);
16+
size_t shared_mem_size = static_cast<size_t>(BLOCK_SIZE) *
17+
static_cast<size_t>(trim) * sizeof(float);
1618
find_top_k_per_row_kernel<<<grid, block, shared_mem_size, stream>>>(
1719
data, indptr, n_rows, trim, vals);
20+
CUDA_CHECK_LAST_ERROR(find_top_k_per_row_kernel);
1821
}
1922

2023
static inline void launch_cut_smaller(int* indptr, int* index, float* data,
2124
float* vals, int n_rows,
2225
cudaStream_t stream) {
2326
dim3 grid(n_rows);
24-
dim3 block(64);
27+
dim3 block(BLOCK_SIZE);
2528
cut_smaller_kernel<<<grid, block, 0, stream>>>(indptr, index, data, vals,
2629
n_rows);
30+
CUDA_CHECK_LAST_ERROR(cut_smaller_kernel);
2731
}
2832

2933
template <typename Device>

src/rapids_singlecell/_cuda/cooc/cooc.cu

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -122,6 +122,7 @@ static void launch_csr_catpairs_kernel(
122122
<<<grid, block, shared_mem, stream>>>(
123123
spatial, thresholds, cat_offsets, cell_indices, pair_left,
124124
pair_right, counts, k, l_val, blocks_per_pair, l_pad);
125+
CUDA_CHECK_LAST_ERROR(occur_count_kernel_csr_catpairs_tiled);
125126
}
126127

127128
// Dispatch to correct template specialization based on cell_tile
@@ -192,6 +193,7 @@ static inline void launch_count_pairwise(const float* spatial,
192193
dim3 block(32);
193194
occur_count_kernel_pairwise<<<grid, block, 0, stream>>>(
194195
spatial, thresholds, labels, result, n, k, l_val);
196+
CUDA_CHECK_LAST_ERROR(occur_count_kernel_pairwise);
195197
}
196198

197199
// Shared memory reduction launch
@@ -213,6 +215,7 @@ static inline bool launch_reduce_shared(const int* result, float* out, int k,
213215
static_cast<size_t>(k) * static_cast<size_t>(k + 1) * sizeof(float);
214216
occur_reduction_kernel_shared<<<grid, block, smem, stream>>>(result, out, k,
215217
l_val, format);
218+
CUDA_CHECK_LAST_ERROR(occur_reduction_kernel_shared);
216219
return true;
217220
}
218221

@@ -225,6 +228,7 @@ static inline void launch_reduce_global(const int* result, float* inter_out,
225228
size_t smem = static_cast<size_t>(k) * sizeof(float);
226229
occur_reduction_kernel_global<<<grid, block, smem, stream>>>(
227230
result, inter_out, out, k, l_val, format);
231+
CUDA_CHECK_LAST_ERROR(occur_reduction_kernel_global);
228232
}
229233

230234
template <typename Device>

src/rapids_singlecell/_cuda/edistance/edistance.cu

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -28,7 +28,7 @@ static int choose_feat_tile_64(int n_features) {
2828
static int choose_feat_tile(int n_features, size_t max_shared_bytes,
2929
int cell_tile, int dtype_size) {
3030
// Shared memory: cell_tile * feat_tile * dtype_size + warp_sums overhead
31-
size_t warp_sums_overhead = 32 * dtype_size;
31+
size_t warp_sums_overhead = WARP_SIZE * dtype_size;
3232
size_t available_shared = max_shared_bytes - warp_sums_overhead;
3333

3434
int best_tile = 32; // default minimum
@@ -109,6 +109,7 @@ static void launch_edistance_kernel(const T* embedding, const int* cat_offsets,
109109
<<<grid, block, shared_mem, stream>>>(
110110
embedding, cat_offsets, cell_indices, pair_left, pair_right,
111111
pairwise_sums, n_features, blocks_per_pair);
112+
CUDA_CHECK_LAST_ERROR(edistance_kernel);
112113
}
113114

114115
// Dispatch to correct tile size specialization for float32

0 commit comments

Comments
 (0)