Clean up kernels Templates#666
Conversation
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughSummary by CodeRabbit
WalkthroughThis PR refactors CUDA kernels across five modules to remove manual ChangesCUDA Math Overload Refactoring and rsqrt Optimization
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh (1)
44-46: 💤 Low valueConsider extracting magic numbers to named constants.
The threshold
T(1e12)should be defined as a named constant for clarity and maintainability. This also applies to line 97'sT(1e-12)in the L1 kernel (though unchanged in this PR).The numerical approach is sound:
rsqrt(0)returns+inf, which is then clamped to prevent overflow during scaling.Suggested improvement
+constexpr float MAX_INV_NORM = 1e12f; + template <typename T> __global__ void l2_row_normalize_kernel(const T* __restrict__ src, ... if (threadIdx.x == 0) { T inv_norm = rsqrt(val); - if (inv_norm > T(1e12)) inv_norm = T(1e12); + if (inv_norm > T(MAX_INV_NORM)) inv_norm = T(MAX_INV_NORM); warp_sums[0] = inv_norm; }As per coding guidelines: "All numeric literals for... heuristic thresholds MUST be defined as named constants."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh` around lines 44 - 46, Extract the magic literal T(1e12) into a named constant and use it in place of the literal in the normalize kernel: define a descriptive constexpr (e.g., INF_INV_NORM_CLAMP) and replace the T(1e12) occurrences around inv_norm/warp_sums[0]; also add a similarly named constant for the L1 kernel's T(1e-12) (e.g., L1_EPS) and replace that literal where used so both thresholds are clear and maintainable (refer to symbols inv_norm, warp_sums, and the L1 kernel constant usage).src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh (1)
46-46: ⚡ Quick winExtract the entropy epsilon into a named constant.
Line 46 still hard-codes the stabilization threshold, which makes tuning harder and keeps this kernel out of guideline compliance.
♻️ Proposed cleanup
template <typename T> __global__ void entropy_kernel(const T* __restrict__ R, T sigma, int n_cells, int n_clusters, T* __restrict__ obj_out) { + constexpr T kEntropyLogEps = T(1e-12); int row = blockIdx.x; if (row >= n_cells) return; @@ - entropy += x * log(x + T(1e-12)); + entropy += x * log(x + kEntropyLogEps); }As per coding guidelines,
All numeric literals for block sizes, tile dimensions, shared memory sizes, and heuristic thresholds MUST be defined as named constants (e.g., constexpr int BLOCK_SIZE = 256;), not raw numbers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh` at line 46, The entropy stabilization literal 1e-12 is hard-coded in the expression "entropy += x * log(x + T(1e-12));" — define a named constant (e.g., constexpr auto ENTROPY_EPS = T(1e-12) or static constexpr T ENTROPY_EPS = T(1e-12)) near the top of the kernel file or inside the same translation unit and replace the literal with ENTROPY_EPS so the line becomes "entropy += x * log(x + ENTROPY_EPS);" to satisfy the guideline for named heuristic thresholds.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/rapids_singlecell/_cuda/pr/kernels_pr.cuh`:
- Around line 76-82: The computation can produce NaN when mu == 0 because
rsqrt(0) is +inf and 0 * +inf -> NaN; update the block computing r (using
symbols mu, X[res_index], rsqrt, inv_theta, residuals, clip, sums_genes,
sums_cells, inv_inv_sum_total) to guard against denom == 0: compute denom = mu +
mu*mu*inv_theta and if denom is zero (or extremely small) set r = 0 (or clamp to
safe value) instead of calling rsqrt; otherwise compute r = (X[res_index] - mu)
* rsqrt(denom) and then apply the existing clip logic so residuals[res_index]
never becomes NaN/Inf.
---
Nitpick comments:
In `@src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh`:
- Line 46: The entropy stabilization literal 1e-12 is hard-coded in the
expression "entropy += x * log(x + T(1e-12));" — define a named constant (e.g.,
constexpr auto ENTROPY_EPS = T(1e-12) or static constexpr T ENTROPY_EPS =
T(1e-12)) near the top of the kernel file or inside the same translation unit
and replace the literal with ENTROPY_EPS so the line becomes "entropy += x *
log(x + ENTROPY_EPS);" to satisfy the guideline for named heuristic thresholds.
In `@src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh`:
- Around line 44-46: Extract the magic literal T(1e12) into a named constant and
use it in place of the literal in the normalize kernel: define a descriptive
constexpr (e.g., INF_INV_NORM_CLAMP) and replace the T(1e12) occurrences around
inv_norm/warp_sums[0]; also add a similarly named constant for the L1 kernel's
T(1e-12) (e.g., L1_EPS) and replace that literal where used so both thresholds
are clear and maintainable (refer to symbols inv_norm, warp_sums, and the L1
kernel constant usage).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6a22eb85-f698-4bcd-a4f4-c21645d06479
📒 Files selected for processing (6)
docs/release-notes/0.15.1.mdsrc/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuhsrc/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuhsrc/rapids_singlecell/_cuda/harmony/pen/kernels_pen.cuhsrc/rapids_singlecell/_cuda/nn_descent/kernels_dist.cuhsrc/rapids_singlecell/_cuda/pr/kernels_pr.cuh
| T mu = sums_genes[gene] * sums_cells[cell] * inv_inv_sum_total; | ||
| long long res_index = static_cast<long long>(cell) * n_genes + gene; | ||
| T r = X[res_index] - mu; | ||
| r /= sqrt(mu + mu * mu * inv_theta); | ||
| r *= rsqrt(mu + mu * mu * inv_theta); | ||
| if (r < -clip) r = -clip; | ||
| if (r > clip) r = clip; | ||
| residuals[res_index] = r; |
There was a problem hiding this comment.
Potential NaN when mu = 0 and X = mu.
When mu = 0 (e.g., gene or cell with zero sum), rsqrt(0) = +inf. If X[res_index] also equals zero, then r = 0 and 0 * inf = NaN. NaN values will not be clamped by lines 80-81 (since NaN comparisons are always false).
This is not a regression from the rsqrt change (the original / sqrt(0) would also produce NaN), but worth considering a guard for robustness.
Suggested guard
T mu = sums_genes[gene] * sums_cells[cell] * inv_inv_sum_total;
long long res_index = static_cast<long long>(cell) * n_genes + gene;
T r = X[res_index] - mu;
- r *= rsqrt(mu + mu * mu * inv_theta);
+ T var = mu + mu * mu * inv_theta;
+ if (var > T(0)) r *= rsqrt(var);
if (r < -clip) r = -clip;
if (r > clip) r = clip;
residuals[res_index] = r;As per coding guidelines: "add epsilon checks before division, handle... NaN/Inf in input data."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| T mu = sums_genes[gene] * sums_cells[cell] * inv_inv_sum_total; | |
| long long res_index = static_cast<long long>(cell) * n_genes + gene; | |
| T r = X[res_index] - mu; | |
| r /= sqrt(mu + mu * mu * inv_theta); | |
| r *= rsqrt(mu + mu * mu * inv_theta); | |
| if (r < -clip) r = -clip; | |
| if (r > clip) r = clip; | |
| residuals[res_index] = r; | |
| T mu = sums_genes[gene] * sums_cells[cell] * inv_inv_sum_total; | |
| long long res_index = static_cast<long long>(cell) * n_genes + gene; | |
| T r = X[res_index] - mu; | |
| T var = mu + mu * mu * inv_theta; | |
| if (var > T(0)) r *= rsqrt(var); | |
| if (r < -clip) r = -clip; | |
| if (r > clip) r = clip; | |
| residuals[res_index] = r; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/rapids_singlecell/_cuda/pr/kernels_pr.cuh` around lines 76 - 82, The
computation can produce NaN when mu == 0 because rsqrt(0) is +inf and 0 * +inf
-> NaN; update the block computing r (using symbols mu, X[res_index], rsqrt,
inv_theta, residuals, clip, sums_genes, sums_cells, inv_inv_sum_total) to guard
against denom == 0: compute denom = mu + mu*mu*inv_theta and if denom is zero
(or extremely small) set r = 0 (or clamp to safe value) instead of calling
rsqrt; otherwise compute r = (X[res_index] - mu) * rsqrt(denom) and then apply
the existing clip logic so residuals[res_index] never becomes NaN/Inf.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #666 +/- ##
=======================================
Coverage 88.64% 88.64%
=======================================
Files 98 98
Lines 7361 7361
=======================================
Hits 6525 6525
Misses 836 836 |
Cleaning up dtype dispatching and functions. Switched out /sqrt for the more performant *rsqrt. Confirmed no speed regression vs old implementation.