Clean up kernels Templates by Intron7 · Pull Request #666 · scverse/rapids-singlecell

Intron7 · 2026-05-16T11:53:53Z

Cleans up dtype dispatching and math function calls in the kernels. Replaces division by `sqrt` with multiplication by the more performant `rsqrt`. Confirmed no speed regression versus the old implementation.

Intron7 · 2026-05-16T11:55:51Z

@coderabbitai review

coderabbitai · 2026-05-16T11:55:56Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-05-16T11:58:24Z

📝 Walkthrough

Summary by CodeRabbit

Performance & Optimization
- Optimized CUDA kernels to use unified C++ math overloads instead of type-specific functions.
- Updated normalization operations to use reciprocal square root for improved efficiency in precision-tolerant computations.

Walkthrough

This PR refactors CUDA kernels across five modules to remove manual type_traits-based float/double branching and replace it with C++ overload resolution. Additionally, three kernels switch norm computation from sqrt-based to rsqrt-based inverse-norm approaches for efficiency. Release notes document these changes.
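
The dispatch cleanup described in the walkthrough can be illustrated with a minimal host-side C++ sketch. This is not the PR's code: the function names `entropy_term_dispatch` and `entropy_term_overload` are hypothetical, and `std::log` stands in for the unqualified `log()` overloads that the device code would call.

```cpp
#include <math.h>        // global logf/log, as used by the manual dispatch
#include <cmath>         // std::log overloads for float and double
#include <type_traits>

// Old style (hypothetical name): pick the float or double function by hand.
template <typename T>
T entropy_term_dispatch(T x, T eps) {
    if constexpr (std::is_same_v<T, float>) {
        return x * logf(x + eps);
    } else {
        return x * log(x + eps);
    }
}

// New style (hypothetical name): a single call; overload resolution selects
// the float or double version from the argument type.
template <typename T>
T entropy_term_overload(T x, T eps) {
    return x * std::log(x + eps);
}
```

Both versions should produce the same value for a given `T`; the second simply removes the branch and the `type_traits` dependency.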

Changes

CUDA Math Overload Refactoring and rsqrt Optimization

| Layer / File(s) | Summary |
|---|---|
| Harmony clustering math overload simplification `src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh` | Removed `type_traits` include. Simplified `entropy_kernel` and `diversity_kernel` to call `log()` directly on typed expressions instead of using `if constexpr` to select `logf` vs `log`. |
| Harmony penalty math overload simplification `src/rapids_singlecell/_cuda/harmony/pen/kernels_pen.cuh` | Removed `type_traits` include. Simplified `penalty_kernel` to use `pow()` and `fused_pen_norm_kernel` to use `exp()` directly instead of selecting `powf`/`expf` for `float` types. |
| Harmony L2 normalization via rsqrt `src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh` | Removed `type_traits` include. Replaced `sqrt()`-based norm computation with `rsqrt()`-based inverse norm, adding a maximum clamp on the reciprocal normalization factor. |
| NN descent cosine distance via rsqrt `src/rapids_singlecell/_cuda/nn_descent/kernels_dist.cuh` | Changed `compute_distances_cosine_kernel` from computing norms via `sqrtf` and dividing by the computed denominator to computing inverse norms `inv_norm_i1` and `inv_norm_i2` via `rsqrtf` with zero guards, updating the output formula to `1.0f - dot * inv_norm_i1 * inv_norm_i2`. |
| PR residual normalization via rsqrt `src/rapids_singlecell/_cuda/pr/kernels_pr.cuh` | Changed `dense_norm_res_kernel` to scale residual `r` using `rsqrt()` multiplication instead of `sqrt()` division before clamping and assignment. |
| Release notes documentation `docs/release-notes/0.15.1.md` | Added entry documenting CUDA kernel math changes: C++ overload usage and 1/sqrt-to-rsqrt optimization for precision-tolerant cases. |
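
The cosine-distance formula named in the table, `1.0f - dot * inv_norm_i1 * inv_norm_i2`, can be sketched on the host as follows. This is an illustration under assumptions, not the kernel itself: `host_rsqrtf` is a hypothetical stand-in for the device `rsqrtf`, and the zero guard shown here (inverse norm set to 0 for a zero vector, giving distance 1) is one plausible reading of "zero guards" rather than the confirmed behavior.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical host stand-in for CUDA's rsqrtf; not the device function.
inline float host_rsqrtf(float x) { return 1.0f / std::sqrt(x); }

// Cosine distance between two length-n vectors via inverse norms.
// Assumption: a zero-norm vector gets an inverse norm of 0, so its
// distance to any other vector comes out as exactly 1.
inline float cosine_distance(const float* a, const float* b, std::size_t n) {
    float dot = 0.0f, sq_a = 0.0f, sq_b = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        dot  += a[i] * b[i];
        sq_a += a[i] * a[i];
        sq_b += b[i] * b[i];
    }
    float inv_norm_a = sq_a > 0.0f ? host_rsqrtf(sq_a) : 0.0f;
    float inv_norm_b = sq_b > 0.0f ? host_rsqrtf(sq_b) : 0.0f;
    // Two multiplications replace one division by sqrt(sq_a) * sqrt(sq_b).
    return 1.0f - dot * inv_norm_a * inv_norm_b;
}
```

Without the guard, a zero vector would give `0 * inf`, which is NaN, so some form of guard is needed once the division is replaced by a multiplication.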

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Clean up kernels Templates' is directly related to the main changes, which refactor kernel code to remove dtype dispatching and simplify math function calls across multiple CUDA kernel files. |
| Description check | ✅ Passed | The description accurately describes the changeset: cleaning up dtype dispatching and functions, replacing division by sqrt with multiplication by rsqrt, and confirming no performance regression. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh (1)
44-46: 💤 Low value

Consider extracting magic numbers to named constants.

The threshold T(1e12) should be defined as a named constant for clarity and maintainability. This also applies to line 97's T(1e-12) in the L1 kernel (though unchanged in this PR).

The numerical approach is sound: rsqrt(0) returns +inf, which is then clamped to prevent overflow during scaling.
Suggested improvement
+constexpr float MAX_INV_NORM = 1e12f;
+
 template <typename T>
 __global__ void l2_row_normalize_kernel(const T* __restrict__ src,
 ...
         if (threadIdx.x == 0) {
             T inv_norm = rsqrt(val);
-            if (inv_norm > T(1e12)) inv_norm = T(1e12);
+            if (inv_norm > T(MAX_INV_NORM)) inv_norm = T(MAX_INV_NORM);
             warp_sums[0] = inv_norm;
         }
As per coding guidelines: "All numeric literals for... heuristic thresholds MUST be defined as named constants."
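
The clamping behavior discussed in this comment can be checked with a small host-side sketch. It assumes `val` holds the row's sum of squares; `kMaxInvNorm` and `host_rsqrt` are hypothetical names, with `host_rsqrt` standing in for the device `rsqrt()`.

```cpp
#include <cmath>

// Hypothetical named constant replacing the raw T(1e12) literal.
template <typename T>
constexpr T kMaxInvNorm = T(1e12);

// Hypothetical host stand-in for the device rsqrt().
// Under IEEE 754 arithmetic, 1/sqrt(0) evaluates to +inf.
template <typename T>
T host_rsqrt(T x) { return T(1) / std::sqrt(x); }

// Returns the factor each element of the row would be multiplied by.
template <typename T>
T clamped_inv_norm(T sum_of_squares) {
    T inv_norm = host_rsqrt(sum_of_squares);
    // +inf compares greater than the cap, so an all-zero row is clamped.
    if (inv_norm > kMaxInvNorm<T>) inv_norm = kMaxInvNorm<T>;
    return inv_norm;
}
```

For an all-zero row the factor becomes the cap, and multiplying zeros by a finite cap leaves them at zero instead of producing NaN.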
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh` around
lines 44 - 46, Extract the magic literal T(1e12) into a named constant and use
it in place of the literal in the normalize kernel: define a descriptive
constexpr (e.g., INF_INV_NORM_CLAMP) and replace the T(1e12) occurrences around
inv_norm/warp_sums[0]; also add a similarly named constant for the L1 kernel's
T(1e-12) (e.g., L1_EPS) and replace that literal where used so both thresholds
are clear and maintainable (refer to symbols inv_norm, warp_sums, and the L1
kernel constant usage).
src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh (1)
46-46: ⚡ Quick win

Extract the entropy epsilon into a named constant.

Line 46 still hard-codes the stabilization threshold, which makes tuning harder and keeps this kernel out of guideline compliance.
♻️ Proposed cleanup
 template <typename T>
 __global__ void entropy_kernel(const T* __restrict__ R, T sigma, int n_cells,
                                int n_clusters, T* __restrict__ obj_out) {
+    constexpr T kEntropyLogEps = T(1e-12);
     int row = blockIdx.x;
     if (row >= n_cells) return;
@@
-        entropy += x * log(x + T(1e-12));
+        entropy += x * log(x + kEntropyLogEps);
     }
As per coding guidelines: "All numeric literals for block sizes, tile dimensions, shared memory sizes, and heuristic thresholds MUST be defined as named constants (e.g., constexpr int BLOCK_SIZE = 256;), not raw numbers."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh` at
line 46, The entropy stabilization literal 1e-12 is hard-coded in the expression
"entropy += x * log(x + T(1e-12));" — define a named constant (e.g., constexpr
auto ENTROPY_EPS = T(1e-12) or static constexpr T ENTROPY_EPS = T(1e-12)) near
the top of the kernel file or inside the same translation unit and replace the
literal with ENTROPY_EPS so the line becomes "entropy += x * log(x +
ENTROPY_EPS);" to satisfy the guideline for named heuristic thresholds.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/rapids_singlecell/_cuda/pr/kernels_pr.cuh`:
- Around line 76-82: The computation can produce NaN when mu == 0 because
rsqrt(0) is +inf and 0 * +inf -> NaN; update the block computing r (using
symbols mu, X[res_index], rsqrt, inv_theta, residuals, clip, sums_genes,
sums_cells, inv_inv_sum_total) to guard against denom == 0: compute denom = mu +
mu*mu*inv_theta and if denom is zero (or extremely small) set r = 0 (or clamp to
safe value) instead of calling rsqrt; otherwise compute r = (X[res_index] - mu)
* rsqrt(denom) and then apply the existing clip logic so residuals[res_index]
never becomes NaN/Inf.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a22eb85-f698-4bcd-a4f4-c21645d06479

📥 Commits

Reviewing files that changed from the base of the PR and between 1d92126 and d7adbbe.

📒 Files selected for processing (6)

docs/release-notes/0.15.1.md
src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh
src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh
src/rapids_singlecell/_cuda/harmony/pen/kernels_pen.cuh
src/rapids_singlecell/_cuda/nn_descent/kernels_dist.cuh
src/rapids_singlecell/_cuda/pr/kernels_pr.cuh

coderabbitai · 2026-05-16T11:58:27Z

    T mu = sums_genes[gene] * sums_cells[cell] * inv_inv_sum_total;
    long long res_index = static_cast<long long>(cell) * n_genes + gene;
    T r = X[res_index] - mu;
-    r /= sqrt(mu + mu * mu * inv_theta);
+    r *= rsqrt(mu + mu * mu * inv_theta);
    if (r < -clip) r = -clip;
    if (r > clip) r = clip;
    residuals[res_index] = r;


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Potential NaN when mu = 0 and X = mu.

When mu = 0 (e.g., a gene or cell with zero sum), rsqrt(0) = +inf. If X[res_index] is also zero, then r = 0 and 0 * inf = NaN. The NaN is not clamped by lines 80-81, because ordered comparisons involving NaN always evaluate to false.

This is not a regression from the rsqrt change (the original 0 / sqrt(0) would also produce NaN), but a guard is worth considering for robustness.

Suggested guard

     T mu = sums_genes[gene] * sums_cells[cell] * inv_inv_sum_total;
     long long res_index = static_cast<long long>(cell) * n_genes + gene;
     T r = X[res_index] - mu;
-    r *= rsqrt(mu + mu * mu * inv_theta);
+    T var = mu + mu * mu * inv_theta;
+    if (var > T(0)) r *= rsqrt(var);
     if (r < -clip) r = -clip;
     if (r > clip) r = clip;
     residuals[res_index] = r;

As per coding guidelines: "add epsilon checks before division, handle... NaN/Inf in input data."
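
The failure mode and the guard can be reproduced on the host with a minimal sketch. The function names `residual_unguarded` and `residual_guarded` are hypothetical, and `T(1) / std::sqrt(...)` stands in for the device `rsqrt()`; the arithmetic otherwise mirrors the lines quoted in this comment.

```cpp
#include <cmath>

// Unguarded form (hypothetical name), mirroring the kernel arithmetic.
template <typename T>
T residual_unguarded(T x, T mu, T inv_theta, T clip) {
    T r = x - mu;
    r *= T(1) / std::sqrt(mu + mu * mu * inv_theta);  // +inf when mu == 0
    if (r < -clip) r = -clip;  // evaluates to false when r is NaN
    if (r > clip) r = clip;    // evaluates to false when r is NaN
    return r;
}

// Guarded form (hypothetical name), following the suggested change:
// skip the scaling when the variance term is not positive.
template <typename T>
T residual_guarded(T x, T mu, T inv_theta, T clip) {
    T r = x - mu;
    T var = mu + mu * mu * inv_theta;
    if (var > T(0)) r *= T(1) / std::sqrt(var);
    if (r < -clip) r = -clip;
    if (r > clip) r = clip;
    return r;
}
```

With `x = 0` and `mu = 0`, the unguarded version returns NaN and the clip has no effect, while the guarded version returns 0.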


codecov-commenter · 2026-05-16T12:13:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.64%. Comparing base (1d92126) to head (d7adbbe).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #666   +/-   ##
=======================================
  Coverage   88.64%   88.64%           
=======================================
  Files          98       98           
  Lines        7361     7361           
=======================================
  Hits         6525     6525           
  Misses        836      836

Intron7 added 2 commits May 16, 2026 13:52

Clean up kernels Templates

37eb0fe

add release note

d7adbbe

Intron7 added the run-gpu-ci label May 16, 2026

github-actions Bot removed the run-gpu-ci label May 16, 2026

coderabbitai Bot reviewed May 16, 2026

View reviewed changes

Intron7 merged commit b7e6577 into main May 16, 2026
19 of 26 checks passed

Intron7 deleted the clean-up-kernels-templates branch May 16, 2026 12:15
