Clean up kernels Templates#666

Merged · Intron7 merged 2 commits into main from clean-up-kernels-templates · May 16, 2026

@Intron7 Intron7 commented May 16, 2026

Cleaning up dtype dispatching and functions. Switched out /sqrt for the more performant *rsqrt. Confirmed no speed regression vs old implementation.

Intron7 commented May 16, 2026

@coderabbitai review

coderabbitai Bot commented May 16, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot commented May 16, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Performance & Optimization
    • Optimized CUDA kernels to use unified C++ math overloads instead of type-specific functions.
    • Updated normalization operations to use reciprocal square root for improved efficiency in precision-tolerant computations.

Walkthrough

This PR refactors CUDA kernels across five modules to remove manual type_traits-based float/double branching and replace it with C++ overload resolution. Additionally, three kernels switch norm computation from sqrt-based to rsqrt-based inverse-norm approaches for efficiency. Release notes document these changes.
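
As a rough illustration of the dispatch cleanup described in the walkthrough, here is a minimal host-side C++ sketch. It is not the PR's code: the names `log_with_branching` and `log_with_overloads` are hypothetical, and `std::log` stands in for the overloaded `log()` available in CUDA device code.

```cpp
#include <cmath>
#include <type_traits>

// "Before" style (sketch): select the float or double function by hand
// with type_traits and if constexpr.
template <typename T>
T log_with_branching(T x) {
    if constexpr (std::is_same_v<T, float>) {
        return logf(x);               // single-precision C function
    } else {
        return std::log(x);           // double-precision path
    }
}

// "After" style (sketch): a single call; overload resolution picks the
// float or double version from the argument type, so no branching and
// no <type_traits> include are needed for the dispatch itself.
template <typename T>
T log_with_overloads(T x) {
    return std::log(x);
}
```

Both helpers return a `float` for a `float` argument and a `double` for a `double` argument; the second simply leaves the choice to the compiler.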

Changes

CUDA Math Overload Refactoring and rsqrt Optimization

| Layer / File(s) | Summary |
|---|---|
| Harmony clustering math overload simplification (`src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh`) | Removed `type_traits` include. Simplified `entropy_kernel` and `diversity_kernel` to call `log()` directly on typed expressions instead of using `if constexpr` to select `logf` vs `log`. |
| Harmony penalty math overload simplification (`src/rapids_singlecell/_cuda/harmony/pen/kernels_pen.cuh`) | Removed `type_traits` include. Simplified `penalty_kernel` to use `pow()` and `fused_pen_norm_kernel` to use `exp()` directly instead of selecting `powf`/`expf` for float types. |
| Harmony L2 normalization via rsqrt (`src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh`) | Removed `type_traits` include. Replaced `sqrt()`-based norm computation with `rsqrt()`-based inverse norm, adding a maximum clamp on the reciprocal normalization factor. |
| NN descent cosine distance via rsqrt (`src/rapids_singlecell/_cuda/nn_descent/kernels_dist.cuh`) | Changed `compute_distances_cosine_kernel` from computing norms via `sqrtf` and dividing by the computed denominator to computing inverse norms `inv_norm_i1` and `inv_norm_i2` via `rsqrtf` with zero guards, updating the output formula to `1.0f - dot * inv_norm_i1 * inv_norm_i2`. |
| PR residual normalization via rsqrt (`src/rapids_singlecell/_cuda/pr/kernels_pr.cuh`) | Changed `dense_norm_res_kernel` to scale residual `r` using `rsqrt()` multiplication instead of `sqrt()` division before clamping and assignment. |
| Release notes documentation (`docs/release-notes/0.15.1.md`) | Added entry documenting CUDA kernel math changes: C++ overload usage and 1/sqrt-to-rsqrt optimization for precision-tolerant cases. |
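
The cosine-distance row above can be sketched on the host as follows. This is only an illustration under stated assumptions: `rsqrt_host` is a hypothetical stand-in for CUDA's `rsqrtf`, `cosine_distance` is a hypothetical helper rather than the kernel itself, and the zero guard is assumed to map a zero squared norm to an inverse norm of 0 (the summary does not say what value the kernel uses).

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical host stand-in for CUDA's rsqrtf so the sketch runs on a CPU.
inline float rsqrt_host(float x) { return 1.0f / std::sqrt(x); }

// Cosine distance computed with inverse norms instead of a division.
inline float cosine_distance(const float* a, const float* b, std::size_t n) {
    float dot = 0.0f, sq_a = 0.0f, sq_b = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        dot  += a[i] * b[i];
        sq_a += a[i] * a[i];
        sq_b += b[i] * b[i];
    }
    // Assumed zero guard: a zero vector gets inverse norm 0, which avoids
    // evaluating rsqrt(0) = +inf and the 0 * inf = NaN that would follow.
    float inv_norm_a = sq_a > 0.0f ? rsqrt_host(sq_a) : 0.0f;
    float inv_norm_b = sq_b > 0.0f ? rsqrt_host(sq_b) : 0.0f;
    return 1.0f - dot * inv_norm_a * inv_norm_b;
}
```

Under that assumed guard, a zero vector is at distance 1 from every other vector; a different guard value would change that result.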

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Clean up kernels Templates' is directly related to the main changes, which refactor kernel code to remove dtype dispatching and simplify math function calls across multiple CUDA kernel files. |
| Description check | ✅ Passed | The description accurately describes the changeset: cleaning up dtype dispatching and functions, replacing division by sqrt with multiplication by rsqrt, and confirming no performance regression. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh (1)

44-46: 💤 Low value

Consider extracting magic numbers to named constants.

The threshold T(1e12) should be defined as a named constant for clarity and maintainability. This also applies to line 97's T(1e-12) in the L1 kernel (though unchanged in this PR).

The numerical approach is sound: rsqrt(0) returns +inf, which is then clamped to prevent overflow during scaling.

Suggested improvement
+constexpr double MAX_INV_NORM = 1e12;  // double, so T(MAX_INV_NORM) equals T(1e12) for both float and double
+
 template <typename T>
 __global__ void l2_row_normalize_kernel(const T* __restrict__ src,
 ...
         if (threadIdx.x == 0) {
             T inv_norm = rsqrt(val);
-            if (inv_norm > T(1e12)) inv_norm = T(1e12);
+            if (inv_norm > T(MAX_INV_NORM)) inv_norm = T(MAX_INV_NORM);
             warp_sums[0] = inv_norm;
         }

As per coding guidelines: "All numeric literals for... heuristic thresholds MUST be defined as named constants."
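
The clamp-after-rsqrt behaviour discussed in this comment can be sketched on the host as below. This is an assumption-laden illustration, not the kernel: `kMaxInvNorm`, `clamped_inv_norm`, and `l2_normalize_row` are hypothetical names, and `T(1) / std::sqrt(x)` stands in for CUDA's `rsqrt(x)`.

```cpp
#include <cmath>
#include <vector>

// Hypothetical named constant for the clamp; kept as double so that
// T(kMaxInvNorm) matches T(1e12) for both float and double.
constexpr double kMaxInvNorm = 1e12;

template <typename T>
T clamped_inv_norm(T sum_sq) {
    // Stand-in for rsqrt: yields +inf on IEEE 754 platforms when sum_sq == 0.
    T inv_norm = T(1) / std::sqrt(sum_sq);
    // The clamp turns +inf into a large finite factor before scaling.
    if (inv_norm > T(kMaxInvNorm)) inv_norm = T(kMaxInvNorm);
    return inv_norm;
}

template <typename T>
void l2_normalize_row(std::vector<T>& row) {
    T sum_sq = T(0);
    for (T v : row) sum_sq += v * v;
    const T inv_norm = clamped_inv_norm(sum_sq);
    for (T& v : row) v *= inv_norm;   // multiply instead of divide
}
```

With the clamp in place, an all-zero row is multiplied by a finite factor and stays all-zero, whereas multiplying by an unclamped `+inf` would give `0 * inf = NaN`.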

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh` around
lines 44 - 46, Extract the magic literal T(1e12) into a named constant and use
it in place of the literal in the normalize kernel: define a descriptive
constexpr (e.g., INF_INV_NORM_CLAMP) and replace the T(1e12) occurrences around
inv_norm/warp_sums[0]; also add a similarly named constant for the L1 kernel's
T(1e-12) (e.g., L1_EPS) and replace that literal where used so both thresholds
are clear and maintainable (refer to symbols inv_norm, warp_sums, and the L1
kernel constant usage).
src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh (1)

46-46: ⚡ Quick win

Extract the entropy epsilon into a named constant.

Line 46 still hard-codes the stabilization threshold, which makes tuning harder and keeps this kernel out of guideline compliance.

♻️ Proposed cleanup
 template <typename T>
 __global__ void entropy_kernel(const T* __restrict__ R, T sigma, int n_cells,
                                int n_clusters, T* __restrict__ obj_out) {
+    constexpr T kEntropyLogEps = T(1e-12);
     int row = blockIdx.x;
     if (row >= n_cells) return;
@@
-        entropy += x * log(x + T(1e-12));
+        entropy += x * log(x + kEntropyLogEps);
     }

As per coding guidelines: "All numeric literals for block sizes, tile dimensions, shared memory sizes, and heuristic thresholds MUST be defined as named constants (e.g., constexpr int BLOCK_SIZE = 256;), not raw numbers."
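
To show why the stabilization term matters and how a named constant reads in context, here is a small host-side sketch. `row_entropy_sum` is a hypothetical helper that only mirrors the accumulation `x * log(x + eps)` quoted above; it makes no claim about the rest of `entropy_kernel`.

```cpp
#include <cmath>

// Accumulates x * log(x + eps) over one row of probabilities, using a
// named constant for the epsilon as the review suggests.
template <typename T>
T row_entropy_sum(const T* probs, int n) {
    constexpr T kEntropyLogEps = T(1e-12);  // named stabilization threshold
    T entropy = T(0);
    for (int i = 0; i < n; ++i) {
        const T x = probs[i];
        // Without the epsilon, x == 0 would give 0 * log(0) = 0 * -inf = NaN.
        entropy += x * std::log(x + kEntropyLogEps);
    }
    return entropy;
}
```

For a row such as `{0, 1}` the sum stays finite and close to zero, because the zero entry contributes `0 * log(1e-12)` rather than `0 * log(0)`.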

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh` at
line 46, The entropy stabilization literal 1e-12 is hard-coded in the expression
"entropy += x * log(x + T(1e-12));" — define a named constant (e.g., constexpr
auto ENTROPY_EPS = T(1e-12) or static constexpr T ENTROPY_EPS = T(1e-12)) near
the top of the kernel file or inside the same translation unit and replace the
literal with ENTROPY_EPS so the line becomes "entropy += x * log(x +
ENTROPY_EPS);" to satisfy the guideline for named heuristic thresholds.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a22eb85-f698-4bcd-a4f4-c21645d06479

📥 Commits

Reviewing files that changed from the base of the PR and between 1d92126 and d7adbbe.

📒 Files selected for processing (6)
  • docs/release-notes/0.15.1.md
  • src/rapids_singlecell/_cuda/harmony/clustering/kernels_clustering.cuh
  • src/rapids_singlecell/_cuda/harmony/normalize/kernels_normalize.cuh
  • src/rapids_singlecell/_cuda/harmony/pen/kernels_pen.cuh
  • src/rapids_singlecell/_cuda/nn_descent/kernels_dist.cuh
  • src/rapids_singlecell/_cuda/pr/kernels_pr.cuh

Comment on lines 76 to 82
     T mu = sums_genes[gene] * sums_cells[cell] * inv_inv_sum_total;
     long long res_index = static_cast<long long>(cell) * n_genes + gene;
     T r = X[res_index] - mu;
-    r /= sqrt(mu + mu * mu * inv_theta);
+    r *= rsqrt(mu + mu * mu * inv_theta);
     if (r < -clip) r = -clip;
     if (r > clip) r = clip;
     residuals[res_index] = r;
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Potential NaN when mu = 0 and X = mu.

When mu = 0 (e.g., gene or cell with zero sum), rsqrt(0) = +inf. If X[res_index] also equals zero, then r = 0 and 0 * inf = NaN. NaN values will not be clamped by lines 80-81 (since NaN comparisons are always false).

This is not a regression from the rsqrt change (the original / sqrt(0) would also produce NaN), but worth considering a guard for robustness.

Suggested guard
     T mu = sums_genes[gene] * sums_cells[cell] * inv_inv_sum_total;
     long long res_index = static_cast<long long>(cell) * n_genes + gene;
     T r = X[res_index] - mu;
-    r *= rsqrt(mu + mu * mu * inv_theta);
+    T var = mu + mu * mu * inv_theta;
+    if (var > T(0)) r *= rsqrt(var);
     if (r < -clip) r = -clip;
     if (r > clip) r = clip;
     residuals[res_index] = r;

As per coding guidelines: "add epsilon checks before division, handle... NaN/Inf in input data."
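
The NaN path and the effect of the suggested guard can be reproduced with a short host-side sketch. The function names `residual_unguarded` and `residual_guarded` are hypothetical, and `T(1) / std::sqrt(var)` stands in for CUDA's `rsqrt(var)`; the sketch mirrors only the arithmetic quoted in this comment.

```cpp
#include <cmath>

// Mirrors the merged arithmetic: scale by the reciprocal square root,
// then clip. With mu == 0 and x == 0 this computes 0 * inf = NaN.
template <typename T>
T residual_unguarded(T x, T mu, T inv_theta, T clip) {
    T r = x - mu;
    r *= T(1) / std::sqrt(mu + mu * mu * inv_theta);  // stand-in for rsqrt
    if (r < -clip) r = -clip;   // false for NaN
    if (r > clip) r = clip;     // false for NaN, so NaN passes through
    return r;
}

// Mirrors the suggested guard: skip the scaling when the variance is zero.
template <typename T>
T residual_guarded(T x, T mu, T inv_theta, T clip) {
    T r = x - mu;
    const T var = mu + mu * mu * inv_theta;
    if (var > T(0)) r *= T(1) / std::sqrt(var);
    if (r < -clip) r = -clip;
    if (r > clip) r = clip;
    return r;
}
```

Calling both with `x = 0` and `mu = 0` shows the difference: the unguarded version returns NaN, while the guarded version returns 0. For `mu > 0` the two agree.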

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/rapids_singlecell/_cuda/pr/kernels_pr.cuh` around lines 76 - 82, The
computation can produce NaN when mu == 0 because rsqrt(0) is +inf and 0 * +inf
-> NaN; update the block computing r (using symbols mu, X[res_index], rsqrt,
inv_theta, residuals, clip, sums_genes, sums_cells, inv_inv_sum_total) to guard
against denom == 0: compute denom = mu + mu*mu*inv_theta and if denom is zero
(or extremely small) set r = 0 (or clamp to safe value) instead of calling
rsqrt; otherwise compute r = (X[res_index] - mu) * rsqrt(denom) and then apply
the existing clip logic so residuals[res_index] never becomes NaN/Inf.

codecov-commenter commented May 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.64%. Comparing base (1d92126) to head (d7adbbe).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #666   +/-   ##
=======================================
  Coverage   88.64%   88.64%           
=======================================
  Files          98       98           
  Lines        7361     7361           
=======================================
  Hits         6525     6525           
  Misses        836      836           

@Intron7 Intron7 merged commit b7e6577 into main May 16, 2026
19 of 26 checks passed
@Intron7 Intron7 deleted the clean-up-kernels-templates branch May 16, 2026 12:15