JIT LTO Cagra Search by divyegala · Pull Request #1807 · rapidsai/cuvs

divyegala · 2026-02-15T23:43:51Z

CUDA 13 binary size reduction from 282 MB to 257 MB (-8.86%).

Benchmark:

Apply updates from
CAGRA related PRs:

JIT related PRs:

Refactor JIT LTO kernel generation #1812

KyleFromNVIDIA · 2026-04-30T14:35:11Z

+
+using args_t = typename dataset_descriptor_base_t<data_t, index_t, distance_t>::args_t;
+template __device__ distance_t
+apply_normalization_standard<@team_size@, @dataset_block_dim@, data_t, index_t, distance_t, query_t>(distance_t,


There's probably room to turn this into an adapter function, remove team_size and dataset_block_dim from the signature, and thus shrink down whatever calls it, but I'm happy to do that in a follow-up.

This is not too large a concern, since it is not part of the main kernel. It is linked to a device function (which in turn links to the main kernel) that already does not have these template parameters.

KyleFromNVIDIA · 2026-05-01T18:33:58Z

On the whole, I love this. The one other overarching comment I'll give is that there are lots of small changes that seem to be unrelated to the purpose of the PR - comments and blank lines added, etc. Unless there's a good reason for adding them, I think we should try to keep the diff as minimal as possible - this is already a huge PR as it is.

KyleFromNVIDIA

There are still a few minor stylistic updates I'd like to make, but I'll do them myself in a follow-up PR. I don't want to hold this up any longer.

copy-pr-bot · 2026-05-07T17:28:56Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

KyleFromNVIDIA · 2026-05-07T17:40:53Z

/ok to test 6cb3e04

dantegd

Had some more questions but nothing major

dantegd · 2026-05-08T14:19:21Z

+    const uint32_t query_id_offset = bf.query_id_offset;
+
+    // set kernel launch parameters
+    dim3 gs = calc_coop_grid_size(block_size, smem_size, persistent_device_usage);


Wait, am I reading this right that we never call cudaFuncSetAttribute(..., cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) for the persistent kernel anymore? With that gone:

calc_coop_grid_size calls cudaOccupancyMaxActiveBlocksPerMultiprocessor(launcher->get_kernel(), block_size, smem_size) against the default 48 KB cap, so as soon as we have a config with smem_size > 48 KB (high itopk_size × bitonic merge buffers, or VPQ on dataset_block_dim=512), the occupancy answer is going to be wrong.

And the actual dispatch_cooperative at line 606 should fail with cudaErrorInvalidValue for those same configs.

Was this intentional, or did it get lost when this path moved to JIT? If it's the latter, I think the easiest fix is just adding RAFT_CUDA_TRY(cudaFuncSetAttribute(launcher->get_kernel(), cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)); right before calc_coop_grid_size here, or running the launch through a cooperative-aware variant of safely_launch_kernel_with_smem_size. Also — would it be possible to add a regression test with smem_size > 48 KB (something like itopk_size=512, search_width=4, VPQ on dim=512)? I think the existing tests all stay under the cap, which is why nothing is catching this.

The original PR #1771 that introduced it only did it for single_cta non-persistent and multi_cta. I want to deviate from main as little as possible.

dantegd · 2026-05-08T14:23:59Z

+  // The dispatch mechanism uses void* pointers, so parameter sizes must match exactly
+  const uint32_t ldr_u32 = static_cast<uint32_t>(ldr);
+
+  launcher->dispatch<random_pickup_kernel_func_t<DataT, IndexT, DistanceT>>(


Related to the comment above, the multi-CTA search launcher and the single-CTA non-persistent path both go through safely_launch_kernel_with_smem_size, but these three helpers just call launcher->dispatch<…>(…, dataset_desc.smem_ws_size_in_bytes, …) directly.

I think today these workspaces stay under 48 KB, so it's not broken. Any reason not to wrap these dispatches in safely_launch_kernel_with_smem_size for symmetry and future-proofing?

Same as above.

dantegd · 2026-05-08T14:30:59Z

+    return uint64_t(graph.data_handle()) ^ uint64_t(source_indices_ptr) ^
+           dataset_desc.get().team_size ^ num_itopk_candidates ^ block_size ^ smem_size ^
+           hash_bitlen ^ small_hash_reset_interval ^ num_random_samplings ^ rand_xor_mask ^
+           num_seeds ^ itopk_size ^ search_width ^ min_iterations ^ max_iterations ^
+           uint64_t(persistent_lifetime * 1000) ^ uint64_t(persistent_device_usage * 1000);
+  }


A few things about this hash that I'd like to think through with you:

Pure XOR is commutative, so I'm pretty sure (itopk_size=64, search_width=128) and (itopk_size=128, search_width=64) collide today, and that's the kind of swap that probably does happen across calls. With persistent kernels a collision means we silently reuse the wrong runner. Should we mix with rotations or use boost::hash_combine style?

topk_by_bitonic_sort and bitonic_sort_and_merge_multi_warps come out of compute_launch_config and end up as JIT template parameters, but I don't see them in the hash. If compute_launch_config flips one of those when itopk_size crosses 256, won't we keep using the previous runner? Am I missing where these get folded in?

We hash dataset_desc.team_size, but not dataset_block_dim, is_vpq, pq_bits, pq_len, or metric — and all of those are now planner inputs. Should they be in the hash too?
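The commutativity concern in the first point can be checked with a small host-side sketch. This is an illustration only: `xor_fold`, `combine`, `combined_fold`, and `demo_hash_collision` are hypothetical names, the mixing step follows the boost::hash_combine style mentioned above, and the 64-bit constant is one common choice rather than anything taken from cuVS.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Pure XOR fold, mirroring the shape of the hash in the diff above.
inline std::uint64_t xor_fold(std::initializer_list<std::uint64_t> fields)
{
  std::uint64_t h = 0;
  for (std::uint64_t f : fields) { h ^= f; }
  return h;
}

// Order-sensitive mixing step in the style of boost::hash_combine.
// The golden-ratio constant is an illustrative assumption.
inline std::uint64_t combine(std::uint64_t seed, std::uint64_t value)
{
  return seed ^ (value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

inline std::uint64_t combined_fold(std::initializer_list<std::uint64_t> fields)
{
  std::uint64_t h = 0;
  for (std::uint64_t f : fields) { h = combine(h, f); }
  return h;
}

// Hypothetical demo: swapping (itopk_size, search_width) collides under
// pure XOR but not under the order-sensitive combine.
inline void demo_hash_collision()
{
  assert(xor_fold({64, 128}) == xor_fold({128, 64}));
  assert(combined_fold({64, 128}) != combined_fold({128, 64}));
}
```

An order-sensitive combine does not rule out collisions in general; it only removes the systematic ones caused by swapped or repeated field values.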

This is currently the status quo in main:

cuvs/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh

Lines 1926 to 1931 in 93fb5dc

    return uint64_t(graph.data_handle()) ^ uint64_t(source_indices_ptr) ^
           dataset_desc.get().team_size ^ num_itopk_candidates ^ block_size ^ smem_size ^
           hash_bitlen ^ small_hash_reset_interval ^ num_random_samplings ^ rand_xor_mask ^
           num_seeds ^ itopk_size ^ search_width ^ min_iterations ^ max_iterations ^
           uint64_t(persistent_lifetime * 1000) ^ uint64_t(persistent_device_usage * 1000);
  }

dantegd · 2026-05-08T14:35:25Z

+  explicit CagraPlannerBase(std::string entrypoint, LauncherJitCache& jit_cache)
+    : AlgorithmPlanner(std::move(entrypoint), jit_cache)
+  {
+    linktime_extra_options.push_back("-maxrregcount=64");


Regarding this again: the option applies to the whole link unit, so apply_filter_kernel, random_pickup, compute_distance_to_child_nodes, and the search kernels themselves are all now capped at 64 registers, which they weren't before, no? Could we add a comment here explaining why every linked CAGRA fragment runs at 64 registers? I think a future maintainer is going to look at this and assume it's a bug.

Also, I just realized that linktime_extra_options is protected state in the base class, with a comment saying "derived planners may append … in their constructor body". That's a fragile invariant: if a derived class accidentally writes to it after build() runs, it would silently use a stale option. Would it be cleaner to take it as a constructor arg, or expose a virtual extra_link_options() hook?
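To make the constructor-argument alternative concrete, here is a minimal sketch under stated assumptions. `PlannerBase`, `CagraPlanner`, and `link_options()` are hypothetical stand-ins and do not reproduce the real AlgorithmPlanner interface; the point is only that options fixed at construction cannot be appended to after a build has already consumed them.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the planner base class discussed above.
class PlannerBase {
 public:
  PlannerBase(std::string entrypoint, std::vector<std::string> extra_link_options = {})
    : entrypoint_(std::move(entrypoint)), extra_link_options_(std::move(extra_link_options))
  {
  }
  virtual ~PlannerBase() = default;

  // Read-only view: derived classes cannot mutate the list after construction.
  const std::vector<std::string>& extra_link_options() const { return extra_link_options_; }

  // Stand-in for the option list a build step would hand to the linker.
  std::vector<std::string> link_options() const
  {
    std::vector<std::string> opts{"-lto"};
    opts.insert(opts.end(), extra_link_options_.begin(), extra_link_options_.end());
    return opts;
  }

 private:
  const std::string entrypoint_;
  const std::vector<std::string> extra_link_options_;
};

// A derived planner states its extra options once, in the base initializer.
class CagraPlanner : public PlannerBase {
 public:
  explicit CagraPlanner(std::string entrypoint)
    : PlannerBase(std::move(entrypoint), std::vector<std::string>{"-maxrregcount=64"})
  {
  }
};
```

The virtual-hook variant would trade the constructor parameter for an overridable method. The constructor form has the advantage that the option list is fully known before any virtual dispatch is involved.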

Okay, I went through this again and this option is only applied to the descriptors currently:

cuvs/cpp/CMakeLists.txt

Lines 328 to 331 in 93fb5dc

set_source_files_properties(
  ${cagra_compute_distance_standard_inst_files} ${cagra_compute_distance_vpq_inst_files}
  PROPERTIES COMPILE_FLAGS -maxrregcount=64
)

So instead of supplying it as a link-time option, I added it as a compile-time option to the fragments that relate to descriptor usage: setup_workspace and compute_distance. This should bring us to parity now.

I'm commenting on this after the PR was merged, so just for the history: maxrregcount was only necessary because of separable compilation, and thus should probably be removed from the JIT-LTO version. Perhaps this will improve search performance over the original!


divyegala · 2026-05-08T20:03:27Z

/ok to test 047ef38

divyegala · 2026-05-09T21:17:08Z

/merge

…earch

Adapt the branch's multi-segment / multi-partition CAGRA additions to the new JIT-LTO kernel infrastructure landed in rapidsai#1807. After the merge, multi-partition search runs through JIT-linked fragments just like the rest of CAGRA, with parity across filter types (none, bitset, mp_bitset).

- Port deleted *_kernel-inl.cuh contents into the JIT layout: device bodies in jit_lto_kernels/*_jit.cuh, .cu.in entry-points, matrix JSONs, fragment tags, planners, factory functions, host launchers, CMake registration.
- Introduce mp_bitset_filter_data_t + tag_filter_mp_bitset + matching sample_filter_mp_bitset_impl so multi_partition_bitset_filter is recognized end-to-end without coupling to the standard bitset POD.
- Add BitsetT template parameter to search_core so it accepts either cagra_bitset or mp_cagra_bitset without doubling instantiations.
- Add CUVS_EXPORT to four C entry points that were silently hidden: cuvsRMMAsyncMemoryResourceEnable, cuvsResourcesSetWorkspacePool, cuvsCagraSearchMultiPartition, cuvsSelectK.
- Update JDKProvider.java to drop the stale specific import for cudaStreamSynchronize that jextract has reshuffled into headers_h.

divyegala added 30 commits October 5, 2025 03:50

passing tests

eb2d74b

update gitignore

d2318e8

separate out distance function from main kernel

5e6afcd

fix deps

6eee4da

add filters as jit device functions, rework caching logic

1de8f28

lto post lambda, cleanup files, generate cmake in build dir

84c6020

don't read hardcoded kernels, use generator properly

22680c8

random cmake changes carried over from 25.10

37f1163

cmake format

0ae5383

remove dep on kernel list

fe56aec

attempt to solve overlinking problem

40c8fd6

reorder if-else in compiler check

e87a8c7

Merge branch 'branch-25.12' into jit-lto-ivf-flat-interleaved

179d733

use cudart apis

32a67bd

merge

c27612e

attempt to link cudart

a4b48b1

revert cudart link, try all arch build of jit lto fatbin sources

d5d692e

cmake format

1c6dd94

missing shared mem setting

30f5ab6

separate cuda 12 and 13 compilation

9674969

merge upstream

24fc47d

remove bench

db9a487

c include directory

aa9294f

style check

2eb77fe

merge upstream

6c685fa

guard cuda calls and use shared_ptr

3e35b99

add AlgorithmPlanner to main target

d0ff62c

merge upstream

eb87577

remove nvjitlink as cuda 12 dep

445a6c4

address review

92a27d4

KyleFromNVIDIA requested changes May 1, 2026


divyegala added 10 commits May 4, 2026 18:31

attempt to fix smem launch

1d58136

dante review

dc29e56

kyle review step 1

f329a82

fix ci error

7f2fa39

Merge remote-tracking branch 'upstream/main' into cagra-search-jit-lto

6598f62

kyle review step 2

2f93c6e

kyle review step 3

030e070

attempt to fix build error

46b6ca6

Merge branch 'main' into cagra-search-jit-lto

e79ef4d

Merge branch 'main' into cagra-search-jit-lto

85ea7a3

KyleFromNVIDIA approved these changes May 7, 2026


Merge branch 'main' into cagra-search-jit-lto

6cb3e04

dantegd requested changes May 8, 2026


divyegala added 3 commits May 8, 2026 20:00

dante reviews

c96012c

Merge remote-tracking branch 'origin/cagra-search-jit-lto' into cagra-search-jit-lto

3789a50

Merge remote-tracking branch 'upstream/main' into cagra-search-jit-lto

047ef38

dantegd approved these changes May 9, 2026


rapids-bot Bot merged commit 90c18a8 into rapidsai:main May 9, 2026
161 of 166 checks passed

github-project-automation Bot moved this to Done in Unstructured Data Processing May 9, 2026

coderabbitai Bot mentioned this pull request May 11, 2026

IVF-SQ C++ API #1865

Merged

This was referenced Jun 4, 2026

Improve CAGRA-Q performance and add support for PQ_LEN=8 #1533

Open

Add JIT-LTO based filter UDF support for CAGRA #2132

Merged

coderabbitai Bot mentioned this pull request Jun 15, 2026

CAGRA Bloom Filter #2236

Open

6 participants
