JIT LTO Cagra Search#1807
Conversation
|
|
||
| using args_t = typename dataset_descriptor_base_t<data_t, index_t, distance_t>::args_t; | ||
| template __device__ distance_t | ||
| apply_normalization_standard<@team_size@, @dataset_block_dim@, data_t, index_t, distance_t, query_t>(distance_t, |
There was a problem hiding this comment.
There's probably room to turn this into an adapter function, remove team_size and dataset_block_dim from the signature, and thus shrink down whatever calls it, but I'm happy to do that in a follow-up.
There was a problem hiding this comment.
This is not too large of a concern, it is not part of the main kernel. It is linked to a device function (that links to the main kernel) that already does not have these templates.
|
On the whole, I love this. The one other overarching comment I'll give is that there are lots of small changes that seem to be unrelated to the purpose of the PR - comments and blank lines added, etc. Unless there's a good reason for adding them, I think we should try to keep the diff as minimal as possible - this is already a huge PR as it is. |
KyleFromNVIDIA
left a comment
There was a problem hiding this comment.
There's still a few minor stylistic updates I'd like to make, but I'll do them myself in a follow-up PR. I don't want to hold this up any longer.
|
/ok to test 6cb3e04 |
dantegd
left a comment
There was a problem hiding this comment.
Had some more questions but nothing major
| const uint32_t query_id_offset = bf.query_id_offset; | ||
|
|
||
| // set kernel launch parameters | ||
| dim3 gs = calc_coop_grid_size(block_size, smem_size, persistent_device_usage); |
There was a problem hiding this comment.
Wait, am I reading this right that we never call cudaFuncSetAttribute(..., cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size) for the persistent kernel anymore? With that gone:
- calc_coop_grid_size calls cudaOccupancyMaxActiveBlocksPerMultiprocessor(launcher->get_kernel(), block_size, smem_size) against the default 48 KB cap, so as soon as we have a config with smem_size > 48 KB (high itopk_size × bitonic merge buffers, or VPQ on dataset_block_dim=512), the occupancy answer is going to be wrong.
- And the actual dispatch_cooperative at line 606 should fail with cudaErrorInvalidValue for those same configs.
Was this intentional, or did it get lost when this path moved to JIT? If it's the latter, I think the easiest fix is just adding RAFT_CUDA_TRY(cudaFuncSetAttribute(launcher->get_kernel(), cudaFuncAttributeMaxDynamicSharedMemorySize, smem_size)); right before calc_coop_grid_size here, or running the launch through a cooperative-aware variant of safely_launch_kernel_with_smem_size. Also — would it be possible to add a regression test with smem_size > 48 KB (something like itopk_size=512, search_width=4, VPQ on dim=512)? I think the existing tests all stay under the cap, which is why nothing is catching this.
There was a problem hiding this comment.
The original PR #1771 that introduced it only did it for single_cta non-persistent and multi_cta. I wish to not deviate from main as much as possible.
| // The dispatch mechanism uses void* pointers, so parameter sizes must match exactly | ||
| const uint32_t ldr_u32 = static_cast<uint32_t>(ldr); | ||
|
|
||
| launcher->dispatch<random_pickup_kernel_func_t<DataT, IndexT, DistanceT>>( |
There was a problem hiding this comment.
Related to the comment above, the multi-CTA search launcher and the single-CTA non-persistent path both go through safely_launch_kernel_with_smem_size, but these three helpers just call launcher->dispatch<…>(…, dataset_desc.smem_ws_size_in_bytes, …) directly.
I think today these workspaces stay under 48 KB, so it's not broken. Any reason not to wrap these dispatches in safely_launch_kernel_with_smem_size for symmetry and potentially future proofing?
| return uint64_t(graph.data_handle()) ^ uint64_t(source_indices_ptr) ^ | ||
| dataset_desc.get().team_size ^ num_itopk_candidates ^ block_size ^ smem_size ^ | ||
| hash_bitlen ^ small_hash_reset_interval ^ num_random_samplings ^ rand_xor_mask ^ | ||
| num_seeds ^ itopk_size ^ search_width ^ min_iterations ^ max_iterations ^ | ||
| uint64_t(persistent_lifetime * 1000) ^ uint64_t(persistent_device_usage * 1000); | ||
| } |
There was a problem hiding this comment.
A few things about this hash that I'd like to think through with you:
- Pure XOR is commutative, so I'm pretty sure (itopk_size=64, search_width=128) and (itopk_size=128, search_width=64) collide today, and that's the kind of swap that probably does happen across calls. With persistent kernels a collision means we silently reuse the wrong runner. Should we mix with rotations or use boost::hash_combine style?
- topk_by_bitonic_sort and bitonic_sort_and_merge_multi_warps come out of compute_launch_config and end up as JIT template parameters, but I don't see them in the hash. If compute_launch_config flips one of those when itopk_size crosses 256, won't we keep using the previous runner? Am I missing where these get folded in?
- We hash dataset_desc.team_size, but not dataset_block_dim, is_vpq, pq_bits, pq_len, or metric — and all of those are now planner inputs. Should they be in the hash too?
There was a problem hiding this comment.
It is the status quo in main currently
cuvs/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
Lines 1926 to 1931 in 93fb5dc
| explicit CagraPlannerBase(std::string entrypoint, LauncherJitCache& jit_cache) | ||
| : AlgorithmPlanner(std::move(entrypoint), jit_cache) | ||
| { | ||
| linktime_extra_options.push_back("-maxrregcount=64"); |
There was a problem hiding this comment.
Regarding this again, the option applies to the whole link unit, so apply_filter_kernel, random_pickup, compute_distance_to_child_nodes, and the search kernels themselves all run capped at 64 registers now which didn't have this cap before, no? Could we add a comment here explaining why every linked CAGRA fragment runs at 64 registers? I think a future maintainer is going to look at this and assume it's a bug.
Also, just realized that linktime_extra_options is protected state in the base class with a comment saying "derived planners may append … in their constructor body"). That's a fragile invariant where if a derived class accidentally writes to it after build() runs, it'd silently use a stale option. Would it be cleaner to take it as a constructor arg, or expose a virtual extra_link_options() hook?
There was a problem hiding this comment.
Okay, I went through this again and this option is only applied to the descriptors currently:
Lines 328 to 331 in 93fb5dc
So instead of supplying it as a link-time option, I added it as a compile-time option to the fragments that relate to descriptor usage: setup_workspace and compute_distance. This should bring us to parity now.
There was a problem hiding this comment.
I'm commenting this after the PR is merged, so just for the history: maxreggcount was only necessary due to separable compilation and thus probably should be removed from the jit-lto version. Perhaps this will improve the search perf against the original!
|
/ok to test 047ef38 |
|
/merge |
…earch Adapt the branch's multi-segment / multi-partition CAGRA additions to the new JIT-LTO kernel infrastructure landed in rapidsai#1807. After the merge, multi-partition search runs through JIT-linked fragments just like the rest of CAGRA, with parity across filter types (none, bitset, mp_bitset). - Port deleted *_kernel-inl.cuh contents into the JIT layout: device bodies in jit_lto_kernels/*_jit.cuh, .cu.in entry-points, matrix JSONs, fragment tags, planners, factory functions, host launchers, CMake registration. - Introduce mp_bitset_filter_data_t + tag_filter_mp_bitset + matching sample_filter_mp_bitset_impl so multi_partition_bitset_filter is recognized end-to-end without coupling to the standard bitset POD. - Add BitsetT template parameter to search_core so it accepts either cagra_bitset or mp_cagra_bitset without doubling instantiations. - Add CUVS_EXPORT to four C entry points that were silently hidden: cuvsRMMAsyncMemoryResourceEnable, cuvsResourcesSetWorkspacePool, cuvsCagraSearchMultiPartition, cuvsSelectK. - Update JDKProvider.java to drop the stale specific import for cudaStreamSynchronize that jextract has reshuffled into headers_h.
CUDA 13 binary size reduction from 282 MB to 257 MB (-8.86%).
Benchmark:

Apply updates from
CAGRA related PRs:
cudaFuncAttributeMaxDynamicSharedMemorySizewith thread-safety #1771JIT related PRs: