Track tensor and index lifetimes in the Rust DLPack bindings#2245
Track tensor and index lifetimes in the Rust DLPack bindings#2245yan-zaretskiy wants to merge 2 commits into
Conversation
The previous `ManagedTensor` type was a non-owning wrapper over the C's FFI `DLManagedTensor` type with no lifetime attached. Hence there was no mechanism to tie the tensor data and shape/stride metadata it referenced to their owners. Indexes that keep a non-owning view of their dataset (CAGRA, brute-force) could outlive that data. Here we replace it with lifetime-parameterized `DLTensorView` and `DLTensorViewMut` views. They are produced by the public `IntoDlTensor` and `IntoDlTensorMut` traits. Users are now expected to implement these traits on their tensor types, so that our API can accept them as input/output arguments.
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Since this is a fairly large change, I would like to explain my thought process to justify the proposed design. First of all, let's identify what we're trying to fix. The original The second problem is in the index types. CAGRA and brute-force don't copy the dataset at build time, because the underlying C++ implementation keeps a non-owning view and reads the original vectors during search, so the dataset has to outlive the index. The old API doesn't model that. The purpose of this PR is therefore to enable tracking two things: the lifetime of the tensor data, and the lifetime of the indexes that borrow it. To fix the first issue, we add lifetimes to the tensor view types. Moreover, we make a distinction between mutable and immutable views: Users convert their own tensor types into these views by implementing a pair of new public traits, The path from pub struct DLTensorView<'a> {
data: *mut std::ffi::c_void,
device: ffi::DLDevice,
dtype: ffi::DLDataType,
shape: TensorDims,
strides: Option<TensorDims>,
_marker: PhantomData<&'a ()>,
}where The obvious choice is to store a There is a known way to make the embedding sound: store So rather than embed the C struct, we build it only at a point we're about to call into C: pub(crate) struct ManagedTensorRef<'a> {
inner: ffi::DLManagedTensor,
_borrow: PhantomData<&'a ()>,
}
impl<'a> DLTensorView<'a> {
// Fills `inner` with `shape: self.shape.as_ptr() as *mut _`, etc.
pub(crate) fn to_c(&self) -> ManagedTensorRef<'_> { /* ... */ }
}Here As for the second fix, the index types that borrow their dataset now also get a lifetime. CAGRA and brute-force now return One special case here is indexes loaded from disk. An index produced by |
📝 WalkthroughSummary by CodeRabbit
WalkthroughReplaces the owning ChangesDLPack View API Refactor and Consumer Migration
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
rust/cuvs/src/cagra/index.rs (1)
22-23: ⚡ Quick winFix unresolved rustdoc link to
Index::merge.Line 22 links to
[Index::merge], but this type currently does not expose amergemethod in this module. That can produce a broken intra-doc link warning in docs builds.Suggested doc tweak
-/// [`Index::merge`], the data is self-contained and the lifetime is +/// a merged index, the data is self-contained and the lifetime is🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/cuvs/src/cagra/index.rs` around lines 22 - 23, The rustdoc link in the documentation comment references `Index::merge`, but the Index type does not expose this method, creating a broken intra-doc link warning. Remove the intra-doc link syntax around `Index::merge` in the doc comment (change `[`Index::merge`]` to just `Index::merge` or remove the reference entirely if it's not needed for clarity), or replace it with a reference to an actual method that exists on the Index type.rust/cuvs/examples/cagra.rs (1)
117-137: Add size validation guard toto_hostbefore copying device data to host array.The
to_hostmethod copiesself.bytesto the destination host array without verifying that the caller-provided buffer has matching capacity. This safe wrapper could silently perform an out-of-bounds write if the destination array has a different element count than the source.Add a check to ensure the destination buffer byte size matches:
Suggested guard
fn to_host<D>(&self, res: &Resources, host: &mut ndarray::ArrayRef<T, D>) -> ExampleResult<()> where D: ndarray::Dimension, { if !host.is_standard_layout() { return Err("host array must be contiguous (row-major)".into()); } + let host_bytes = host + .len() + .checked_mul(std::mem::size_of::<T>()) + .ok_or("host array size overflow")?; + if host_bytes != self.bytes { + return Err(format!( + "host buffer size mismatch: expected {} bytes, got {} bytes", + self.bytes, host_bytes + ) + .into()); + } let stream = res.get_cuda_stream()?; check_cuda(unsafe { cudaMemcpyAsync( host.as_mut_ptr() as *mut c_void,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/cuvs/examples/cagra.rs` around lines 117 - 137, The to_host method performs an unsafe memory copy via cudaMemcpyAsync without validating that the destination host array has sufficient capacity to receive self.bytes. Add a size validation check after verifying the host array is contiguous but before calling cudaMemcpyAsync to ensure the host array's byte capacity matches self.bytes, returning an appropriate error if the sizes do not match. This prevents potential out-of-bounds writes to the destination buffer.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rust/cuvs/Cargo.toml`:
- Line 18: The tinyvec dependency in rust/cuvs/Cargo.toml includes the
latest_stable_rust feature without an explicit rust-version declaration in the
workspace package metadata, which can cause silent breakage when new Rust stable
versions are released. Fix this by either adding a rust-version field to the
[workspace.package] section to define an explicit minimum supported Rust version
that aligns with your project requirements, or remove the latest_stable_rust
feature from the tinyvec dependency configuration and replace it with a stable
feature set that is compatible with your MSRV.
In `@rust/cuvs/src/test_utils.rs`:
- Around line 21-27: The DeviceTensor struct stores a raw ffi::cuvsResources_t
handle without tying its lifetime to the underlying Resources object, allowing
DeviceTensor to outlive Resources and cause a use-after-free when Drop calls
cuvsRMMFree with an invalid handle. Add a lifetime parameter to the DeviceTensor
struct definition (around line 21) and bind the resources field to that lifetime
by storing a reference instead of a raw pointer. Update the Drop implementation
(around lines 30-43) to work with the borrowed reference instead of the raw
handle. Ensure all construction sites of DeviceTensor (around lines 116-120)
properly pass a reference with the correct lifetime to guarantee the resource
stays valid for the entire lifetime of the DeviceTensor instance.
---
Nitpick comments:
In `@rust/cuvs/examples/cagra.rs`:
- Around line 117-137: The to_host method performs an unsafe memory copy via
cudaMemcpyAsync without validating that the destination host array has
sufficient capacity to receive self.bytes. Add a size validation check after
verifying the host array is contiguous but before calling cudaMemcpyAsync to
ensure the host array's byte capacity matches self.bytes, returning an
appropriate error if the sizes do not match. This prevents potential
out-of-bounds writes to the destination buffer.
In `@rust/cuvs/src/cagra/index.rs`:
- Around line 22-23: The rustdoc link in the documentation comment references
`Index::merge`, but the Index type does not expose this method, creating a
broken intra-doc link warning. Remove the intra-doc link syntax around
`Index::merge` in the doc comment (change `[`Index::merge`]` to just
`Index::merge` or remove the reference entirely if it's not needed for clarity),
or replace it with a reference to an actual method that exists on the Index
type.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1e292a2b-8b54-4ea0-84e8-da899f2498fb
📒 Files selected for processing (18)
rust/cuvs/Cargo.tomlrust/cuvs/examples/cagra.rsrust/cuvs/src/brute_force.rsrust/cuvs/src/cagra/index.rsrust/cuvs/src/cagra/mod.rsrust/cuvs/src/cluster/kmeans/mod.rsrust/cuvs/src/distance/mod.rsrust/cuvs/src/dlpack.rsrust/cuvs/src/error.rsrust/cuvs/src/ivf_flat/index.rsrust/cuvs/src/ivf_flat/mod.rsrust/cuvs/src/ivf_pq/index.rsrust/cuvs/src/ivf_pq/mod.rsrust/cuvs/src/lib.rsrust/cuvs/src/resources.rsrust/cuvs/src/test_utils.rsrust/cuvs/src/vamana/index.rsrust/cuvs/src/vamana/mod.rs
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rust/cuvs/src/brute_force.rs (1)
147-148:⚠️ Potential issue | 🟠 MajorChange
neighborstype fromi64tou32to match FFI contract.The FFI documentation for
cuvsBruteForceSearchexplicitly requiresneighborsto have typekDLUIntwith 32 bits (i.e.,u32). Usingi64violates this contract and risks silent data corruption if the C implementation writes 32-bit values to a 64-bit buffer. CAGRA tests correctly useu32.Update both allocations:
Suggested changes
- let mut neighbors_host = ndarray::Array::<i64, _>::zeros((n_queries, k)); - let mut neighbors = DeviceTensor::<i64>::zeros(&res, &[n_queries, k]).unwrap(); + let mut neighbors_host = ndarray::Array::<u32, _>::zeros((n_queries, k)); + let mut neighbors = DeviceTensor::<u32>::zeros(&res, &[n_queries, k]).unwrap();Also update any assertions that compare or cast
neighborsvalues accordingly.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/cuvs/src/brute_force.rs` around lines 147 - 148, The `neighbors` and `neighbors_host` tensor declarations are using `i64` type, but the FFI contract for `cuvsBruteForceSearch` requires `u32`. Update the type parameter from `i64` to `u32` in both the `neighbors_host` ndarray allocation and the `neighbors` DeviceTensor allocation. Additionally, search the file for any assertions, comparisons, or casts involving `neighbors` values and update them to work correctly with the new `u32` type instead of `i64`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@rust/cuvs/src/brute_force.rs`:
- Around line 147-148: The `neighbors` and `neighbors_host` tensor declarations
are using `i64` type, but the FFI contract for `cuvsBruteForceSearch` requires
`u32`. Update the type parameter from `i64` to `u32` in both the
`neighbors_host` ndarray allocation and the `neighbors` DeviceTensor allocation.
Additionally, search the file for any assertions, comparisons, or casts
involving `neighbors` values and update them to work correctly with the new
`u32` type instead of `i64`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e9417b5c-73db-45d8-99a4-ac58de5132e3
📒 Files selected for processing (6)
rust/cuvs/Cargo.tomlrust/cuvs/src/brute_force.rsrust/cuvs/src/cagra/index.rsrust/cuvs/src/cluster/kmeans/mod.rsrust/cuvs/src/distance/mod.rsrust/cuvs/src/test_utils.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- rust/cuvs/src/distance/mod.rs
- rust/cuvs/src/test_utils.rs
- rust/cuvs/src/cluster/kmeans/mod.rs
The previous
ManagedTensortype was a non-owning wrapper over the C's FFIDLManagedTensortype with no lifetime attached. Hence there was no mechanism to tie the tensor data and shape/stride metadata it referenced to their owners. Indexes that keep a non-owning view of their dataset (CAGRA, brute-force) could outlive that data. Here we replace it with lifetime-parameterizedDLTensorViewandDLTensorViewMutviews. They are produced by the publicIntoDlTensorandIntoDlTensorMuttraits. Users are now expected to implement these traits on their tensor types, so that our API can accept them as input/output arguments.