
Conversation

@ldematte
Contributor

While profiling cuvs-java, we found that allocating a PinnedMemoryBuffer for each host->device or device->host memory copy was unnecessary and wasteful.
This PR moves the allocation of a PinnedMemoryBuffer to CuVSResourcesImpl, so that the buffer can be cached and reused. Since CuVSResources instances are already meant to be per-thread, this is safe: the PinnedMemoryBuffer will never be used concurrently.
To do this cleanly, we introduced two named ScopedAccess classes and a helper method that resolves the internal MemorySegment used by native functions to access the buffer, without exposing it through the public interface.
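For illustration, the access pattern could look roughly like the sketch below. The names (ScopedAccessSketch, HostScopedAccess, DeviceScopedAccess, segmentOf) are invented for this example and are not the actual classes introduced by the PR.

```java
import java.lang.foreign.MemorySegment;

// Illustrative sketch only: these names are invented for the example and are not
// the actual classes added by the PR. The type is package-private, so the raw
// MemorySegment never leaks into the public API.
interface ScopedAccessSketch {
    // Two named access types, one per copy direction.
    record HostScopedAccess(MemorySegment segment) implements ScopedAccessSketch {}
    record DeviceScopedAccess(MemorySegment segment) implements ScopedAccessSketch {}

    MemorySegment segment();

    // Helper used by the internal native call sites to reach the buffer's segment.
    static MemorySegment segmentOf(ScopedAccessSketch access) {
        return access.segment();
    }
}
```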

@copy-pr-bot

copy-pr-bot bot commented Oct 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ldematte ldematte changed the base branch from main to branch-25.12 October 21, 2025 12:25
@ldematte ldematte changed the base branch from branch-25.12 to main October 21, 2025 14:36
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Oct 22, 2025
try (var localArena = Arena.ofConfined()) {
    MemorySegment pointer = localArena.allocate(C_POINTER);
-   checkCudaError(cudaMallocHost(pointer, bufferBytes), "cudaMallocHost");
+   checkCudaError(cudaMallocHost(pointer, PinnedMemoryBuffer.CHUNK_BYTES), "cudaMallocHost");
Contributor

Please correct me if I'm wrong.

There were a couple of advantages to the previous approach that now seem to be missing:

  1. If the total number of bytes did not exceed CHUNK_BYTES, we would make a smaller allocation. By removing this check, we're guaranteeing an 8MB allocation per thread just for pinned memory. Is that wise?
  2. There was protection against a very large row size (exceeding the 8MB CHUNK_BYTES). Now we ignore the possibility. Wouldn't that mean a buffer overrun when a single row is copied? Or is that not deemed possible now?

Contributor Author

> If the total number of bytes did not exceed CHUNK_BYTES, we would make a smaller allocation. By removing this check, we're guaranteeing an 8MB allocation per thread just for pinned memory. Is that wise?

We discussed this with @achirkin and think the trade-off is worth it. It's true we are allocating 8MB per resource, but we only do so on demand, and avoiding a re-allocation on every copy (which is costly, due to the lock on the CUDA context) outweighs the extra memory.
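A minimal sketch of the on-demand part, assuming hypothetical names and with a plain Arena allocation standing in for the cudaMallocHost call:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

// Sketch only: lazy, per-resource allocation of the 8MB staging buffer.
// The real code allocates pinned memory via cudaMallocHost; a confined Arena
// allocation stands in for it here.
final class LazyPinnedBufferSketch implements AutoCloseable {
    static final long CHUNK_BYTES = 8L << 20; // 8 MB

    private final Arena arena = Arena.ofConfined();
    private MemorySegment buffer; // stays null until the first copy needs it

    MemorySegment get() {
        if (buffer == null) {
            // Paid once per resource instead of once per copy; this is where the
            // costly cudaMallocHost call (CUDA context lock) would happen.
            buffer = arena.allocate(CHUNK_BYTES);
        }
        return buffer;
    }

    @Override
    public void close() {
        arena.close(); // the real code would also call cudaFreeHost
    }
}
```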

> There was protection against a very large row size (exceeding the 8MB CHUNK_BYTES). Now we ignore the possibility. Wouldn't that mean a buffer overrun when a single row is copied? Or is that not deemed possible now?

That's true. It's very unlikely, I would say impossible in our scenarios, but the protection is missing. Let me add it back (although it will need to go in a different place).
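One possible shape for that protection is to stage the copy through the pinned buffer in chunk-sized pieces; the sketch below is only an assumption about how it could look (hypothetical names, not the change that eventually landed):

```java
import java.lang.foreign.MemorySegment;

// Sketch only: copying `src` through a fixed-size pinned staging buffer in
// chunks, so a source larger than the buffer (e.g. a very large row) cannot
// overrun it.
final class ChunkedCopySketch {
    interface DeviceSink {
        // Hypothetical callback standing in for the device-side copy
        // (e.g. a cudaMemcpyAsync on the staged bytes).
        void copyToDevice(MemorySegment staged, long bytes);
    }

    static void copyThroughPinned(MemorySegment src, MemorySegment pinned, DeviceSink sink) {
        long offset = 0;
        while (offset < src.byteSize()) {
            long len = Math.min(pinned.byteSize(), src.byteSize() - offset);
            MemorySegment.copy(src, offset, pinned, 0, len);
            sink.copyToDevice(pinned, len);
            offset += len;
        }
    }
}
```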

Contributor

I'm 👍 on the change so far.

> It's very unlikely, I would say impossible in our scenarios, but the protection is missing. Let me add it back...

I'll take another look when this changes.

Contributor Author

Yes please :) I'll ping you when I push the change!

@mythrocks
Contributor

/ok to test aa5e469

Contributor

@mythrocks mythrocks left a comment

LGTM.

@mythrocks
Contributor

I've rebased this PR and resolved a conflict.
But it looks like we're going to need to fix the copyright headers on the changed files.

@mythrocks mythrocks self-requested a review October 28, 2025 22:14
@benfred
Member

benfred commented Oct 28, 2025

/ok to test d7fcdb3

@benfred
Member

benfred commented Oct 28, 2025

/ok to test c4f0570

@benfred
Member

benfred commented Oct 29, 2025

/ok to test d792fce

@mythrocks
Contributor

/ok to test d792fce

@benfred: This PR will likely continue to fail CI until the copyright headers are fixed for this change. Plus, there's another change pending (for handling cases where a single row exceeds the pinned buffer size).

@benfred
Member

benfred commented Oct 29, 2025

@mythrocks the copyright issue should be fixed with c4f0570, but CI is still failing here with CagraMultiThreadStabilityIT.testQueryingUsingMultipleThreadsWithPrivateResources:62->testQueryingUsingMultipleThreads:162 MultiThreaded stablity test failed: null errors in the Java unit tests =(

