feat: batch KV block copies via cudaMemcpyBatchAsync in fs connector#607
Open
kfirtoledo wants to merge 1 commit into
Open
feat: batch KV block copies via cudaMemcpyBatchAsync in fs connector#607kfirtoledo wants to merge 1 commit into
kfirtoledo wants to merge 1 commit into
Conversation
Submit all per-(block, layer) copies in one driver call instead of N cudaMemcpyAsync calls. Enabled by default; toggle off with USE_BATCH_MEMCPY_READ / USE_BATCH_MEMCPY_WRITE=0. Requires CUDA 12.8+. Speeds up KV-cache offload writes/reads when per-layer DMA sizes are small enough that driver dispatch dominates. Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
Etelis
reviewed
May 28, 2026
| &attrs_idx, | ||
| /*numAttrs=*/1, | ||
| stream.stream()); | ||
| #else |
There was a problem hiding this comment.
What if you're on CUDA << 12800?
you'll fail (cudaMemcpyBatchAsync wasn't yet introduced)
| // Batched DMA path: one cudaMemcpyBatchAsync covers all per-(block, layer) | ||
| // copies for the blocks in this file (num_blocks * num_tensors). | ||
| // The batch executes in stream order; ordering within the batch is unspecified. | ||
| void TensorCopier::copy_blocks_via_batch_memcpy( |
There was a problem hiding this comment.
I'm not sure it's relevant, but HIP (AMD) also supports batch memcpy out of the box, and might be worth adding as well.
| &fail_idx, | ||
| stream.stream()); | ||
| #endif | ||
| TORCH_CHECK(err == cudaSuccess, |
There was a problem hiding this comment.
Suggested change
| TORCH_CHECK(err == cudaSuccess, | |
| TORCH_CHECK(err == cudaSuccess, | |
| "cudaMemcpyBatchAsync failed err=", | |
| cudaGetErrorString(err)); |
|
|
||
| // Set attributes with srcAccessOrder=ANY (cudaMemcpySrcAccessOrderAny) | ||
| // for malloc'd host staging buffer. Same as vLLM's cuda_mem_ops.py. | ||
| thread_local cudaMemcpyAttributes attrs = [] { |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the per-(block, layer)
cudaMemcpyAsyncloop inTensorCopierwith a singlecudaMemcpyBatchAsync(CUDA 12.8+) submission. Submits all descriptors in one driver call, removing per-call dispatch overhead.USE_BATCH_MEMCPY_READ=0/USE_BATCH_MEMCPY_WRITE=0.srcAccessOrder=ANYset oncudaMemcpyAttributes(matches vLLM'ssimple_kv_offload/cuda_mem_ops.py).#if CUDA_VERSIONhandles the failIdx out-param that CUDA 13 dropped.Measured impact (128k tokens, TP=4,
--block-size 512)Big wins on HMA models where per-layer DMAs are small; neutral on Llama/no-HMA where each per-call copy is already large enough that driver dispatch is amortized.
Test plan