feat: batch KV block copies via cudaMemcpyBatchAsync in fs connector by kfirtoledo · Pull Request #607 · llm-d/llm-d-kv-cache

kfirtoledo · 2026-05-26T10:08:59Z

Summary

Replace the per-(block, layer) cudaMemcpyAsync loop in TensorCopier with a single cudaMemcpyBatchAsync (CUDA 12.8+) submission. Submits all descriptors in one driver call, removing per-call dispatch overhead.

Enabled by default; toggle off via USE_BATCH_MEMCPY_READ=0 / USE_BATCH_MEMCPY_WRITE=0.
Per-call DMA loop kept as fallback for older CUDA toolkits and A/B debugging.
srcAccessOrder=ANY set on cudaMemcpyAttributes (matches vLLM's simple_kv_offload/cuda_mem_ops.py).
#if CUDA_VERSION handles the failIdx out-param that CUDA 13 dropped.

Measured impact (128k tokens, TP=4, `--block-size 512`)

Workload	no-BATCH	BATCH	speedup
gpt-oss-20b (HMA) cold	3.35s	1.77s	1.9x
gpt-oss-20b (HMA) hot	0.58s	0.30s	1.9x
gpt-oss-120b (HMA) cold	4.95s	2.51s	2.0x
gpt-oss-120b (HMA) hot	0.83s	0.34s	2.4x
Llama-3.1-8B hot	0.45s	0.43s	neutral
Llama-3.1-70B hot	0.84s	0.85s	neutral

Big wins on HMA models where per-layer DMAs are small; neutral on Llama/no-HMA where each per-call copy is already large enough that driver dispatch is amortized.

Test plan

`make test` — 30 passed, 3 skipped (all on this branch, batched path active by default).
Manual storage roundtrip on Llama-3.1-8B / 70B + gpt-oss-20b/120b with batch enabled and disabled (see table).

Submit all per-(block, layer) copies in one driver call instead of N cudaMemcpyAsync calls. Enabled by default; toggle off with USE_BATCH_MEMCPY_READ / USE_BATCH_MEMCPY_WRITE=0. Requires CUDA 12.8+. Speeds up KV-cache offload writes/reads when per-layer DMA sizes are small enough that driver dispatch dominates. Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>

Etelis

Fallback on << 12800 o.w LGTM

Etelis · 2026-05-28T15:47:28Z

+                                         &attrs_idx,
+                                         /*numAttrs=*/1,
+                                         stream.stream());
+#else


What if you're on CUDA << 12800?
you'll fail (cudaMemcpyBatchAsync wasn't yet introduced)

Etelis · 2026-05-28T15:49:49Z

+// Batched DMA path: one cudaMemcpyBatchAsync covers all per-(block, layer)
+// copies for the blocks in this file (num_blocks * num_tensors).
+// The batch executes in stream order; ordering within the batch is unspecified.
+void TensorCopier::copy_blocks_via_batch_memcpy(


I'm not sure it's relevant, but HIP (AMD) also supports batch memcpy out of the box, and might be worth adding as well.

Etelis · 2026-05-28T16:00:00Z

+                                         &fail_idx,
+                                         stream.stream());
+#endif
+  TORCH_CHECK(err == cudaSuccess,


Suggested change

TORCH_CHECK(err == cudaSuccess,

TORCH_CHECK(err == cudaSuccess,

"cudaMemcpyBatchAsync failed err=",

cudaGetErrorString(err));

Etelis · 2026-05-28T16:00:41Z

+
+  // Set attributes with srcAccessOrder=ANY (cudaMemcpySrcAccessOrderAny)
+  // for malloc'd host staging buffer. Same as vLLM's cuda_mem_ops.py.
+  thread_local cudaMemcpyAttributes attrs = [] {


should be static const

kfirtoledo requested review from dannyharnik, liu-cong and vMaroon as code owners May 26, 2026 10:09

github-actions Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 26, 2026

github-actions Bot requested review from hyeongyun0916, sagearc and yankay May 26, 2026 10:09

kfirtoledo force-pushed the batch-memcpy branch from c7fd16e to 191f072 Compare May 26, 2026 10:17

github-actions Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 26, 2026

Etelis reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: batch KV block copies via cudaMemcpyBatchAsync in fs connector#607

feat: batch KV block copies via cudaMemcpyBatchAsync in fs connector#607
kfirtoledo wants to merge 1 commit into
llm-d:mainfrom
kfirtoledo:batch-memcpy

kfirtoledo commented May 26, 2026

Uh oh!

Etelis left a comment

Uh oh!

Etelis May 28, 2026

Uh oh!

Etelis May 28, 2026

Uh oh!

Etelis May 28, 2026

Uh oh!

Etelis May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kfirtoledo commented May 26, 2026

Summary

Measured impact (128k tokens, TP=4, --block-size 512)

Test plan

Uh oh!

Etelis left a comment

Choose a reason for hiding this comment

Uh oh!

Etelis May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Etelis May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Etelis May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Etelis May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Measured impact (128k tokens, TP=4, `--block-size 512`)