Summary
Under high request concurrency with `SharedStorageOffloadingSpec`, `storage_offload`'s background I/O threads attempt to `cuda_memcpy` from GPU KV tensors that vLLM's memory manager has already freed. The writes fail silently — no crash, no 503, no recovery — resulting in KV blocks that are never written to the filesystem cache.
Environment
| Component | Version |
|---|---|
| vLLM | 0.17.1 |
| llm-d | v0.6.0 (`ghcr.io/llm-d/llm-d-cuda:v0.6.0`) |
| llmd_fs_connector | 0.18 |
| GPUs | 2× NVIDIA L40S (`tensor-parallel-size=2`) |
| Executor backend | `--distributed-executor-backend mp` |
| Connector | `OffloadingConnector` / `SharedStorageOffloadingSpec` |
Observed Error
Failures appear in bursts (multiple threads failing simultaneously) as soon as high-concurrency load begins. All failures occur on rank_1 only; rank_0 is unaffected.
```
2026-04-09 00:35:28.544 [ERROR] [thread:140188806280768] Store failed for
/kvcache/kv-cache//Qwen/Qwen3-0.6B/block_size_16_blocks_per_file_16/tp_2_pp_size_1_pcp_size_1/rank_1/auto/474/da/474da563604bfe8e.bin:
Cannot access data pointer of Tensor that doesn't have storage
Exception raised from throw_data_ptr_access_error at /pytorch/c10/core/TensorImpl.cpp:307
frame #2: c10::TensorImpl::throw_data_ptr_access_error()
frame #3: TensorCopier::copy_blocks_via_cuda_memcpy(unsigned char*, std::vector<long> const&, bool)
frame #4: FileIO::write_blocks_to_file(...)
frame #5-#6: <I/O thread pool dispatch>
```
Multiple threads (140188806280768, 140188730746432, 140188789495360, ...) fail within the same millisecond on different block paths.
Root Cause
The `storage_offload` background write threads hold tensor references by physical address (or block index), but nothing prevents vLLM's KV cache allocator from freeing and reallocating the underlying GPU memory in the meantime. When the write thread eventually executes `copy_blocks_via_cuda_memcpy`, the `TensorImpl` storage has already been released, triggering `throw_data_ptr_access_error`.
The error is caught and logged but swallowed: the write silently fails. The affected KV blocks are never written to the filesystem cache, so those blocks cannot be restored on a future cache miss.
That only rank_1 fails under TP=2 suggests a timing difference between ranks: rank_0 completes its write before the block is freed, while rank_1 does not.
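A minimal, framework-free sketch of the race described above. All names here (`KVBlock`, `offload_block`) are hypothetical stand-ins, not vLLM or llmd_fs_connector APIs; the "I/O thread" is simulated by simply running the write after the allocator has freed the block, mirroring the rank_1 timing:

```python
class FreedStorageError(RuntimeError):
    """Stands in for 'Cannot access data pointer of Tensor that doesn't have storage'."""

class KVBlock:
    def __init__(self, data):
        self._data = data          # stands in for GPU storage

    def data_ptr(self):
        if self._data is None:
            raise FreedStorageError("storage already released")
        return self._data

    def free(self):
        self._data = None          # allocator reclaims the storage

def offload_block(block, sink):
    """What the background write thread does: read via the raw pointer."""
    try:
        sink.append(block.data_ptr())
        return True
    except FreedStorageError:
        return False               # error is logged and swallowed upstream

block = KVBlock(data=b"kv-bytes")
writes = []

# rank_0 timing: the write completes before the allocator frees the block
ok_rank0 = offload_block(block, writes)
block.free()
# rank_1 timing: the free races ahead of the write -> silent failure
ok_rank1 = offload_block(block, writes)

print(ok_rank0, ok_rank1)   # True False
print(writes)               # [b'kv-bytes'] -- rank_1's block was never persisted
```

The key point the toy model captures: the block reference held by the writer does not keep the storage alive, and the swallowed exception turns a memory-lifetime bug into silent data loss.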
Impact
- Affected KV blocks are never persisted, so they cannot be restored on a future cache miss; the loss is silent (no crash, no 503, no recovery).
- Possibly related to the `shm_broadcast` EngineCore stall (#457: llmd fs backend: EngineCore deadlock under SharedStorageOffloadingSpec + mp executor at high concurrency) observed under the same load profile.
Suggested Fix
The write path must hold a reference that prevents the KV tensor's storage from being freed until the CUDA memcpy completes. Options:
- Ref-count the tensor before enqueuing the write; release after `copy_blocks_via_cuda_memcpy` returns
- Copy to a staging buffer synchronously (on the caller thread) before handing off to the I/O thread pool, so the I/O thread never touches the original GPU tensor
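A sketch of the second option under simplified assumptions: the block is modeled as a `bytearray`, `bytes(gpu_block)` stands in for a device-to-host `cuda_memcpy`, and the names `enqueue_write` / `io_worker` are illustrative, not vLLM or llmd_fs_connector APIs. The staging copy happens on the caller thread, while the caller still holds a valid reference, so the I/O thread never dereferences the original tensor:

```python
import queue
import threading

write_queue: "queue.Queue" = queue.Queue()
persisted: dict = {}
done = threading.Event()

def io_worker():
    # The I/O thread only ever sees the staged host copy.
    while True:
        item = write_queue.get()
        if item is None:
            done.set()
            return
        path, staged = item
        persisted[path] = staged   # stands in for write_blocks_to_file()

def enqueue_write(path, gpu_block):
    # Synchronous staging copy on the caller thread; afterwards the
    # allocator may free and reuse gpu_block without affecting the write.
    staged = bytes(gpu_block)      # stands in for a D2H cuda_memcpy
    write_queue.put((path, staged))

t = threading.Thread(target=io_worker, daemon=True)
t.start()

block = bytearray(b"kv-bytes")
enqueue_write("/kvcache/example_block.bin", block)
block[:] = b"reused!!"             # allocator reuses the memory: harmless now

write_queue.put(None)              # sentinel: drain and stop the worker
done.wait(timeout=5)
print(persisted)   # {'/kvcache/example_block.bin': b'kv-bytes'}
```

The trade-off is extra host memory and a synchronous copy on the request path; the ref-counting option avoids the copy but requires the allocator to respect pins held by the write path.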