
llmd_fs_connector: background write threads access freed GPU tensors under high concurrency #504

@natoscott

Description


Summary

Under high request concurrency with SharedStorageOffloadingSpec, storage_offload's background I/O threads attempt to cuda_memcpy from GPU KV tensors that vLLM's memory manager has already freed. The writes fail silently — no crash, no 503, no recovery — resulting in KV blocks that are never written to the filesystem cache.

Environment

| Component | Version |
| --- | --- |
| vLLM | 0.17.1 |
| llm-d | v0.6.0 (ghcr.io/llm-d/llm-d-cuda:v0.6.0) |
| llmd_fs_connector | 0.18 |
| GPUs | 2× NVIDIA L40S (tensor-parallel-size=2) |
| Executor backend | --distributed-executor-backend mp |
| Connector | OffloadingConnector / SharedStorageOffloadingSpec |

Observed Error

Errors appear in bursts (multiple threads failing simultaneously) as soon as high-concurrency load is applied. All failures occur on rank_1; rank_0 is unaffected.

```
2026-04-09 00:35:28.544 [ERROR] [thread:140188806280768] Store failed for
/kvcache/kv-cache//Qwen/Qwen3-0.6B/block_size_16_blocks_per_file_16/tp_2_pp_size_1_pcp_size_1/rank_1/auto/474/da/474da563604bfe8e.bin:
Cannot access data pointer of Tensor that doesn't have storage
Exception raised from throw_data_ptr_access_error at /pytorch/c10/core/TensorImpl.cpp:307
frame #2: c10::TensorImpl::throw_data_ptr_access_error()
frame #3: TensorCopier::copy_blocks_via_cuda_memcpy(unsigned char*, std::vector<long> const&, bool)
frame #4: FileIO::write_blocks_to_file(...)
frame #5-#6: <I/O thread pool dispatch>
```

Multiple threads (140188806280768, 140188730746432, 140188789495360, ...) fail within the same millisecond on different block paths.

Root Cause

The storage_offload background write threads hold tensor references by physical address (or block index) but do nothing to prevent vLLM's KV cache allocator from freeing and reallocating the underlying GPU memory. By the time a write thread executes copy_blocks_via_cuda_memcpy, the TensorImpl's storage has already been released, and PyTorch raises throw_data_ptr_access_error.

The error is caught and logged but swallowed — the write silently fails. The affected KV blocks are never written to the filesystem cache, meaning those blocks cannot be restored on a future cache miss.

Only rank_1 is affected at TP=2, which suggests a timing difference between ranks: rank_0 completes its write before the block is freed, while rank_1 does not.
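The lifetime race can be sketched in a few lines of dependency-free Python. All names here (BlockPool, io_worker) are illustrative stand-ins, not the actual llmd_fs_connector or vLLM APIs; the point is only that the I/O thread holds a block index while the allocator independently frees the storage it refers to:

```python
import threading
import queue

class BlockPool:
    """Stand-in for vLLM's KV block allocator (hypothetical sketch)."""
    def __init__(self, num_blocks):
        self.blocks = {i: bytearray(16) for i in range(num_blocks)}
        self.freed = set()
        self.lock = threading.Lock()

    def free(self, idx):
        with self.lock:
            self.freed.add(idx)  # storage released; index may be recycled

    def read(self, idx):
        with self.lock:
            if idx in self.freed:
                # Analogue of "Cannot access data pointer of Tensor
                # that doesn't have storage"
                raise RuntimeError("block storage already freed")
            return bytes(self.blocks[idx])

def io_worker(pool, jobs, results):
    while True:
        idx = jobs.get()
        if idx is None:
            break
        try:
            results.append(("ok", pool.read(idx)))
        except RuntimeError as exc:
            # Error is logged but swallowed -- nothing surfaces upstream
            results.append(("swallowed", str(exc)))

pool = BlockPool(4)
jobs, results = queue.Queue(), []
worker = threading.Thread(target=io_worker, args=(pool, jobs, results))
worker.start()

pool.free(2)   # allocator frees the block before the queued write runs
jobs.put(2)    # background write now targets freed storage
jobs.put(None)
worker.join()
print(results[0][0])  # "swallowed": the write fails silently
```

In the real connector the window between enqueue and copy is nondeterministic, which is why the failure tracks rank timing rather than reproducing on every request.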

Impact

Silent, unrecoverable loss of offloaded KV blocks: the affected blocks are never persisted to the filesystem cache, so later requests that should hit the cache recompute prefill instead. No error is surfaced to the client, so the regression shows up only as a degraded cache hit rate.

Suggested Fix

The write path must hold a reference that prevents the KV tensor's storage from being freed until the CUDA memcpy completes. Options:

  1. Ref-count the tensor before enqueuing the write; release after copy_blocks_via_cuda_memcpy returns
  2. Copy to a staging buffer synchronously (on the caller thread) before handing off to the I/O thread pool, so the I/O thread never touches the original GPU tensor
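Option 2 can be sketched as follows. This is a minimal illustration with made-up names (enqueue_write, io_worker), not the connector's real API; in the actual fix the synchronous copy would be a device-to-host cudaMemcpy issued on the caller's thread while it still holds a valid block reference:

```python
import threading
import queue

written = []
io_queue = queue.Queue()

def enqueue_write(block, io_queue):
    # Synchronous copy on the caller thread, while the block is still
    # guaranteed live; the I/O thread only ever sees this staging copy.
    staging = bytes(block)
    io_queue.put(staging)

def io_worker():
    while True:
        buf = io_queue.get()
        if buf is None:
            break
        written.append(buf)  # stand-in for FileIO::write_blocks_to_file

worker = threading.Thread(target=io_worker)
worker.start()

block = bytearray(b"kv-block-data")
enqueue_write(block, io_queue)
block[:] = b"\x00" * len(block)  # allocator reuses the block: now harmless

io_queue.put(None)
worker.join()
print(written[0])  # b'kv-block-data', unaffected by the reuse
```

The trade-off versus option 1 is extra host memory and copy bandwidth on the request path, in exchange for completely decoupling the I/O pool from GPU tensor lifetimes.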
