Summary
Under high request concurrency with `SharedStorageOffloadingSpec`, `storage_offload`'s background I/O threads attempt to `cuda_memcpy` from GPU KV tensors that vLLM's memory manager has already freed. The writes fail silently — no crash, no 503, no recovery — resulting in KV blocks that are never written to the filesystem cache.
Environment
| Component | Version |
|---|---|
| vLLM | 0.17.1 |
| llm-d | v0.6.0 (`ghcr.io/llm-d/llm-d-cuda:v0.6.0`) |
| llmd_fs_connector | 0.18 |
| GPUs | 2× NVIDIA L40S (`tensor-parallel-size=2`) |
| Executor backend | `--distributed-executor-backend mp` |
| Connector | `OffloadingConnector` / `SharedStorageOffloadingSpec` |
Observed Error
Failures appear in bursts (multiple threads failing simultaneously) as soon as high-concurrency load begins. All failures occur on rank_1 only; rank_0 is unaffected.
```
2026-04-09 00:35:28.544 [ERROR] [thread:140188806280768] Store failed for
/kvcache/kv-cache//Qwen/Qwen3-0.6B/block_size_16_blocks_per_file_16/tp_2_pp_size_1_pcp_size_1/rank_1/auto/474/da/474da563604bfe8e.bin:
Cannot access data pointer of Tensor that doesn't have storage
Exception raised from throw_data_ptr_access_error at /pytorch/c10/core/TensorImpl.cpp:307
frame #2: c10::TensorImpl::throw_data_ptr_access_error()
frame #3: TensorCopier::copy_blocks_via_cuda_memcpy(unsigned char*, std::vector<long> const&, bool)
frame #4: FileIO::write_blocks_to_file(...)
frame #5-#6: <I/O thread pool dispatch>
```
Multiple threads (140188806280768, 140188730746432, 140188789495360, ...) fail within the same millisecond on different block paths.
Root Cause
The `storage_offload` background write threads hold tensor references by physical address (or block index), but nothing prevents vLLM's KV cache allocator from freeing and reallocating the underlying GPU memory in the meantime. When the write thread eventually executes `copy_blocks_via_cuda_memcpy`, the `TensorImpl` storage has already been released, triggering `throw_data_ptr_access_error`.
The error is caught and logged but swallowed: the write silently fails. The affected KV blocks are never written to the filesystem cache, so those blocks cannot be restored on a future cache miss.
That only rank_1 fails under TP=2 suggests a timing difference between ranks: rank_0 completes its write before the block is freed, while rank_1 does not.
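A minimal, framework-free sketch of the race described above. All names here (`KVBlock`, `offload_block`) are hypothetical stand-ins, not vLLM or llmd_fs_connector APIs; the "I/O thread" is simulated by simply running the write after the allocator has freed the block, mirroring the rank_1 timing:

```python
class FreedStorageError(RuntimeError):
    """Stands in for 'Cannot access data pointer of Tensor that doesn't have storage'."""

class KVBlock:
    def __init__(self, data):
        self._data = data          # stands in for GPU storage

    def data_ptr(self):
        if self._data is None:
            raise FreedStorageError("storage already released")
        return self._data

    def free(self):
        self._data = None          # allocator reclaims the storage

def offload_block(block, sink):
    """What the background write thread does: read via the raw pointer."""
    try:
        sink.append(block.data_ptr())
        return True
    except FreedStorageError:
        return False               # error is logged and swallowed upstream

block = KVBlock(data=b"kv-bytes")
writes = []

# rank_0 timing: the write completes before the allocator frees the block
ok_rank0 = offload_block(block, writes)
block.free()
# rank_1 timing: the free races ahead of the write -> silent failure
ok_rank1 = offload_block(block, writes)

print(ok_rank0, ok_rank1)   # True False
print(writes)               # [b'kv-bytes'] -- rank_1's block was never persisted
```

The key point the toy model captures: the block reference held by the writer does not keep the storage alive, and the swallowed exception turns a memory-lifetime bug into silent data loss.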
Impact
- Affected KV blocks are never persisted, so they cannot be restored on a future cache miss; the loss is silent (no crash, no 503, no recovery).
- Possibly related to the `shm_broadcast` EngineCore stall (#457: llmd fs backend: EngineCore deadlock under SharedStorageOffloadingSpec + mp executor at high concurrency) observed under the same load profile.
Suggested Fix
The write path must hold a reference that prevents the KV tensor's storage from being freed until the CUDA memcpy completes. Options:
- Ref-count the tensor before enqueuing the write; release after `copy_blocks_via_cuda_memcpy` returns
- Copy to a staging buffer synchronously (on the caller thread) before handing off to the I/O thread pool, so the I/O thread never touches the original GPU tensor
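A sketch of the second option under simplified assumptions: the block is modeled as a `bytearray`, `bytes(gpu_block)` stands in for a device-to-host `cuda_memcpy`, and the names `enqueue_write` / `io_worker` are illustrative, not vLLM or llmd_fs_connector APIs. The staging copy happens on the caller thread, while the caller still holds a valid reference, so the I/O thread never dereferences the original tensor:

```python
import queue
import threading

write_queue: "queue.Queue" = queue.Queue()
persisted: dict = {}
done = threading.Event()

def io_worker():
    # The I/O thread only ever sees the staged host copy.
    while True:
        item = write_queue.get()
        if item is None:
            done.set()
            return
        path, staged = item
        persisted[path] = staged   # stands in for write_blocks_to_file()

def enqueue_write(path, gpu_block):
    # Synchronous staging copy on the caller thread; afterwards the
    # allocator may free and reuse gpu_block without affecting the write.
    staged = bytes(gpu_block)      # stands in for a D2H cuda_memcpy
    write_queue.put((path, staged))

t = threading.Thread(target=io_worker, daemon=True)
t.start()

block = bytearray(b"kv-bytes")
enqueue_write("/kvcache/example_block.bin", block)
block[:] = b"reused!!"             # allocator reuses the memory: harmless now

write_queue.put(None)              # sentinel: drain and stop the worker
done.wait(timeout=5)
print(persisted)   # {'/kvcache/example_block.bin': b'kv-bytes'}
```

The trade-off is extra host memory and a synchronous copy on the request path; the ref-counting option avoids the copy but requires the allocator to respect pins held by the write path.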