Problem
The llm-d fs-backend needs to handle hybrid models (mamba + attention layers) where KV cache groups have different block sizes, tensor shapes, and offload semantics. Stock llm-d assumes uniform block sizes across groups.
What we built
Extended the fs-backend to support hybrid chunk sizing and partial sub-block transfers for models like Qwen3.5-4B-FP8 (4 KV cache groups: 3 mamba + 1 attention, `hybrid_chunk_size=8192`, `block_size=1056`).
Key changes in `llmd_fs_backend/`:
- `spec.py`: per-group `gpu_blocks_per_file` calculated from `hybrid_chunk_size / group_hash_block_size`
- `worker.py`: `GroupedStorageOffloadingHandler` with per-group file mappers, tensor layouts, and store/load engines. Load and store engines are separated to avoid polling races.
- C++ `tensor_copier`: partial sub-block transfers and hybrid block offset/count support
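A minimal sketch of the per-group sizing idea above. `GroupSpec`, `gpu_blocks_per_file`, and the block-size values here are illustrative, not the actual llm-d API; it assumes each group's hash block size evenly divides the hybrid chunk size (the doc's `block_size=1056` is the GPU block size, not necessarily the hash block size used in this division):

```python
from dataclasses import dataclass

@dataclass
class GroupSpec:
    """Illustrative per-group spec; not the real llm-d class."""
    name: str
    kind: str             # "mamba" or "attention"
    hash_block_size: int  # tokens covered by one hashed block in this group

HYBRID_CHUNK_SIZE = 8192  # tokens per offload file, from the model config

def gpu_blocks_per_file(group: GroupSpec,
                        hybrid_chunk_size: int = HYBRID_CHUNK_SIZE) -> int:
    """Each offload file covers hybrid_chunk_size tokens, so the number of
    GPU blocks it holds differs per group when block sizes differ."""
    if hybrid_chunk_size % group.hash_block_size != 0:
        raise ValueError(
            f"{group.name}: chunk size not a multiple of hash block size")
    return hybrid_chunk_size // group.hash_block_size

# Hypothetical group layout for a 3-mamba + 1-attention model:
groups = [
    GroupSpec("mamba_0", "mamba", hash_block_size=8192),
    GroupSpec("attn_0", "attention", hash_block_size=1024),
]
per_group = {g.name: gpu_blocks_per_file(g) for g in groups}
# per_group == {"mamba_0": 1, "attn_0": 8}
```

The point is that a single file spans a fixed token window, while the block count inside it varies per group, which is what forces per-group file mappers and tensor layouts.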
Results
- All 4 groups store/load correctly across container restarts
- 79% cache hit rate on 30k-token prompts after cold restart
- Cross-restart hash determinism with `PYTHONHASHSEED=0`
- Graceful fallback to recompute on file size mismatches
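The size-mismatch fallback can be sketched as follows. This is a simplified illustration, not the actual handler code; `load_offloaded_blocks` and the `None`-means-recompute convention are hypothetical, and the expected size would come from the per-group layout:

```python
import os

def load_offloaded_blocks(path: str, expected_size: int):
    """Return the file's bytes only if its on-disk size matches what the
    current per-group layout expects; otherwise signal the caller to
    recompute. A stale or truncated file (e.g. written under an older
    layout) must never be deserialized into KV cache tensors."""
    try:
        if os.path.getsize(path) != expected_size:
            return None  # size mismatch: fall back to recomputing the KV cache
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None  # plain cache miss: also recompute
```

The caller treats `None` exactly like a cache miss and schedules recompute, so a corrupted or outdated offload file degrades throughput instead of failing the request.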
Branch
`malaiwah/llm-d-kv-cache:codex/hybrid-kv-offload` — 11 files, 1172 insertions, 326 deletions.
Related: vllm-project/vllm#38230, LMCache/LMCache#2879
AI-assisted: developed with Claude. All changes reviewed and tested by a human.