Hybrid KV cache support for mamba+attention models (Qwen3.5) #465

@malaiwah

Description

Problem

The llm-d fs-backend needs to handle hybrid models (mamba + attention layers), where different KV cache groups have different block sizes, tensor shapes, and offload semantics. Stock llm-d assumes a uniform block size across all groups.

What we built

Extended the fs-backend to support hybrid chunk sizing and partial sub-block transfers for models like Qwen3.5-4B-FP8 (4 KV cache groups: 3 mamba + 1 attention, hybrid_chunk_size=8192, block_size=1056).
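The core of the hybrid sizing is that each group gets its own blocks-per-file ratio instead of one global value. A minimal sketch of that arithmetic, with hypothetical group names and block sizes (the real Qwen3.5 config differs, e.g. the attention block size above is 1056):

```python
# Hypothetical sketch of per-group file sizing for a hybrid model.
# Group names and block sizes here are illustrative assumptions,
# not values from the actual Qwen3.5 config.
HYBRID_CHUNK_SIZE = 8192  # tokens covered by one offload file

# Each KV cache group hashes blocks at its own granularity.
group_hash_block_size = {
    "mamba_0": 8192,    # mamba state: one block per chunk (assumed)
    "mamba_1": 8192,
    "mamba_2": 8192,
    "attention": 1024,  # attention KV blocks are finer-grained (assumed)
}

# Stock llm-d assumes this ratio is identical for every group;
# the hybrid extension computes it per group instead.
gpu_blocks_per_file = {
    name: HYBRID_CHUNK_SIZE // size
    for name, size in group_hash_block_size.items()
}
print(gpu_blocks_per_file)
# {'mamba_0': 1, 'mamba_1': 1, 'mamba_2': 1, 'attention': 8}
```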

Key changes in llmd_fs_backend/:

  • spec.py: per-group gpu_blocks_per_file, computed as hybrid_chunk_size / group_hash_block_size
  • worker.py: GroupedStorageOffloadingHandler with per-group file mappers, tensor layouts, and store/load engines. Separated load and store engines to avoid polling races.
  • C++ tensor_copier: partial sub-block transfers, hybrid block offset/count support
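The grouped handler can be pictured as a thin dispatcher over per-group state. A minimal sketch, with names and structure that are assumptions rather than the actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GroupState:
    """Per-KV-cache-group resources (hypothetical structure)."""
    blocks_per_file: int
    file_mapper: Callable[[str], str]  # block hash -> file path
    # In the real backend these would be tensor layouts and
    # store/load engines; strings stand in for them here.
    store_engine: str
    load_engine: str

class GroupedStorageOffloadingHandler:
    """Routes store/load requests to the owning group's resources.

    Keeping store and load engines separate per group avoids the
    polling race where a load observes a file that a concurrent
    store is still writing.
    """
    def __init__(self, groups: dict[str, GroupState]):
        self.groups = groups

    def path_for(self, group: str, block_hash: str) -> str:
        return self.groups[group].file_mapper(block_hash)

handler = GroupedStorageOffloadingHandler({
    "attention": GroupState(8, lambda h: f"/cache/attn/{h}.bin",
                            "store-a", "load-a"),
    "mamba_0": GroupState(1, lambda h: f"/cache/mamba0/{h}.bin",
                          "store-m", "load-m"),
})
print(handler.path_for("attention", "abc123"))
# /cache/attn/abc123.bin
```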

Results

  • All 4 groups store/load correctly across container restarts
  • 79% cache hit rate on 30k-token prompts after cold restart
  • Cross-restart hash determinism with PYTHONHASHSEED=0
  • Graceful fallback to recompute on file size mismatches

Branch

malaiwah/llm-d-kv-cache:codex/hybrid-kv-offload — 11 files, 1172 insertions, 326 deletions.

Related: vllm-project/vllm#38230, LMCache/LMCache#2879

AI-assisted: developed with Claude. All changes reviewed and tested by a human.
