Hybrid KV cache support for mamba+attention models (Qwen3.5) #465

@malaiwah

Description

Problem

The llm-d fs-backend needs to handle hybrid models (mamba + attention layers), where different KV cache groups have different block sizes, tensor shapes, and offload semantics. Stock llm-d assumes a uniform block size across all groups.

What we built

Extended the fs-backend to support hybrid chunk sizing and partial sub-block transfers for models like Qwen3.5-4B-FP8 (4 KV cache groups: 3 mamba + 1 attention, hybrid_chunk_size=8192, block_size=1056).
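The core of the hybrid sizing is that each group gets its own blocks-per-file ratio instead of one global value. A minimal sketch of that arithmetic, with hypothetical group names and block sizes (the real Qwen3.5 config differs, e.g. the attention block size above is 1056):

```python
# Hypothetical sketch of per-group file sizing for a hybrid model.
# Group names and block sizes here are illustrative assumptions,
# not values from the actual Qwen3.5 config.
HYBRID_CHUNK_SIZE = 8192  # tokens covered by one offload file

# Each KV cache group hashes blocks at its own granularity.
group_hash_block_size = {
    "mamba_0": 8192,    # mamba state: one block per chunk (assumed)
    "mamba_1": 8192,
    "mamba_2": 8192,
    "attention": 1024,  # attention KV blocks are finer-grained (assumed)
}

# Stock llm-d assumes this ratio is identical for every group;
# the hybrid extension computes it per group instead.
gpu_blocks_per_file = {
    name: HYBRID_CHUNK_SIZE // size
    for name, size in group_hash_block_size.items()
}
print(gpu_blocks_per_file)
# {'mamba_0': 1, 'mamba_1': 1, 'mamba_2': 1, 'attention': 8}
```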

Key changes in llmd_fs_backend/:

  • spec.py: per-group gpu_blocks_per_file, computed as hybrid_chunk_size / group_hash_block_size
  • worker.py: GroupedStorageOffloadingHandler with per-group file mappers, tensor layouts, and store/load engines. Separated load and store engines to avoid polling races.
  • C++ tensor_copier: partial sub-block transfers, hybrid block offset/count support
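The grouped handler can be pictured as a thin dispatcher over per-group state. A minimal sketch, with names and structure that are assumptions rather than the actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GroupState:
    """Per-KV-cache-group resources (hypothetical structure)."""
    blocks_per_file: int
    file_mapper: Callable[[str], str]  # block hash -> file path
    # In the real backend these would be tensor layouts and
    # store/load engines; strings stand in for them here.
    store_engine: str
    load_engine: str

class GroupedStorageOffloadingHandler:
    """Routes store/load requests to the owning group's resources.

    Keeping store and load engines separate per group avoids the
    polling race where a load observes a file that a concurrent
    store is still writing.
    """
    def __init__(self, groups: dict[str, GroupState]):
        self.groups = groups

    def path_for(self, group: str, block_hash: str) -> str:
        return self.groups[group].file_mapper(block_hash)

handler = GroupedStorageOffloadingHandler({
    "attention": GroupState(8, lambda h: f"/cache/attn/{h}.bin",
                            "store-a", "load-a"),
    "mamba_0": GroupState(1, lambda h: f"/cache/mamba0/{h}.bin",
                          "store-m", "load-m"),
})
print(handler.path_for("attention", "abc123"))
# /cache/attn/abc123.bin
```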

Results

  • All 4 groups store/load correctly across container restarts
  • 79% cache hit rate on 30k-token prompts after cold restart
  • Cross-restart hash determinism with PYTHONHASHSEED=0
  • Graceful fallback to recompute on file size mismatches

Branch

malaiwah/llm-d-kv-cache:codex/hybrid-kv-offload — 11 files, 1172 insertions, 326 deletions.

Related: vllm-project/vllm#38230, LMCache/LMCache#2879

AI-assisted: developed with Claude. All changes reviewed and tested by a human.
