Per-layer locks for compressed cache instead of single global Mutex #46

@SaschaOnTour

Description

Problem / Motivation

All transformer layers share a single Arc<Mutex<dyn CompressedKVCache>>. Every decode step acquires this mutex once per layer — with 32 layers, that's 32 sequential lock acquisitions per token. With batch size > 1, all sequences serialize on this lock.

Identified by Copilot review (Finding M1).

Current code (kv_cache/mod.rs)

```rust
pub enum KvCache {
    Compressed {
        cache: Arc<Mutex<dyn CompressedKVCache>>,  // shared across ALL layers
        layer: usize,
        // ...
    }
}
```

Solution

Options (in order of preference):

Option A: Per-layer storage within the cache
The CompressedKVCache trait already receives layer: usize in every method call. Internally, storage is already per-layer. The Mutex only serializes because the trait object is shared. If the cache implementation uses internal per-layer locking (e.g., RwLock per layer), the outer Mutex can be downgraded.
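A minimal sketch of what Option A could look like inside the cache implementation, assuming the concrete type keeps one store per layer behind its own `RwLock` (all names below are hypothetical, not the actual `CompressedKVCache` impl):

```rust
use std::sync::RwLock;

// Hypothetical per-layer store; stand-in for the real compressed storage.
struct LayerStore {
    keys: Vec<f32>,
    values: Vec<f32>,
}

// One RwLock per layer: an append on layer i never blocks layer j,
// and methods take &self, so the outer Mutex is no longer needed.
struct PerLayerCache {
    layers: Vec<RwLock<LayerStore>>,
}

impl PerLayerCache {
    fn new(num_layers: usize) -> Self {
        Self {
            layers: (0..num_layers)
                .map(|_| RwLock::new(LayerStore { keys: Vec::new(), values: Vec::new() }))
                .collect(),
        }
    }

    // Matches the trait's shape: the layer index selects the lock.
    fn append(&self, layer: usize, k: f32, v: f32) {
        let mut store = self.layers[layer].write().unwrap();
        store.keys.push(k);
        store.values.push(v);
    }

    fn len(&self, layer: usize) -> usize {
        self.layers[layer].read().unwrap().keys.len()
    }
}
```

Because every method already receives `layer: usize`, this change is internal to the cache and the trait surface stays the same.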

Option B: Vec of per-layer Mutexes

```rust
// Instead of one Mutex for the whole cache:
cache: Vec<Arc<Mutex<LayerCache>>>,  // one per layer
```
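Sketched out with a hypothetical `LayerCache` placeholder type, construction and per-layer locking could look like this; each layer's `KvCache` would then hold only the `Arc` for its own slot:

```rust
use std::sync::{Arc, Mutex};

// Placeholder for the real per-layer cache state.
#[derive(Default)]
struct LayerCache {
    tokens: usize,
}

// Build one independently lockable cache per layer.
fn build_per_layer(num_layers: usize) -> Vec<Arc<Mutex<LayerCache>>> {
    (0..num_layers)
        .map(|_| Arc::new(Mutex::new(LayerCache::default())))
        .collect()
}
```

Locking `layers[0]` then no longer blocks a decode step touching `layers[1]`, so the 32 acquisitions per token contend only when two sequences hit the same layer at the same time.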

Option C: Lock-free decode path
If decode only reads from committed cache + writes to a single new slot, it could use atomic operations instead of a mutex.
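A sketch of the idea under the stated assumptions (single writer per layer, pre-allocated capacity; all names hypothetical): the writer fills a slot, then publishes the new length with a `Release` store, and readers take an `Acquire` snapshot of the watermark and read only below it.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

// Hypothetical lock-free layer: only indices below `committed` are readable.
struct LockFreeLayer {
    slots: Vec<AtomicU32>,   // f32 values stored as raw bits; fixed capacity
    committed: AtomicUsize,  // watermark published to readers
}

impl LockFreeLayer {
    fn with_capacity(cap: usize) -> Self {
        Self {
            slots: (0..cap).map(|_| AtomicU32::new(0)).collect(),
            committed: AtomicUsize::new(0),
        }
    }

    // Single writer: write the slot first, then publish with Release so
    // any reader that observes the new length also sees the slot's data.
    fn append(&self, v: f32) {
        let idx = self.committed.load(Ordering::Relaxed);
        self.slots[idx].store(v.to_bits(), Ordering::Relaxed);
        self.committed.store(idx + 1, Ordering::Release);
    }

    // Readers: Acquire-load the watermark, then read only entries below it.
    fn snapshot(&self) -> Vec<f32> {
        let n = self.committed.load(Ordering::Acquire);
        (0..n)
            .map(|i| f32::from_bits(self.slots[i].load(Ordering::Relaxed)))
            .collect()
    }
}
```

This only works if the single-writer-per-layer invariant actually holds in the decode path; with multiple concurrent writers per layer, Option A or B is the safer shape.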

Acceptance criteria

  • Decode with batch_size > 1 does not serialize all layers
  • No deadlocks
  • All existing tests pass
  • Benchmark: measurable improvement with batch_size >= 2
