Problem / Motivation
All transformer layers share a single Arc<Mutex<dyn CompressedKVCache>>. Every decode step acquires this mutex once per layer, so with 32 layers that is 32 sequential lock acquisitions per token. With batch size > 1, every sequence in the batch contends for the same lock, so their cache accesses serialize.
Identified by Copilot review (Finding M1).
Current code (kv_cache/mod.rs)
pub enum KvCache {
    Compressed {
        cache: Arc<Mutex<dyn CompressedKVCache>>, // shared across ALL layers
        layer: usize,
        // ...
    }
}
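To make the cost concrete, here is a minimal sketch of the locking pattern described above. The trait shown is a hypothetical two-method reduction (the issue does not list the real method set of CompressedKVCache), and ToyCache and decode_step are illustrative names only.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical reduced form of the trait; the real methods are not shown in the issue.
trait CompressedKVCache: Send {
    fn append(&mut self, layer: usize, value: f32);
    fn layer_len(&self, layer: usize) -> usize;
}

/// Toy implementation: one plain Vec per layer.
struct ToyCache {
    layers: Vec<Vec<f32>>,
}

impl ToyCache {
    fn new(num_layers: usize) -> Self {
        Self { layers: vec![Vec::new(); num_layers] }
    }
}

impl CompressedKVCache for ToyCache {
    fn append(&mut self, layer: usize, value: f32) {
        self.layers[layer].push(value);
    }

    fn layer_len(&self, layer: usize) -> usize {
        self.layers[layer].len()
    }
}

/// Simulates one decode step and returns how many times the shared lock was taken.
fn decode_step(
    cache: &Arc<Mutex<dyn CompressedKVCache>>,
    num_layers: usize,
    value: f32,
) -> usize {
    let mut acquisitions = 0;
    for layer in 0..num_layers {
        // Every layer goes through the same mutex, even though it only touches its own storage.
        let mut guard = cache.lock().unwrap();
        acquisitions += 1;
        guard.append(layer, value);
    }
    acquisitions
}
```

With 32 layers, decode_step takes the lock 32 times for a single token, and any other thread decoding another sequence has to wait on the same mutex.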
Solution
Options (in order of preference):
Option A: Per-layer storage within the cache
The CompressedKVCache trait already receives layer: usize in every method call, and internally storage is already per-layer. The outer Mutex serializes access only because one trait object is shared by all layers. If the cache implementation does its own per-layer locking (e.g., an RwLock per layer), the outer Mutex can be downgraded, for example to a plain Arc, provided the trait methods can then take &self.
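A rough sketch of what Option A could look like, under the assumption that the cache methods can take &self once locking moves inside. PerLayerCache, LayerStore, and fill_layers_in_parallel are hypothetical names, and the Vec<f32> payload stands in for the real compressed storage.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical per-layer storage; the real cache would hold compressed K/V data.
#[derive(Default)]
struct LayerStore {
    values: Vec<f32>,
}

/// Hypothetical cache with one RwLock per layer and no outer Mutex.
struct PerLayerCache {
    layers: Vec<RwLock<LayerStore>>,
}

impl PerLayerCache {
    fn new(num_layers: usize) -> Self {
        Self {
            layers: (0..num_layers)
                .map(|_| RwLock::new(LayerStore::default()))
                .collect(),
        }
    }

    /// Appends to `layer`, taking only that layer's write lock.
    fn append(&self, layer: usize, value: f32) {
        let mut store = self.layers[layer].write().unwrap();
        store.values.push(value);
    }

    /// Reads the entry count for `layer` under a shared read lock.
    fn layer_len(&self, layer: usize) -> usize {
        self.layers[layer].read().unwrap().values.len()
    }
}

/// Spawns one worker per layer; each worker appends `steps` values to its own layer.
fn fill_layers_in_parallel(cache: &Arc<PerLayerCache>, steps: usize) {
    let workers: Vec<_> = (0..cache.layers.len())
        .map(|layer| {
            let cache = Arc::clone(cache);
            thread::spawn(move || {
                for step in 0..steps {
                    cache.append(layer, step as f32);
                }
            })
        })
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
}
```

Because each worker locks a different layer, none of them blocks another. Whether the existing trait can move from &mut self to &self without touching callers is not established by the issue and would need checking.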
Option B: Vec of per-layer Mutexes
// Instead of one Mutex for the whole cache:
cache: Vec<Arc<Mutex<LayerCache>>>, // one per layer
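As a minimal sketch of Option B, assuming a LayerCache type that holds one layer's storage (its real contents are not specified in the issue), each layer could own a clone of its own Arc. LayerHandle and build_layer_caches are hypothetical names.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for one layer's K/V storage.
#[derive(Default)]
struct LayerCache {
    entries: Vec<f32>,
}

/// Builds one independently locked cache per layer.
fn build_layer_caches(num_layers: usize) -> Vec<Arc<Mutex<LayerCache>>> {
    (0..num_layers)
        .map(|_| Arc::new(Mutex::new(LayerCache::default())))
        .collect()
}

/// Hypothetical per-layer handle holding only its own layer's lock.
struct LayerHandle {
    cache: Arc<Mutex<LayerCache>>,
}

impl LayerHandle {
    /// Appends one value and returns the new entry count for this layer.
    fn append(&self, value: f32) -> usize {
        let mut guard = self.cache.lock().unwrap();
        guard.entries.push(value);
        guard.entries.len()
    }
}
```

With this layout, holding layer 0's lock does not prevent an append to layer 1. The trade-off is that operations spanning all layers (for example a reset) would need to lock each layer in turn.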
Option C: Lock-free decode path
If decode only reads from the committed cache and writes to a single new slot, it could publish that slot with atomic operations instead of taking a mutex.
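One possible shape for Option C, sketched under strong assumptions: a single writer per layer, a fixed capacity allocated up front, and values that fit in an atomic word (f32 stored as raw bits here). AtomicSlotLog is a hypothetical name, and the real compressed cache may not fit this model.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

/// Hypothetical fixed-capacity, single-writer log for one layer.
struct AtomicSlotLog {
    slots: Vec<AtomicU32>,  // f32 values stored as raw bits
    committed: AtomicUsize, // number of slots visible to readers
}

impl AtomicSlotLog {
    fn with_capacity(capacity: usize) -> Self {
        Self {
            slots: (0..capacity).map(|_| AtomicU32::new(0)).collect(),
            committed: AtomicUsize::new(0),
        }
    }

    /// Writes one value into the next free slot, then publishes it.
    /// Assumes exactly one writer; returns false when the log is full.
    fn push(&self, value: f32) -> bool {
        let idx = self.committed.load(Ordering::Relaxed);
        if idx >= self.slots.len() {
            return false;
        }
        self.slots[idx].store(value.to_bits(), Ordering::Relaxed);
        // Release makes the slot write visible to any reader that observes the new length.
        self.committed.store(idx + 1, Ordering::Release);
        true
    }

    /// Returns a snapshot of all committed values without taking a lock.
    fn committed_values(&self) -> Vec<f32> {
        // Acquire pairs with the Release store in `push`.
        let len = self.committed.load(Ordering::Acquire);
        self.slots[..len]
            .iter()
            .map(|slot| f32::from_bits(slot.load(Ordering::Relaxed)))
            .collect()
    }
}
```

Readers only ever see slots below the committed length, so they never observe a half-written entry. Supporting multiple writers or growing the buffer would require a different design.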
Acceptance criteria