Problem / Motivation
When compressed cache is active, `ContextSize(0)` is used and `auto_device_map.rs` returns 0 bytes for KV cache size estimation. This means no memory limit is enforced. In server mode (HTTP API), an attacker can send arbitrarily long sequences to exhaust all GPU memory.
Identified by Copilot review (Finding C2).
Affected files
- `mistralrs-server-core/src/mistralrs_for_server_builder.rs` — `ContextSize(0)`
- `mistralrs-core/src/pipeline/loaders/auto_device_map.rs` — returns 0 for compressed cache
Solution
Even though compressed cache doesn't use PagedAttention's block pool, it still needs a memory budget:
- Estimate the compressed cache size from the model config:

  ```text
  per_token  = num_kv_heads * head_dim * 2 (K+V) * (bits / 8) * num_layers
  max_tokens = available_vram / per_token
  ```
- Enforce `max_seq_len` in the compressed cache:

  ```rust
  // `.into()` assumes the cache's error type is constructible from a string.
  if current_seq_len >= max_seq_len {
      return Err("Maximum sequence length exceeded for compressed cache".into());
  }
  ```
- Pass the actual memory estimate to device mapping instead of 0.
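The estimation step above can be sketched as a small standalone function. Note that `CompressedCacheConfig`, `per_token_bytes`, and `max_tokens` are hypothetical names for illustration, not the actual mistral.rs API, and the Llama-3-8B-like shape is only an example:

```rust
/// Illustrative config for a compressed KV cache (not the real mistral.rs type).
struct CompressedCacheConfig {
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    /// Bit width of each compressed cache element.
    bits: usize,
}

/// Bytes needed to cache K and V for one token across all layers:
/// num_kv_heads * head_dim * 2 (K+V) * bits/8 * num_layers.
fn per_token_bytes(cfg: &CompressedCacheConfig) -> usize {
    cfg.num_kv_heads * cfg.head_dim * 2 * cfg.bits / 8 * cfg.num_layers
}

/// Maximum number of cached tokens that fit in the given VRAM budget.
fn max_tokens(cfg: &CompressedCacheConfig, available_vram: usize) -> usize {
    available_vram / per_token_bytes(cfg)
}

fn main() {
    // Example: a Llama-3-8B-like shape with a 4-bit compressed cache.
    let cfg = CompressedCacheConfig {
        num_layers: 32,
        num_kv_heads: 8,
        head_dim: 128,
        bits: 4,
    };
    // 8 * 128 * 2 * 4/8 * 32 = 32768 bytes (32 KiB) per token.
    assert_eq!(per_token_bytes(&cfg), 32 * 1024);
    // A 2 GiB budget then fits 65536 cached tokens.
    let budget = 2usize * 1024 * 1024 * 1024;
    assert_eq!(max_tokens(&cfg, budget), 65536);
    println!(
        "per-token = {} B, max tokens = {}",
        per_token_bytes(&cfg),
        max_tokens(&cfg, budget)
    );
}
```

The resulting `max_tokens` would become the `max_seq_len` enforced by the cache, and `max_tokens * per_token_bytes` is the non-zero estimate to hand to device mapping.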
Acceptance criteria
- `auto_device_map.rs` receives a non-zero KV cache size estimate when compressed cache is active.
- Requests that would exceed `max_seq_len` fail with a clear error instead of exhausting GPU memory.
- Server mode can no longer be driven out of GPU memory by arbitrarily long sequences.