Problem / Motivation
When compressed cache is active, `ContextSize(0)` is used and `auto_device_map.rs` returns 0 bytes for KV cache size estimation. This means no memory limit is enforced. In server mode (HTTP API), an attacker can send arbitrarily long sequences to exhaust all GPU memory.
Identified by Copilot review (Finding C2).
Affected files
- `mistralrs-server-core/src/mistralrs_for_server_builder.rs` — `ContextSize(0)`
- `mistralrs-core/src/pipeline/loaders/auto_device_map.rs` — returns 0 for compressed cache
Solution
Even though compressed cache doesn't use PagedAttention's block pool, it still needs a memory budget:
- Estimate the compressed cache size from the model config:

  ```text
  per_token  = num_kv_heads * head_dim * 2 (K+V) * (bits / 8) * num_layers
  max_tokens = available_vram / per_token
  ```
- Enforce `max_seq_len` in the compressed cache:

  ```rust
  // `.into()` assumes the cache's error type is constructible from a string.
  if current_seq_len >= max_seq_len {
      return Err("Maximum sequence length exceeded for compressed cache".into());
  }
  ```
- Pass the actual memory estimate to device mapping instead of 0.
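The estimation step above can be sketched as a small standalone function. Note that `CompressedCacheConfig`, `per_token_bytes`, and `max_tokens` are hypothetical names for illustration, not the actual mistral.rs API, and the Llama-3-8B-like shape is only an example:

```rust
/// Illustrative config for a compressed KV cache (not the real mistral.rs type).
struct CompressedCacheConfig {
    num_layers: usize,
    num_kv_heads: usize,
    head_dim: usize,
    /// Bit width of each compressed cache element.
    bits: usize,
}

/// Bytes needed to cache K and V for one token across all layers:
/// num_kv_heads * head_dim * 2 (K+V) * bits/8 * num_layers.
fn per_token_bytes(cfg: &CompressedCacheConfig) -> usize {
    cfg.num_kv_heads * cfg.head_dim * 2 * cfg.bits / 8 * cfg.num_layers
}

/// Maximum number of cached tokens that fit in the given VRAM budget.
fn max_tokens(cfg: &CompressedCacheConfig, available_vram: usize) -> usize {
    available_vram / per_token_bytes(cfg)
}

fn main() {
    // Example: a Llama-3-8B-like shape with a 4-bit compressed cache.
    let cfg = CompressedCacheConfig {
        num_layers: 32,
        num_kv_heads: 8,
        head_dim: 128,
        bits: 4,
    };
    // 8 * 128 * 2 * 4/8 * 32 = 32768 bytes (32 KiB) per token.
    assert_eq!(per_token_bytes(&cfg), 32 * 1024);
    // A 2 GiB budget then fits 65536 cached tokens.
    let budget = 2usize * 1024 * 1024 * 1024;
    assert_eq!(max_tokens(&cfg, budget), 65536);
    println!(
        "per-token = {} B, max tokens = {}",
        per_token_bytes(&cfg),
        max_tokens(&cfg, budget)
    );
}
```

The resulting `max_tokens` would become the `max_seq_len` enforced by the cache, and `max_tokens * per_token_bytes` is the non-zero estimate to hand to device mapping.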
Acceptance criteria
- `auto_device_map.rs` receives a non-zero KV cache size estimate when compressed cache is active.
- Requests that would exceed `max_seq_len` fail with a clear error instead of exhausting GPU memory.
- Server mode can no longer be driven out of GPU memory by arbitrarily long sequences.