DoS protection: memory limit for compressed KV cache in server mode #45

@SaschaOnTour

Problem / Motivation

When the compressed cache is active, ContextSize(0) is used and auto_device_map.rs returns 0 bytes as the KV cache size estimate. As a result, no memory limit is enforced on the cache. In server mode (HTTP API), an attacker can send arbitrarily long sequences and exhaust GPU memory.

Identified by Copilot review (Finding C2).

Affected files

  • mistralrs-server-core/src/mistralrs_for_server_builder.rs: ContextSize(0)
  • mistralrs-core/src/pipeline/loaders/auto_device_map.rs: returns 0 for compressed cache

Solution

Even though compressed cache doesn't use PagedAttention's block pool, it still needs a memory budget:

  1. Estimate compressed cache size based on model config:

    per_token = num_kv_heads * head_dim * 2 (K+V) * bits/8 * num_layers
    max_tokens = available_vram / per_token
    
  2. Enforce max_seq_len in the compressed cache:

    if current_seq_len >= max_seq_len {
        return Err("Maximum sequence length exceeded for compressed cache");
    }

  3. Pass the actual memory estimate to device mapping instead of 0.
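
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the mistral.rs implementation: `KvCacheShape`, `CompressedCacheBudget`, `CacheError`, and `try_reserve` are hypothetical names introduced here. It assumes the per-token formula above and ignores any extra per-block metadata (scales, zero points) a real compressed cache may store.

```rust
use std::fmt;

/// Hypothetical model parameters; field names are illustrative, not mistral.rs types.
#[derive(Debug, Clone, Copy)]
struct KvCacheShape {
    num_kv_heads: usize,
    head_dim: usize,
    num_layers: usize,
    /// Bits per stored element after compression (e.g. 4 or 8).
    bits: usize,
}

impl KvCacheShape {
    /// Bytes needed to cache one token across all layers (K and V).
    fn bytes_per_token(&self) -> usize {
        // Multiply by bits before dividing so sub-byte widths are not
        // truncated to zero; round up to whole bytes.
        let elems = self.num_kv_heads * self.head_dim * 2 * self.num_layers;
        (elems * self.bits + 7) / 8
    }

    /// Largest token count that fits in `available_bytes`.
    fn max_tokens(&self, available_bytes: usize) -> usize {
        match self.bytes_per_token() {
            0 => 0,
            per_token => available_bytes / per_token,
        }
    }
}

/// Typed error returned instead of running into an out-of-memory crash.
#[derive(Debug, PartialEq)]
enum CacheError {
    MaxSeqLenExceeded { requested: usize, max_seq_len: usize },
}

impl fmt::Display for CacheError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CacheError::MaxSeqLenExceeded { requested, max_seq_len } => write!(
                f,
                "maximum sequence length exceeded for compressed cache: {requested} > {max_seq_len}"
            ),
        }
    }
}

impl std::error::Error for CacheError {}

/// Hypothetical bookkeeping for a single sequence's compressed cache.
struct CompressedCacheBudget {
    max_seq_len: usize,
    current_seq_len: usize,
}

impl CompressedCacheBudget {
    fn new(max_seq_len: usize) -> Self {
        Self { max_seq_len, current_seq_len: 0 }
    }

    /// Check the limit before appending `new_tokens`; nothing is recorded on failure.
    fn try_reserve(&mut self, new_tokens: usize) -> Result<(), CacheError> {
        let requested = self.current_seq_len.saturating_add(new_tokens);
        if requested > self.max_seq_len {
            return Err(CacheError::MaxSeqLenExceeded {
                requested,
                max_seq_len: self.max_seq_len,
            });
        }
        self.current_seq_len = requested;
        Ok(())
    }
}
```

With 8 KV heads, a head dimension of 128, 32 layers, and 4 bits per element, this gives 32 KiB per token, so a 1 GiB budget allows 32768 tokens. The same `bytes_per_token() * max_seq_len` product is the kind of nonzero estimate that step 3 would hand to device mapping.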

Acceptance criteria

  • Compressed cache has a configurable maximum sequence length
  • Server mode enforces the limit
  • Device mapping gets a realistic memory estimate
  • Long sequences return a proper error, not OOM crash
