KV Prefix Caching Support #11

@cvandesande

To support KV prefix caching as described in the Gateway API Inference Extension prefix-aware EPP configuration, the EPP service needs access to request bodies to compute prefix hashes.

Background:
The reference EPP implementation's prefix cache plugin extracts prompts directly from request.Body (via getUserInputBytes()) to compute prefix hashes for optimal routing. This means EPP services expect to receive the full request body via the Envoy ext-proc protocol.

Current State:

  • EPP uses headers-only mode (end_of_stream: true, request_body_mode: None)
  • BBR uses ngx_http_read_client_request_body() to buffer the body into nginx's internal buffer chain
  • BBR's read_request_body() function extracts data from nginx buffers (memory + file-backed)

Implementation Strategy:

  1. Single Body Read: Call ngx_http_read_client_request_body() once (coordinate between BBR and EPP)
  2. Shared Access: Both modules read from nginx's request_body->bufs chain
  3. Body-Aware gRPC: EPP switches to body-aware mode following Envoy ext-proc protocol:
    • Send ProcessingRequest with RequestHeaders (set end_of_stream: false)
    • Send ProcessingRequest with RequestBody containing the full body
    • Set request_body_mode: BodySendMode::Buffered
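The message ordering for body-aware mode can be sketched roughly as below. This is a minimal illustration, not the module's actual code: `ProcessingRequest` here is a simplified stand-in for the generated ext-proc protobuf type, and `build_requests` is a hypothetical helper. It assumes that in buffered mode the single `RequestBody` message carries the whole body and ends the stream.

```rust
/// Simplified stand-in for the ext-proc `ProcessingRequest` message.
/// Real code would use the types generated from the Envoy protobuf definitions.
#[derive(Debug, Clone, PartialEq)]
pub enum ProcessingRequest {
    RequestHeaders {
        headers: Vec<(String, String)>,
        end_of_stream: bool,
    },
    RequestBody {
        body: Vec<u8>,
        end_of_stream: bool,
    },
}

/// Hypothetical helper: builds the messages EPP would send for one request.
/// `body` is `None` in headers-only mode and `Some` in body-aware mode.
pub fn build_requests(
    headers: Vec<(String, String)>,
    body: Option<Vec<u8>>,
) -> Vec<ProcessingRequest> {
    match body {
        // Headers-only: the headers message ends the stream.
        None => vec![ProcessingRequest::RequestHeaders {
            headers,
            end_of_stream: true,
        }],
        // Body-aware: headers first with end_of_stream false, then the full body.
        Some(body) => vec![
            ProcessingRequest::RequestHeaders {
                headers,
                end_of_stream: false,
            },
            ProcessingRequest::RequestBody {
                body,
                end_of_stream: true,
            },
        ],
    }
}
```

Keeping the headers-only branch in the same helper makes the backward-compatible path explicit: the only difference in the first message is the `end_of_stream` flag.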

Execution Flow:

1. Request arrives
2. If BBR or EPP needs body: Call ngx_http_read_client_request_body() once
3. Body buffered by nginx into buffer chain (memory + file if large)
4. If BBR enabled: Read from nginx buffers, extract model, set header
5. If EPP enabled: Read from same nginx buffers, send via gRPC body message
6. Continue to upstream proxy
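
The flow above could be modelled as a small planning function. This is only a sketch: `Settings`, `Step`, and `plan` are hypothetical names, and it assumes BBR always needs the body when enabled while EPP needs it only in body-aware mode.

```rust
/// Hypothetical per-location settings relevant to the body read decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Settings {
    pub bbr_enabled: bool,
    pub epp_enabled: bool,
    pub epp_send_body: bool,
}

/// Hypothetical steps a request goes through, in order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    ReadBodyOnce,
    BbrExtractModel,
    EppSendHeadersOnly,
    EppSendHeadersAndBody,
    ProxyUpstream,
}

/// Plans the processing steps so the body is read at most once,
/// regardless of how many modules consume it.
pub fn plan(s: Settings) -> Vec<Step> {
    let epp_wants_body = s.epp_enabled && s.epp_send_body;
    let mut steps = Vec::new();

    // One shared read covers both consumers.
    if s.bbr_enabled || epp_wants_body {
        steps.push(Step::ReadBodyOnce);
    }
    if s.bbr_enabled {
        steps.push(Step::BbrExtractModel);
    }
    if s.epp_enabled {
        steps.push(if epp_wants_body {
            Step::EppSendHeadersAndBody
        } else {
            Step::EppSendHeadersOnly
        });
    }
    steps.push(Step::ProxyUpstream);
    steps
}
```

With EPP enabled and `epp_send_body` off, the plan contains no body read at all, which matches the current headers-only behaviour.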

Configuration Options to Add:

  • inference_epp_send_body on|off - enable body-aware mode (default: off for backward compatibility)
  • inference_epp_body_max_size - maximum body size for EPP (default: 10MB, same as BBR)
  • Consider unified inference_max_body_size if both BBR and EPP are commonly used together
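
As a rough sketch of how these options might be represented on the Rust side, the struct below captures the proposed defaults. `EppBodyConfig` and `parse_on_off` are hypothetical names, and interpreting "10MB" as 10 * 1024 * 1024 bytes is an assumption.

```rust
/// Hypothetical in-memory form of the proposed EPP body directives.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EppBodyConfig {
    /// `inference_epp_send_body on|off`
    pub send_body: bool,
    /// `inference_epp_body_max_size`, in bytes
    pub body_max_size: usize,
}

impl Default for EppBodyConfig {
    fn default() -> Self {
        Self {
            // Off by default so existing headers-only deployments are unchanged.
            send_body: false,
            // Assumption: "10MB" means 10 MiB.
            body_max_size: 10 * 1024 * 1024,
        }
    }
}

/// Parses an `on|off` directive argument; anything else is rejected.
pub fn parse_on_off(value: &str) -> Option<bool> {
    match value {
        "on" => Some(true),
        "off" => Some(false),
        _ => None,
    }
}
```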

Implementation Details:

  • Refactor read_request_body() into a shared utility function in src/modules/mod.rs
  • Both BBR and EPP call the same body reading function
  • Handle execution order: if both enabled, read body once, process for both
  • EPP sends body in gRPC RequestBody message (standard Envoy ext-proc protocol)
  • Maintain backward compatibility: headers-only remains default
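
A shared body reader might look something like the following. This is a sketch under stated assumptions, not the existing `read_request_body()`: `BodyChunk` is a hypothetical, simplified view of one nginx buffer (in memory or backed by a temp file), and `read_body_chunks` is a hypothetical name. Real code would walk `request_body->bufs` through the nginx bindings instead.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::PathBuf;

/// Hypothetical view of one buffer in the request body chain.
pub enum BodyChunk {
    Memory(Vec<u8>),
    File { path: PathBuf, offset: u64, len: usize },
}

#[derive(Debug)]
pub enum BodyError {
    TooLarge { limit: usize },
    Io(io::Error),
}

/// Concatenates all chunks into one contiguous body, enforcing `max_size`.
/// Both BBR and EPP could call this on the same chain.
pub fn read_body_chunks(chunks: &[BodyChunk], max_size: usize) -> Result<Vec<u8>, BodyError> {
    // Sum the sizes first so an oversized body is rejected before any copying.
    let total: usize = chunks
        .iter()
        .map(|c| match c {
            BodyChunk::Memory(bytes) => bytes.len(),
            BodyChunk::File { len, .. } => *len,
        })
        .sum();
    if total > max_size {
        return Err(BodyError::TooLarge { limit: max_size });
    }

    let mut body = Vec::with_capacity(total);
    for chunk in chunks {
        match chunk {
            BodyChunk::Memory(bytes) => body.extend_from_slice(bytes),
            BodyChunk::File { path, offset, len } => {
                // File-backed buffers are read from disk at the recorded offset.
                let mut file = File::open(path).map_err(BodyError::Io)?;
                file.seek(SeekFrom::Start(*offset)).map_err(BodyError::Io)?;
                let mut part = vec![0u8; *len];
                file.read_exact(&mut part).map_err(BodyError::Io)?;
                body.extend_from_slice(&part);
            }
        }
    }
    Ok(body)
}
```

Returning an owned `Vec<u8>` keeps the sketch simple. If both modules run on the same request, the result could be computed once and handed to each of them rather than calling the reader twice.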

Trade-offs:

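  • Higher latency: The entire body must be buffered before the routing decision can be made (headers-only mode can route as soon as headers arrive)
  • Required for prefix caching: The reference EPP computes prefix hashes from the request body, so sending the body over ext-proc is a protocol requirement, not a design choice
  • Memory efficient: BBR and EPP share nginx's buffer chain, so the body is not buffered twice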
