Description
To support KV prefix caching as described in the Gateway API Inference Extension prefix-aware EPP configuration, the EPP service needs access to request bodies to compute prefix hashes.
Background:
The reference EPP implementation's prefix cache plugin extracts prompts directly from request.Body (via getUserInputBytes()) to compute prefix hashes for optimal routing. This means EPP services expect to receive the full request body via the Envoy ext-proc protocol.
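To illustrate why the prompt bytes are needed, here is a minimal sketch of chained block hashing over a prompt. It is not the reference plugin's algorithm: the block size, the hash function (`DefaultHasher`), and the function names `prefix_hashes` and `shared_prefix_blocks` are all assumptions chosen for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical block size in bytes; the real plugin's value may differ.
const BLOCK_SIZE: usize = 64;

/// Returns one chained hash per full block of `prompt`.
/// Each hash covers the block plus the previous hash, so two prompts
/// share their first N hashes when their first N blocks are identical.
fn prefix_hashes(prompt: &[u8]) -> Vec<u64> {
    let mut hashes = Vec::new();
    let mut prev: u64 = 0;
    for block in prompt.chunks_exact(BLOCK_SIZE) {
        let mut hasher = DefaultHasher::new();
        prev.hash(&mut hasher);
        block.hash(&mut hasher);
        prev = hasher.finish();
        hashes.push(prev);
    }
    hashes
}

/// Counts how many leading blocks two prompts have in common.
fn shared_prefix_blocks(a: &[u8], b: &[u8]) -> usize {
    prefix_hashes(a)
        .iter()
        .zip(prefix_hashes(b).iter())
        .take_while(|(x, y)| x == y)
        .count()
}
```

A router could prefer the endpoint whose cached hashes match the longest run of the incoming prompt's hashes. None of this is computable from headers alone, which is why the body has to reach the EPP.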
Current State:
- EPP uses headers-only mode (`end_of_stream: true`, `request_body_mode: None`)
- BBR uses `ngx_http_read_client_request_body()` to buffer the body into nginx's internal buffer chain
- BBR's `read_request_body()` function extracts data from nginx buffers (memory + file-backed)
Implementation Strategy:
- Single Body Read: Call `ngx_http_read_client_request_body()` once (coordinate between BBR and EPP)
- Shared Access: Both modules read from nginx's `request_body->bufs` chain
- Body-Aware gRPC: EPP switches to body-aware mode following Envoy ext-proc protocol:
  - Send `ProcessingRequest` with `RequestHeaders` (set `end_of_stream: false`)
  - Send `ProcessingRequest` with `RequestBody` containing the full body
  - Set `request_body_mode: BodySendMode::Buffered`
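The body-aware message sequence could look roughly like the sketch below. The `ProcessingRequest` enum is a hand-written stand-in for the generated ext-proc protobuf types, not the real bindings, and `build_messages` is a hypothetical helper. Setting `end_of_stream: true` on the body message assumes the whole buffered body is sent in a single message.

```rust
/// Hypothetical stand-in for the ext-proc request messages named above.
#[derive(Debug, PartialEq)]
enum ProcessingRequest {
    RequestHeaders {
        headers: Vec<(String, String)>,
        end_of_stream: bool,
    },
    RequestBody {
        body: Vec<u8>,
        end_of_stream: bool,
    },
}

/// Builds the messages to send to the EPP for one request.
/// `body` is `None` in headers-only mode and `Some` in body-aware mode.
fn build_messages(
    headers: Vec<(String, String)>,
    body: Option<Vec<u8>>,
) -> Vec<ProcessingRequest> {
    match body {
        // Headers-only (current behavior): the headers message ends the stream.
        None => vec![ProcessingRequest::RequestHeaders {
            headers,
            end_of_stream: true,
        }],
        // Body-aware: headers first, then the full buffered body.
        Some(body) => vec![
            ProcessingRequest::RequestHeaders {
                headers,
                end_of_stream: false,
            },
            ProcessingRequest::RequestBody {
                body,
                end_of_stream: true,
            },
        ],
    }
}
```

Keeping both paths in one function makes the backward-compatible default explicit: with no body, the output is identical to today's single headers message.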
Execution Flow:
1. Request arrives
2. If BBR or EPP needs body: Call ngx_http_read_client_request_body() once
3. Body buffered by nginx into buffer chain (memory + file if large)
4. If BBR enabled: Read from nginx buffers, extract model, set header
5. If EPP enabled: Read from same nginx buffers, send via gRPC body message
6. Continue to upstream proxy
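The flow above can be sketched with a small model in which the body read is idempotent. Everything here is hypothetical: `Request`, `ensure_body`, `handle`, the `x-model-name` header, and the naive `extract_model` scan (a real module would use a JSON parser) only illustrate the read-once coordination, with `ensure_body` standing in for the single `ngx_http_read_client_request_body()` call.

```rust
/// Hypothetical request state; `reads` counts how often the body was buffered.
struct Request {
    raw_body: Vec<u8>,
    buffered: Option<Vec<u8>>,
    reads: u32,
    headers: Vec<(String, String)>,
    epp_body: Option<Vec<u8>>,
}

impl Request {
    fn new(raw_body: &[u8]) -> Self {
        Request {
            raw_body: raw_body.to_vec(),
            buffered: None,
            reads: 0,
            headers: Vec::new(),
            epp_body: None,
        }
    }

    /// Stand-in for the single body read: buffers on first call only.
    fn ensure_body(&mut self) -> &[u8] {
        if self.buffered.is_none() {
            self.reads += 1;
            self.buffered = Some(self.raw_body.clone());
        }
        self.buffered.as_deref().expect("body was just buffered")
    }
}

/// Naive scan for the value of a top-level "model" string field.
fn extract_model(body: &[u8]) -> Option<String> {
    let text = std::str::from_utf8(body).ok()?;
    let after_key = text.split_once("\"model\"")?.1;
    let after_colon = after_key.split_once(':')?.1;
    let after_quote = after_colon.split_once('"')?.1;
    let (value, _) = after_quote.split_once('"')?;
    Some(value.to_string())
}

/// Steps 2 to 5 of the flow: read once, then let each enabled module use it.
fn handle(req: &mut Request, bbr_enabled: bool, epp_send_body: bool) {
    if bbr_enabled {
        let model = extract_model(req.ensure_body());
        if let Some(model) = model {
            // Placeholder header name, not necessarily what BBR sets.
            req.headers.push(("x-model-name".to_string(), model));
        }
    }
    if epp_send_body {
        // The copy models serializing the body into the gRPC message.
        let body = req.ensure_body().to_vec();
        req.epp_body = Some(body);
    }
}
```

With both modules enabled, `reads` ends at 1. With neither enabled, the body is never buffered, which preserves the current headers-only latency.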
Configuration Options to Add:
- `inference_epp_send_body on|off` - enable body-aware mode (default: off for backward compatibility)
- `inference_epp_body_max_size` - maximum body size for EPP (default: 10MB, same as BBR)
- Consider unified `inference_max_body_size` if both BBR and EPP are commonly used together
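A possible shape for the parsed configuration, with the proposed defaults, is sketched below. The struct and parser names are hypothetical, and the sketch assumes nginx-style size suffixes (`k`, `m`) with "10MB" meaning 10 * 1024 * 1024 bytes. A real module would more likely rely on nginx's own directive parsing.

```rust
/// Assumed default: 10 MiB, matching the BBR limit mentioned above.
const DEFAULT_MAX_BODY: usize = 10 * 1024 * 1024;

/// Hypothetical config for the two proposed directives.
#[derive(Debug, Clone, PartialEq)]
struct EppBodyConfig {
    send_body: bool,      // inference_epp_send_body
    body_max_size: usize, // inference_epp_body_max_size
}

impl Default for EppBodyConfig {
    fn default() -> Self {
        // Headers-only stays the default for backward compatibility.
        EppBodyConfig {
            send_body: false,
            body_max_size: DEFAULT_MAX_BODY,
        }
    }
}

/// Parses an `on|off` flag value.
fn parse_flag(value: &str) -> Result<bool, String> {
    match value {
        "on" => Ok(true),
        "off" => Ok(false),
        other => Err(format!("expected on|off, got {}", other)),
    }
}

/// Parses sizes such as "4096", "512k", or "10m" into bytes.
fn parse_size(value: &str) -> Result<usize, String> {
    let v = value.trim().to_ascii_lowercase();
    let (digits, mult) = if let Some(n) = v.strip_suffix('k') {
        (n, 1024)
    } else if let Some(n) = v.strip_suffix('m') {
        (n, 1024 * 1024)
    } else {
        (v.as_str(), 1)
    };
    let n: usize = digits
        .parse()
        .map_err(|_| format!("invalid size: {}", value))?;
    n.checked_mul(mult)
        .ok_or_else(|| format!("size too large: {}", value))
}
```

If the unified `inference_max_body_size` option is adopted instead, the same `parse_size` helper could back a single shared limit.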
Implementation Details:
- Refactor `read_request_body()` into a shared utility function in `src/modules/mod.rs`
- Both BBR and EPP call the same body reading function
- Handle execution order: if both enabled, read body once, process for both
- EPP sends body in gRPC `RequestBody` message (standard Envoy ext-proc protocol)
- Maintain backward compatibility: headers-only remains default
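As a rough sketch of what the shared utility might do, the function below flattens a modeled buffer chain (memory buffers plus file-backed spill) into one byte vector while enforcing a size limit. The `Buf` enum and `read_body_chain` are hypothetical and do not use the real nginx bindings. The file-backed case is modeled as any `Read` implementor so the example stays self-contained.

```rust
use std::io::{self, Read};

/// Hypothetical model of one link in nginx's request body buffer chain.
enum Buf {
    /// Data held in memory.
    Memory(Vec<u8>),
    /// Data spilled to a temp file, modeled as a reader plus a byte count.
    File { reader: Box<dyn Read>, len: u64 },
}

/// Concatenates the chain into one buffer, rejecting bodies over `max_size`.
fn read_body_chain(chain: Vec<Buf>, max_size: usize) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    for buf in chain {
        // Check the limit before copying so oversized bodies fail early.
        let incoming = match &buf {
            Buf::Memory(bytes) => bytes.len() as u64,
            Buf::File { len, .. } => *len,
        };
        if out.len() as u64 + incoming > max_size as u64 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "request body exceeds configured maximum",
            ));
        }
        match buf {
            Buf::Memory(bytes) => out.extend_from_slice(&bytes),
            Buf::File { reader, len } => {
                // Read at most `len` bytes from the file-backed segment.
                reader.take(len).read_to_end(&mut out)?;
            }
        }
    }
    Ok(out)
}
```

Both BBR and EPP could call one function of this shape on the same chain, so the memory-versus-file handling and the size check live in a single place.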
Trade-offs:
- Higher latency: Must buffer entire body before routing decision (vs headers-only)
- Required for prefix caching: This is the standard protocol, not a design choice
- Memory efficient: No double buffering, shared nginx buffer chain