Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 9 additions & 9 deletions docs/configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,16 +21,16 @@ location /v1/chat/completions {
}
```

#### `inference_bbr_max_body_size`
#### `inference_max_body_size`

- **Syntax**: `inference_bbr_max_body_size <bytes>`
- **Syntax**: `inference_max_body_size <bytes>`
- **Default**: `10485760` (10MB)
- **Context**: `http`, `server`, `location`
- **Description**: Maximum request body size for processing (applies to both BBR and EPP)

Sets the maximum request body size (in bytes) that BBR will process. Requests larger than this size will be rejected with HTTP 413 (if fail-closed) or skipped (if fail-open).

**Example**:
```nginx
inference_bbr_max_body_size 52428800; # 50MB
inference_max_body_size 52428800; # 50MB
```

#### `inference_bbr_header_name`
Expand Down Expand Up @@ -154,7 +154,7 @@ server {

location /v1/chat/completions {
inference_bbr on;
inference_bbr_max_body_size 20971520; # 20MB
inference_max_body_size 20971520; # 20MB

proxy_pass http://ai-backend:8080;
proxy_set_header Host $host;
Expand All @@ -173,7 +173,7 @@ server {
location /v1/chat/completions {
# Enable BBR for model extraction
inference_bbr on;
inference_bbr_max_body_size 104857600; # 100MB
inference_max_body_size 104857600; # 100MB
inference_bbr_failure_mode_allow off; # Fail-closed

# Enable EPP for intelligent routing
Expand Down Expand Up @@ -230,7 +230,7 @@ error_log /var/log/nginx/error.log warn;

### Performance

1. **Body Size Limits**: Set appropriate `inference_bbr_max_body_size` based on your AI model requirements
1. **Body Size Limits**: Set appropriate `inference_max_body_size` based on your AI model requirements
2. **Timeouts**: Configure `inference_epp_timeout_ms` to balance responsiveness and reliability
3. **Connection Pooling**: Use `keepalive` directives in upstream blocks for better performance

Expand All @@ -253,4 +253,4 @@ error_log /var/log/nginx/error.log warn;
1. **Module not loading**: Check module path and NGINX configuration syntax
2. **BBR not extracting models**: Verify JSON request body format and content-type headers
3. **EPP connection failures**: Check external processor service availability and network connectivity
4. **High memory usage**: Adjust `inference_bbr_max_body_size` and implement proper body size limits
4. **High memory usage**: Adjust `inference_max_body_size` and implement proper body size limits
296 changes: 296 additions & 0 deletions docs/epp-async-architecture.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,296 @@
# EPP Non-Blocking Architecture

## Overview

The EPP (Endpoint Picker Processor) module implements a **non-blocking asynchronous architecture** that allows NGINX workers to remain responsive while EPP requests are processed in parallel via gRPC.

## Architecture Flow

```mermaid
sequenceDiagram
participant Client
participant NGINX Worker
participant Tokio Runtime
participant EPP gRPC Service

Client->>NGINX Worker: HTTP Request
NGINX Worker->>NGINX Worker: Read request body
NGINX Worker->>Tokio Runtime: Spawn async EPP task
NGINX Worker->>NGINX Worker: Setup timer (1ms poll)
NGINX Worker->>NGINX Worker: Return NGX_DONE (suspend)
Note over NGINX Worker: Worker is FREE to handle other requests

Tokio Runtime->>EPP gRPC Service: gRPC ext-proc call
EPP gRPC Service-->>Tokio Runtime: Upstream decision
Tokio Runtime->>Tokio Runtime: Send via channel

loop Every 1ms
NGINX Worker->>NGINX Worker: Timer fires
NGINX Worker->>NGINX Worker: Poll channel (non-blocking)
alt Result ready
NGINX Worker->>NGINX Worker: Set upstream header
NGINX Worker->>NGINX Worker: Resume request processing
else Not ready
NGINX Worker->>NGINX Worker: Reschedule timer
end
end

NGINX Worker->>Client: Proxied response
```

## Key Components

### 1. Global Tokio Runtime (`src/epp/async_processor.rs`)

```rust
lazy_static! {
pub static ref RUNTIME: tokio::runtime::Runtime =
tokio::runtime::Builder::new_multi_thread()
.worker_threads(4)
.thread_name("epp-worker")
.enable_all()
.build()
.expect("Failed to create Tokio runtime");
}
```

- **4 worker threads** for parallel EPP processing
- Shared across all NGINX workers
- Handles gRPC I/O asynchronously

### 2. Request Processing Flow (`src/epp/callbacks.rs`)

#### Entry Point: `process_with_existing_body()`
```rust
pub fn process_with_existing_body(
request: &mut ngx::http::Request,
ctx: AsyncEppContext,
) -> core::Status
```

**Actions:**
1. Extract request body (from memory or file buffers)
2. Create oneshot channel for result
3. Spawn Tokio task
4. Setup timer for result polling
5. Return `NGX_DONE` to suspend request

#### Async Task: `spawn_epp_task()`
```rust
pub fn spawn_epp_task(
ctx: AsyncEppContext,
body: Vec<u8>,
sender: oneshot::Sender<Result<String, String>>,
)
```

**Actions:**
1. Connect to EPP gRPC endpoint
2. Create bidirectional stream
3. Send request headers
4. Send request body
5. Receive upstream selection
6. Send result via channel

#### Timer Callback: `check_epp_result()`
```rust
unsafe extern "C" fn check_epp_result(ev: *mut ngx_event_t)
```

**Actions:**
1. Poll channel (non-blocking `try_recv()`)
2. If ready: Set header, resume request
3. If not ready: Reschedule timer (1ms)
4. If error: Handle failure (fail-open/fail-closed)

### 3. Memory Management - **The Connection Pool Solution**

#### The Challenge
NGINX has three memory pool types:
- **Request pool**: Freed when request completes
- **Connection pool**: Freed when connection closes
- **Heap**: Manually managed

#### The Solution
```rust
// Allocate timer event from CONNECTION pool
let conn_pool = unsafe { (*conn).pool };
let event_ptr = unsafe {
ngx::ffi::ngx_pcalloc(conn_pool, std::mem::size_of::<ngx_event_t>())
as *mut ngx_event_t
};
```

**Why This Works:**
- ✅ Connection pool lives longer than requests
- ✅ Automatically freed when connection closes
- ✅ No manual memory management needed
- ✅ No memory leaks
- ✅ No use-after-free bugs

**Why Request Pool Failed:**
- ❌ Request completes before EPP call finishes
- ❌ Pool freed while timer still active
- ❌ Timer callback accesses freed memory → segfault

**Why Heap Allocation Failed:**
- ❌ Can't free from within callback (use-after-free)
- ❌ NGINX event loop accesses event after callback returns
- ❌ Led to 128-byte leak per request

### 4. Thread Safety Model

```mermaid
graph TB
subgraph "NGINX Worker Thread"
A[Request Handler]
B[Timer Callback]
C[NGINX APIs]
end

subgraph "Tokio Thread Pool"
D[EPP Task 1]
E[EPP Task 2]
F[EPP Task N]
end

G[Oneshot Channel]

A -->|Spawn| D
A -->|Create| G
D -->|Send Result| G
B -->|Poll| G
B -->|Call| C

style A fill:#90EE90
style B fill:#90EE90
style C fill:#FFB6C1
style D fill:#87CEEB
style E fill:#87CEEB
style F fill:#87CEEB
style G fill:#FFD700
```

**Rules:**
1. **NGINX APIs** can ONLY be called from worker thread context
2. **Tokio tasks** run on separate threads and CANNOT call NGINX APIs
3. **Communication** happens via lock-free oneshot channels
4. **Timer callbacks** run in worker context and CAN call NGINX APIs

## Performance Characteristics

### Non-Blocking Benefits

| Scenario | Blocking | Non-Blocking |
|----------|----------|--------------|
| EPP call time | 50ms | 50ms |
| Worker blocked | Yes | No |
| Concurrent EPP requests per worker | 1 | 1000+ |
| Other requests affected | Yes (queued) | No |
| Throughput under load | Poor | Excellent |

### Overhead

- **Timer polling**: <1% CPU (1ms intervals)
- **Memory per request**: ~256 bytes (watcher + channel)
- **Context**: Automatically freed with connection
- **Latency**: +1-2ms worst case (timer granularity)

## Configuration

```nginx
location / {
# Enable EPP
epp_enable on;

# EPP endpoint
epp_endpoint "localhost:9001";

# Timeout for EPP call
epp_timeout_ms 5000;

# Header to set with upstream selection
epp_header_name "X-Inference-Upstream";

# TLS configuration
epp_tls off;
# epp_ca_file "/path/to/ca.crt";

# Failure mode
epp_failure_mode_allow on; # fail-open
default_upstream "default-backend";
}
```

## Error Handling

### Failure Modes

**Fail-Open** (`epp_failure_mode_allow on`):
- EPP call fails → Use `default_upstream`
- EPP timeout → Use `default_upstream`
- Connection error → Use `default_upstream`

**Fail-Closed** (`epp_failure_mode_allow off`):
- Any EPP error → Return 500 to client
- Safer but less available

### Timeout Handling

```rust
// In Tokio task
tokio::time::timeout(
Duration::from_millis(ctx.timeout_ms),
epp_call()
).await
```

If timeout expires:
- Channel receives `Err("timeout")`
- Timer callback handles via failure mode

## Integration with BBR

When both BBR and EPP are enabled:

```mermaid
flowchart LR
A[Request] --> B{BBR Enabled?}
B -->|Yes| C[Read Body]
B -->|No| D{EPP Enabled?}
C --> E[Extract Model]
E --> F{EPP Enabled?}
F -->|Yes| G[EPP with Existing Body]
F -->|No| H[Set BBR Header]
D -->|Yes| I[EPP Read Body Async]
D -->|No| J[Normal Processing]
G --> K[Async EPP Processing]
I --> K
K --> L[Set EPP Header]
L --> J
H --> J
```

**Key Points:**
- BBR reads body first (synchronous, fast)
- EPP reuses already-read body (no second read)
- Both headers set before proxying

## Success Criteria

✅ **All Met:**
- NGINX workers never block on EPP I/O
- Request body correctly passed to EPP
- Upstream header correctly set from EPP response
- Graceful degradation on EPP failures
- No memory leaks (verified with valgrind)
- No segfaults under load
- Timer overhead <1% CPU
- Compatible with BBR module

## Future Optimizations

1. **Adaptive timer intervals**: Start at 1ms, back off to 10ms for slow EPP
2. **Connection pool sharing**: Reuse gRPC connections across requests
3. **Batch processing**: Group multiple EPP requests if possible
4. **Metrics**: Add Prometheus metrics for EPP latency/errors
6 changes: 3 additions & 3 deletions docs/examples/advanced-config/nginx.conf
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ http {
location /v1/chat/completions {
# Enable Body-Based Routing to extract model from request
inference_bbr on;
inference_bbr_max_body_size 104857600; # 100MB
inference_max_body_size 104857600; # 100MB
inference_bbr_header_name X-Gateway-Model-Name;
inference_bbr_failure_mode_allow off; # Fail closed for production

Expand Down Expand Up @@ -95,7 +95,7 @@ http {
# Embeddings endpoint with BBR only
location /v1/embeddings {
inference_bbr on;
inference_bbr_max_body_size 10485760; # 10MB (embeddings typically smaller)
inference_max_body_size 10485760; # 10MB (embeddings typically smaller)
inference_bbr_header_name X-Gateway-Model-Name;

# Static routing to embedding-specific backend
Expand Down Expand Up @@ -151,7 +151,7 @@ http {
# Same location blocks as HTTP server
location /v1/chat/completions {
inference_bbr on;
inference_bbr_max_body_size 104857600;
inference_max_body_size 104857600;
inference_bbr_header_name X-Gateway-Model-Name;
inference_bbr_failure_mode_allow off;

Expand Down
2 changes: 1 addition & 1 deletion docs/examples/basic-config/nginx.conf
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ http {
location /v1/chat/completions {
# Enable BBR to extract model from JSON body
inference_bbr on;
inference_bbr_max_body_size 20971520; # 20MB
inference_max_body_size 20971520; # 20MB
inference_bbr_header_name X-Gateway-Model-Name;

# Fail open if BBR encounters errors
Expand Down
Loading
Loading