Description
Background
In multi-turn agent conversations with tool calls, historical context (especially tool outputs) can quickly consume substantial tokens, leading to increased API costs, slower LLM response times, and context window limitations.
Inspired by Claude's context management approach (reference), this feature proposes implementing transparent context compression within the AI gateway, requiring no changes to existing agent implementations.
Objective
Implement automatic context compression that intercepts multi-turn conversations, stores historical tool outputs in external memory, and seamlessly retrieves them when needed—completely transparent to the agent.
Architecture Overview
```mermaid
sequenceDiagram
    participant Agent
    participant Gateway as AI Gateway
    participant LLM
    participant Memory
    participant Tool
    Agent->>Gateway: Request (Prompt + Tool History)
    Gateway->>Memory: Store tool outputs
    Memory-->>Gateway: context_id
    Gateway->>Gateway: Inject memory tool definition
    Gateway->>LLM: Compressed request
    LLM-->>Gateway: Response (may call read_memory)
    
    alt LLM needs historical context
        Gateway->>Memory: Auto-retrieve context
        Memory-->>Gateway: Historical data
        Gateway->>LLM: Re-request with context
        LLM-->>Gateway: Final response
    end
    
    Gateway-->>Agent: Final response
```
Implementation Approach
1. Memory Tool Definition
Add built-in memory management tools to the plugin's tool registry:
Tool: save_context
- Stores conversation context/tool outputs
- Returns: context_id for later retrieval

Tool: read_memory
- Parameters: context_id
- Returns: previously stored context
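For illustration, a sketch of how these two tools could be declared in OpenAI function-calling format; the parameter names and descriptions here are assumptions, and the real definitions would live in the plugin's tool registry:

```go
// memoryToolDefinitions returns the two built-in memory tools in OpenAI
// function-calling format, ready to be appended to the request's "tools"
// array. Parameter names and descriptions are illustrative.
func memoryToolDefinitions() []map[string]any {
	return []map[string]any{
		{
			"type": "function",
			"function": map[string]any{
				"name":        "save_context",
				"description": "Store conversation context or a tool output in external memory.",
				"parameters": map[string]any{
					"type": "object",
					"properties": map[string]any{
						"content": map[string]any{"type": "string"},
					},
					"required": []string{"content"},
				},
			},
		},
		{
			"type": "function",
			"function": map[string]any{
				"name":        "read_memory",
				"description": "Retrieve previously stored context by its context_id.",
				"parameters": map[string]any{
					"type": "object",
					"properties": map[string]any{
						"context_id": map[string]any{"type": "string"},
					},
					"required": []string{"context_id"},
				},
			},
		},
	}
}
```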
 
2. Request Interception & Compression
Location: onHttpRequestBody in main.go
Compression Logic:
- Parse the messages array from the request body
- Identify tool call results (role: tool or function)
- Store the identified content via the external memory service
- Replace the original content with context references
- Inject the memory tool definitions into the request
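A minimal sketch of this compression step, assuming an OpenAI-style chat body; saveContext stands in for the memory service call from section 5, and memoryToolDefinitions is the helper sketched in section 1. The actual implementation would hook into onHttpRequestBody and should edit the JSON body in place rather than round-trip it through the simplified structs below:

```go
import "encoding/json"

// Simplified request model: only the fields this sketch touches are kept.
type chatMessage struct {
	Role       string `json:"role"`
	Content    string `json:"content,omitempty"`
	ToolCallID string `json:"tool_call_id,omitempty"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Tools    []any         `json:"tools,omitempty"`
}

// compressToolHistory offloads historical tool outputs to external memory,
// replaces them with compact references, and injects the memory tools.
// On any error the caller should fall back to the original body.
func compressToolHistory(body []byte, saveContext func(content string) (string, error)) ([]byte, error) {
	var req chatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return body, err
	}
	for i, m := range req.Messages {
		// Only historical tool/function results are offloaded.
		if m.Role != "tool" && m.Role != "function" {
			continue
		}
		id, err := saveContext(m.Content)
		if err != nil {
			return body, err
		}
		// Replace the bulky output with a reference the LLM can resolve
		// later via the read_memory tool.
		req.Messages[i].Content = "[stored in external memory, context_id=" + id +
			"; call read_memory with this id to retrieve the full output]"
	}
	// Inject the built-in memory tool definitions (see section 1).
	for _, t := range memoryToolDefinitions() {
		req.Tools = append(req.Tools, t)
	}
	return json.Marshal(req)
}
```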
 
3. Response Interception & Auto-Retrieval
Location: onStreamingResponseBody and onHttpResponseBody in main.go
Auto-Retrieval Logic:
- Parse the LLM response for tool_calls of type read_memory
- Extract the context_id parameter
- Fetch the context from the memory service
- Reconstruct the request with the retrieved context
- Re-invoke the LLM (transparently to the agent)
- Return the final response
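A sketch of the detection step for non-streaming responses, assuming an OpenAI-style completion schema. Streaming responses would additionally require buffering or incrementally scanning SSE chunks in onStreamingResponseBody; context retrieval, request reconstruction, and re-invocation are left to the surrounding handler:

```go
import "encoding/json"

type toolCall struct {
	ID       string `json:"id"`
	Function struct {
		Name      string `json:"name"`
		Arguments string `json:"arguments"` // JSON-encoded string, per OpenAI convention
	} `json:"function"`
}

type chatResponse struct {
	Choices []struct {
		Message struct {
			ToolCalls []toolCall `json:"tool_calls"`
		} `json:"message"`
	} `json:"choices"`
}

// detectMemoryReads scans a non-streaming completion for read_memory tool
// calls and returns the requested context_ids. The gateway then fetches each
// context, appends the matching "tool" messages, and re-invokes the LLM
// before anything is returned to the agent.
func detectMemoryReads(body []byte) ([]string, error) {
	var resp chatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var ids []string
	for _, choice := range resp.Choices {
		for _, tc := range choice.Message.ToolCalls {
			if tc.Function.Name != "read_memory" {
				continue
			}
			var args struct {
				ContextID string `json:"context_id"`
			}
			if err := json.Unmarshal([]byte(tc.Function.Arguments), &args); err == nil && args.ContextID != "" {
				ids = append(ids, args.ContextID)
			}
		}
	}
	return ids, nil
}
```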
 
4. Configuration Schema
Example Configuration:
```yaml
provider:
  type: openai
  apiTokens:
    - "sk-xxx"
  contextCompression:
    enabled: true
    memoryService:
      redis:
        service_name: redis.static
        service_port: 6379
        username: default
        password: '123456'
        timeout: 1000
        database: 0
    compressionBytesThreshold: 1000  # Only compress if saved bytes exceed this value
```
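One possible Go mapping of the contextCompression block for provider/provider.go; the struct and field names below are assumptions that simply mirror the YAML above:

```go
// contextCompressionConfig mirrors the contextCompression block in the
// example configuration. Names are illustrative, not the final layout.
type contextCompressionConfig struct {
	Enabled                   bool                `yaml:"enabled" json:"enabled"`
	MemoryService             memoryServiceConfig `yaml:"memoryService" json:"memoryService"`
	CompressionBytesThreshold int                 `yaml:"compressionBytesThreshold" json:"compressionBytesThreshold"`
}

type memoryServiceConfig struct {
	Redis redisMemoryConfig `yaml:"redis" json:"redis"`
}

type redisMemoryConfig struct {
	ServiceName string `yaml:"service_name" json:"service_name"`
	ServicePort int    `yaml:"service_port" json:"service_port"`
	Username    string `yaml:"username" json:"username"`
	Password    string `yaml:"password" json:"password"`
	Timeout     int    `yaml:"timeout" json:"timeout"` // in milliseconds
	Database    int    `yaml:"database" json:"database"`
}
```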
5. Memory Service Integration
Interface Design:
```go
type MemoryService interface {
    // Store context and return unique ID
    SaveContext(ctx wrapper.HttpContext, content string) (string, error)
    
    // Retrieve context by ID
    ReadContext(ctx wrapper.HttpContext, contextId string) (string, error)
}
```
HTTP Client Implementation:
Use existing wrapper.DispatchHttpCall() for memory service communication, similar to the context fetching implementation in provider/context.go.
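Purely for illustration, here is a blocking net/http version of the interface with simplified signatures (no wrapper.HttpContext parameter) and assumed /save and /read endpoints. Inside the Wasm plugin the same flow would be expressed with the non-blocking wrapper.DispatchHttpCall pattern from provider/context.go, or with the Redis service configured above:

```go
import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// httpMemoryService is an illustrative, blocking implementation against a
// simple key-value HTTP API (/save and /read are assumed endpoints). It only
// demonstrates the data flow; the in-plugin version must be asynchronous.
type httpMemoryService struct {
	baseURL string
	client  *http.Client
}

func (s *httpMemoryService) SaveContext(content string) (string, error) {
	// Generate an unpredictable 128-bit context ID (see "Security" below).
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	id := hex.EncodeToString(buf)

	resp, err := s.client.Post(s.baseURL+"/save?id="+id, "text/plain", bytes.NewBufferString(content))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("memory service returned status %d", resp.StatusCode)
	}
	return id, nil
}

func (s *httpMemoryService) ReadContext(contextId string) (string, error) {
	resp, err := s.client.Get(s.baseURL + "/read?id=" + url.QueryEscape(contextId))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("memory service returned status %d", resp.StatusCode)
	}
	data, err := io.ReadAll(resp.Body)
	return string(data), err
}
```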
6. Cache Awareness (Threshold-Based Compression)
To avoid invalidating KV cache, implement threshold checking:
```go
func shouldCompress(historicalBytes, thresholdBytes int) bool {
    // Only compress if the byte savings exceed the threshold
    return historicalBytes > thresholdBytes
}
```
Default threshold: 1000 bytes (configurable via compressionBytesThreshold)
Rationale: KV cache invalidation cost must be outweighed by token reduction benefits.
Implementation Checklist
Core Components
- Define ContextCompression config structure in provider/provider.go
- Implement MemoryService interface with HTTP client
- Add compression logic in onHttpRequestBody
- Add interception logic in response handlers (onStreamingResponseBody, onHttpResponseBody)
- Implement memory tool definitions (save_context, read_memory)
- Add threshold-based compression gating logic
 
Testing
- Unit tests for compression/decompression logic (see the test sketch below)
- Integration tests with mock memory service
- E2E tests with real agent scenarios
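As an example, the request-side compression sketched in section 2 could be unit-tested with a fake saveContext standing in for the real memory service:

```go
import (
	"strings"
	"testing"
)

// TestCompressToolHistory checks that bulky tool output is replaced by a
// context reference while non-tool messages are left alone.
func TestCompressToolHistory(t *testing.T) {
	body := []byte(`{"model":"gpt-4o","messages":[
		{"role":"user","content":"list the files"},
		{"role":"tool","tool_call_id":"call_1","content":"<huge tool output>"}]}`)

	// Fake memory service: always "stores" successfully under a fixed ID.
	fakeSave := func(content string) (string, error) { return "ctx-123", nil }

	out, err := compressToolHistory(body, fakeSave)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if strings.Contains(string(out), "<huge tool output>") {
		t.Errorf("tool output should have been offloaded to external memory")
	}
	if !strings.Contains(string(out), "ctx-123") {
		t.Errorf("compressed message should reference the stored context_id")
	}
}
```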
 
Documentation
- Update README_EN.md with configuration examples
- Add architecture diagram to documentation
- Document memory service API requirements
- Provide example memory service implementations
 
Technical Considerations
- Async Handling: Memory service calls should use async dispatch to avoid blocking the main request flow
- Error Handling: If the memory service fails, fall back to uncompressed mode gracefully (see the sketch after this list)
- Security: Ensure context IDs are unpredictable and access-controlled
- Observability: Add metrics for:
  - Compression ratio (tokens saved)
  - Memory service latency
  - Cache hit/miss rates
- Streaming Support: Handle both streaming and non-streaming responses
- Multi-Provider: Ensure compatibility across all supported LLM providers
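A sketch of the graceful fallback combined with threshold gating, reusing the hypothetical compressToolHistory from section 2: whatever goes wrong, the agent's original request is forwarded untouched. In practice the savings would be estimated before contacting the memory service, so that unused contexts are never stored:

```go
// compressOrPassThrough applies compression only when it succeeds and is
// worthwhile; otherwise the original body passes through unchanged.
func compressOrPassThrough(body []byte, saveContext func(string) (string, error), thresholdBytes int) []byte {
	compressed, err := compressToolHistory(body, saveContext)
	if err != nil {
		// Memory service failure: fall back to uncompressed mode gracefully.
		return body
	}
	// Threshold gating: only accept the compressed body if the byte savings
	// outweigh the cost of invalidating the provider-side KV cache.
	if len(body)-len(compressed) <= thresholdBytes {
		return body
	}
	return compressed
}
```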
 
Benefits
- Zero Agent Changes: Existing agents work without modification
- Cost Reduction: Significant token savings on multi-turn conversations with tool calls
- Performance: Reduced LLM processing time thanks to shorter contexts
- Scalability: Handles arbitrarily long conversation histories
- Transparency: The entire compression/retrieval process is invisible to agents
 
References
Related Files
- plugins/wasm-go/extensions/ai-proxy/main.go - Main request/response handling
- plugins/wasm-go/extensions/ai-proxy/provider/provider.go - Provider configuration
- plugins/wasm-go/extensions/ai-proxy/provider/context.go - Existing context handling reference
- plugins/wasm-go/extensions/ai-proxy/provider/model.go - Request/response data structures