Description
Background
In multi-turn agent conversations with tool calls, historical context (especially tool outputs) can quickly consume substantial tokens, leading to increased API costs, slower LLM response times, and context window limitations.
Inspired by Claude's context management approach (reference), this feature proposes implementing transparent context compression within the AI gateway, requiring no changes to existing agent implementations.
Objective
Implement automatic context compression that intercepts multi-turn conversations, stores historical tool outputs in external memory, and seamlessly retrieves them when needed—completely transparent to the agent.
Architecture Overview
```mermaid
sequenceDiagram
    participant Agent
    participant Gateway as AI Gateway
    participant LLM
    participant Memory
    participant Tool
    Agent->>Gateway: Request (Prompt + Tool History)
    Gateway->>Memory: Store tool outputs
    Memory-->>Gateway: context_id
    Gateway->>Gateway: Inject memory tool definition
    Gateway->>LLM: Compressed request
    LLM-->>Gateway: Response (may call read_memory)
    
    alt LLM needs historical context
        Gateway->>Memory: Auto-retrieve context
        Memory-->>Gateway: Historical data
        Gateway->>LLM: Re-request with context
        LLM-->>Gateway: Final response
    end
    
    Gateway-->>Agent: Final response
```
Implementation Approach
1. Memory Tool Definition
Add built-in memory management tools to the plugin's tool registry:
Tool: save_context
- Stores conversation context/tool outputs
- Returns: context_id for later retrieval

Tool: read_memory
- Parameters: context_id
- Returns: previously stored context
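For illustration, a sketch of how these two tools could be declared in OpenAI function-calling format; the parameter names and descriptions here are assumptions, and the real definitions would live in the plugin's tool registry:

```go
// memoryToolDefinitions returns the two built-in memory tools in OpenAI
// function-calling format, ready to be appended to the request's "tools"
// array. Parameter names and descriptions are illustrative.
func memoryToolDefinitions() []map[string]any {
	return []map[string]any{
		{
			"type": "function",
			"function": map[string]any{
				"name":        "save_context",
				"description": "Store conversation context or a tool output in external memory.",
				"parameters": map[string]any{
					"type": "object",
					"properties": map[string]any{
						"content": map[string]any{"type": "string"},
					},
					"required": []string{"content"},
				},
			},
		},
		{
			"type": "function",
			"function": map[string]any{
				"name":        "read_memory",
				"description": "Retrieve previously stored context by its context_id.",
				"parameters": map[string]any{
					"type": "object",
					"properties": map[string]any{
						"context_id": map[string]any{"type": "string"},
					},
					"required": []string{"context_id"},
				},
			},
		},
	}
}
```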
 
2. Request Interception & Compression
Location: onHttpRequestBody in main.go
Compression Logic:
- Parse the messages array from the request body
- Identify tool call results (role: tool or function)
- Store the identified content via the external memory service
- Replace the original content with context references
- Inject the memory tool definitions into the request
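A minimal sketch of this compression step, assuming an OpenAI-style chat body; saveContext stands in for the memory service call from section 5, and memoryToolDefinitions is the helper sketched in section 1. The actual implementation would hook into onHttpRequestBody and should edit the JSON body in place rather than round-trip it through the simplified structs below:

```go
import "encoding/json"

// Simplified request model: only the fields this sketch touches are kept.
type chatMessage struct {
	Role       string `json:"role"`
	Content    string `json:"content,omitempty"`
	ToolCallID string `json:"tool_call_id,omitempty"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
	Tools    []any         `json:"tools,omitempty"`
}

// compressToolHistory offloads historical tool outputs to external memory,
// replaces them with compact references, and injects the memory tools.
// On any error the caller should fall back to the original body.
func compressToolHistory(body []byte, saveContext func(content string) (string, error)) ([]byte, error) {
	var req chatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return body, err
	}
	for i, m := range req.Messages {
		// Only historical tool/function results are offloaded.
		if m.Role != "tool" && m.Role != "function" {
			continue
		}
		id, err := saveContext(m.Content)
		if err != nil {
			return body, err
		}
		// Replace the bulky output with a reference the LLM can resolve
		// later via the read_memory tool.
		req.Messages[i].Content = "[stored in external memory, context_id=" + id +
			"; call read_memory with this id to retrieve the full output]"
	}
	// Inject the built-in memory tool definitions (see section 1).
	for _, t := range memoryToolDefinitions() {
		req.Tools = append(req.Tools, t)
	}
	return json.Marshal(req)
}
```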
 
3. Response Interception & Auto-Retrieval
Location: onStreamingResponseBody and onHttpResponseBody in main.go
Auto-Retrieval Logic:
- Parse the LLM response for tool_calls of type read_memory
- Extract the context_id parameter
- Fetch the context from the memory service
- Reconstruct the request with the retrieved context
- Re-invoke the LLM (transparently to the agent)
- Return the final response
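A sketch of the detection step for non-streaming responses, assuming an OpenAI-style completion schema. Streaming responses would additionally require buffering or incrementally scanning SSE chunks in onStreamingResponseBody; context retrieval, request reconstruction, and re-invocation are left to the surrounding handler:

```go
import "encoding/json"

type toolCall struct {
	ID       string `json:"id"`
	Function struct {
		Name      string `json:"name"`
		Arguments string `json:"arguments"` // JSON-encoded string, per OpenAI convention
	} `json:"function"`
}

type chatResponse struct {
	Choices []struct {
		Message struct {
			ToolCalls []toolCall `json:"tool_calls"`
		} `json:"message"`
	} `json:"choices"`
}

// detectMemoryReads scans a non-streaming completion for read_memory tool
// calls and returns the requested context_ids. The gateway then fetches each
// context, appends the matching "tool" messages, and re-invokes the LLM
// before anything is returned to the agent.
func detectMemoryReads(body []byte) ([]string, error) {
	var resp chatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var ids []string
	for _, choice := range resp.Choices {
		for _, tc := range choice.Message.ToolCalls {
			if tc.Function.Name != "read_memory" {
				continue
			}
			var args struct {
				ContextID string `json:"context_id"`
			}
			if err := json.Unmarshal([]byte(tc.Function.Arguments), &args); err == nil && args.ContextID != "" {
				ids = append(ids, args.ContextID)
			}
		}
	}
	return ids, nil
}
```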
 
4. Configuration Schema
Example Configuration:
```yaml
provider:
  type: openai
  apiTokens:
    - "sk-xxx"
  contextCompression:
    enabled: true
    memoryService:
      redis:
        service_name: redis.static
        service_port: 6379
        username: default
        password: '123456'
        timeout: 1000
        database: 0
    compressionBytesThreshold: 1000  # Only compress if saved bytes exceed this value
```
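One possible Go mapping of the contextCompression block for provider/provider.go; the struct and field names below are assumptions that simply mirror the YAML above:

```go
// contextCompressionConfig mirrors the contextCompression block in the
// example configuration. Names are illustrative, not the final layout.
type contextCompressionConfig struct {
	Enabled                   bool                `yaml:"enabled" json:"enabled"`
	MemoryService             memoryServiceConfig `yaml:"memoryService" json:"memoryService"`
	CompressionBytesThreshold int                 `yaml:"compressionBytesThreshold" json:"compressionBytesThreshold"`
}

type memoryServiceConfig struct {
	Redis redisMemoryConfig `yaml:"redis" json:"redis"`
}

type redisMemoryConfig struct {
	ServiceName string `yaml:"service_name" json:"service_name"`
	ServicePort int    `yaml:"service_port" json:"service_port"`
	Username    string `yaml:"username" json:"username"`
	Password    string `yaml:"password" json:"password"`
	Timeout     int    `yaml:"timeout" json:"timeout"` // in milliseconds
	Database    int    `yaml:"database" json:"database"`
}
```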
5. Memory Service Integration
Interface Design:
```go
type MemoryService interface {
    // Store context and return unique ID
    SaveContext(ctx wrapper.HttpContext, content string) (string, error)
    
    // Retrieve context by ID
    ReadContext(ctx wrapper.HttpContext, contextId string) (string, error)
}
```
HTTP Client Implementation:
Use existing wrapper.DispatchHttpCall() for memory service communication, similar to the context fetching implementation in provider/context.go.
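Purely for illustration, here is a blocking net/http version of the interface with simplified signatures (no wrapper.HttpContext parameter) and assumed /save and /read endpoints. Inside the Wasm plugin the same flow would be expressed with the non-blocking wrapper.DispatchHttpCall pattern from provider/context.go, or with the Redis service configured above:

```go
import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// httpMemoryService is an illustrative, blocking implementation against a
// simple key-value HTTP API (/save and /read are assumed endpoints). It only
// demonstrates the data flow; the in-plugin version must be asynchronous.
type httpMemoryService struct {
	baseURL string
	client  *http.Client
}

func (s *httpMemoryService) SaveContext(content string) (string, error) {
	// Generate an unpredictable 128-bit context ID (see "Security" below).
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	id := hex.EncodeToString(buf)

	resp, err := s.client.Post(s.baseURL+"/save?id="+id, "text/plain", bytes.NewBufferString(content))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("memory service returned status %d", resp.StatusCode)
	}
	return id, nil
}

func (s *httpMemoryService) ReadContext(contextId string) (string, error) {
	resp, err := s.client.Get(s.baseURL + "/read?id=" + url.QueryEscape(contextId))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("memory service returned status %d", resp.StatusCode)
	}
	data, err := io.ReadAll(resp.Body)
	return string(data), err
}
```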
6. Cache Awareness (Threshold-Based Compression)
To avoid invalidating KV cache, implement threshold checking:
```go
func shouldCompress(historicalBytes, thresholdBytes int) bool {
    // Only compress if the byte savings exceed the threshold
    return historicalBytes > thresholdBytes
}
```
Default threshold: 1000 bytes (configurable via compressionBytesThreshold)
Rationale: KV cache invalidation cost must be outweighed by token reduction benefits.
Implementation Checklist
Core Components
- Define ContextCompression config structure in provider/provider.go
- Implement MemoryService interface with HTTP client
- Add compression logic in onHttpRequestBody
- Add interception logic in response handlers (onStreamingResponseBody, onHttpResponseBody)
- Implement memory tool definitions (save_context, read_memory)
- Add threshold-based compression gating logic
 
Testing
- Unit tests for compression/decompression logic (see the test sketch below)
- Integration tests with mock memory service
- E2E tests with real agent scenarios
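As an example, the request-side compression sketched in section 2 could be unit-tested with a fake saveContext standing in for the real memory service:

```go
import (
	"strings"
	"testing"
)

// TestCompressToolHistory checks that bulky tool output is replaced by a
// context reference while non-tool messages are left alone.
func TestCompressToolHistory(t *testing.T) {
	body := []byte(`{"model":"gpt-4o","messages":[
		{"role":"user","content":"list the files"},
		{"role":"tool","tool_call_id":"call_1","content":"<huge tool output>"}]}`)

	// Fake memory service: always "stores" successfully under a fixed ID.
	fakeSave := func(content string) (string, error) { return "ctx-123", nil }

	out, err := compressToolHistory(body, fakeSave)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if strings.Contains(string(out), "<huge tool output>") {
		t.Errorf("tool output should have been offloaded to external memory")
	}
	if !strings.Contains(string(out), "ctx-123") {
		t.Errorf("compressed message should reference the stored context_id")
	}
}
```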
 
Documentation
- Update README_EN.md with configuration examples
- Add architecture diagram to documentation
- Document memory service API requirements
- Provide example memory service implementations
 
Technical Considerations
- Async Handling: Memory service calls should use async dispatch to avoid blocking the main request flow
- Error Handling: If the memory service fails, fall back to uncompressed mode gracefully (see the sketch after this list)
- Security: Ensure context IDs are unpredictable and access-controlled
- Observability: Add metrics for:
  - Compression ratio (tokens saved)
  - Memory service latency
  - Cache hit/miss rates
- Streaming Support: Handle both streaming and non-streaming responses
- Multi-Provider: Ensure compatibility across all supported LLM providers
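A sketch of the graceful fallback combined with threshold gating, reusing the hypothetical compressToolHistory from section 2: whatever goes wrong, the agent's original request is forwarded untouched. In practice the savings would be estimated before contacting the memory service, so that unused contexts are never stored:

```go
// compressOrPassThrough applies compression only when it succeeds and is
// worthwhile; otherwise the original body passes through unchanged.
func compressOrPassThrough(body []byte, saveContext func(string) (string, error), thresholdBytes int) []byte {
	compressed, err := compressToolHistory(body, saveContext)
	if err != nil {
		// Memory service failure: fall back to uncompressed mode gracefully.
		return body
	}
	// Threshold gating: only accept the compressed body if the byte savings
	// outweigh the cost of invalidating the provider-side KV cache.
	if len(body)-len(compressed) <= thresholdBytes {
		return body
	}
	return compressed
}
```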
 
Benefits
- Zero Agent Changes: Existing agents work without modification
- Cost Reduction: Significant token savings on multi-turn conversations with tool calls
- Performance: Reduced LLM processing time thanks to shorter contexts
- Scalability: Handles arbitrarily long conversation histories
- Transparency: The entire compression/retrieval process is invisible to agents
 
References
Related Files
- plugins/wasm-go/extensions/ai-proxy/main.go - Main request/response handling
- plugins/wasm-go/extensions/ai-proxy/provider/provider.go - Provider configuration
- plugins/wasm-go/extensions/ai-proxy/provider/context.go - Existing context handling reference
- plugins/wasm-go/extensions/ai-proxy/provider/model.go - Request/response data structures