Feature: Agent-Transparent Context Compression for Multi-Turn Conversations #3077

@johnlanni

Description

Background

In multi-turn agent conversations with tool calls, historical context (especially tool outputs) can quickly consume a substantial number of tokens, leading to increased API costs, slower LLM response times, and pressure on the context window.

Inspired by Claude's context management approach (reference), this feature proposes implementing transparent context compression within the AI gateway, requiring no changes to existing agent implementations.

Objective

Implement automatic context compression that intercepts multi-turn conversations, stores historical tool outputs in external memory, and seamlessly retrieves them when needed—completely transparent to the agent.

Architecture Overview

sequenceDiagram
    participant Agent
    participant Gateway as AI Gateway
    participant LLM
    participant Memory
    participant Tool

    Agent->>Gateway: Request (Prompt + Tool History)
    Gateway->>Memory: Store tool outputs
    Memory-->>Gateway: context_id
    Gateway->>Gateway: Inject memory tool definition
    Gateway->>LLM: Compressed request
    LLM-->>Gateway: Response (may call read_memory)
    
    alt LLM needs historical context
        Gateway->>Memory: Auto-retrieve context
        Memory-->>Gateway: Historical data
        Gateway->>LLM: Re-request with context
        LLM-->>Gateway: Final response
    end
    
    Gateway-->>Agent: Final response

Implementation Approach

1. Memory Tool Definition

Add built-in memory management tools to the plugin's tool registry:

Tool: save_context

  • Stores conversation context/tool outputs
  • Returns: context_id for later retrieval

Tool: read_memory

  • Parameters: context_id
  • Returns: Previously stored context
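
For illustration, the injected read_memory definition could follow the OpenAI function-calling schema, as in the sketch below. The tool and function struct names here are stand-ins (the plugin's actual request types live in provider/model.go), and save_context would be defined analogously.

type function struct {
    Name        string         `json:"name"`
    Description string         `json:"description"`
    Parameters  map[string]any `json:"parameters"`
}

type tool struct {
    Type     string   `json:"type"` // always "function" in the OpenAI schema
    Function function `json:"function"`
}

// readMemoryTool is appended to the request's tools array during compression.
var readMemoryTool = tool{
    Type: "function",
    Function: function{
        Name:        "read_memory",
        Description: "Retrieve previously stored conversation context by its context_id.",
        Parameters: map[string]any{
            "type": "object",
            "properties": map[string]any{
                "context_id": map[string]any{
                    "type":        "string",
                    "description": "The ID returned when the context was stored",
                },
            },
            "required": []string{"context_id"},
        },
    },
}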

2. Request Interception & Compression

Location: onHttpRequestBody in main.go

Compression Logic (sketched after the list):

  1. Parse messages array from request body
  2. Identify tool call results (role: tool or function)
  3. Store identified content via external memory service
  4. Replace original content with context references
  5. Inject memory tool definitions into request
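
A minimal sketch of steps 2 through 4, assuming an OpenAI-style message schema; chatMessage and the save callback are hypothetical stand-ins for the structs in provider/model.go and the MemoryService described in section 5. For simplicity the size check here is per message, while section 6 gates on the total savings.

type chatMessage struct {
    Role    string `json:"role"`
    Content string `json:"content"`
}

// compressMessages replaces bulky tool outputs with context references.
func compressMessages(msgs []chatMessage, threshold int, save func(string) (string, error)) []chatMessage {
    for i, m := range msgs {
        if m.Role != "tool" && m.Role != "function" {
            continue // only tool call results are offloaded
        }
        if len(m.Content) <= threshold {
            continue // too small to be worth invalidating the KV cache
        }
        id, err := save(m.Content)
        if err != nil {
            continue // storage failed: leave this message uncompressed
        }
        msgs[i].Content = "[stored externally; call read_memory with context_id=\"" + id + "\" to retrieve]"
    }
    return msgs
}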

3. Response Interception & Auto-Retrieval

Location: onStreamingResponseBody and onHttpResponseBody in main.go

Auto-Retrieval Logic (sketched after the list):

  1. Parse LLM response for tool_calls with type read_memory
  2. Extract context_id parameter
  3. Fetch context from memory service
  4. Reconstruct request with retrieved context
  5. Re-invoke LLM (transparently to agent)
  6. Return final response
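
The detection in steps 1 and 2 might look like the following; toolCall mirrors only the relevant slice of the OpenAI response schema (the plugin's actual structs are in provider/model.go).

import "encoding/json"

type toolCall struct {
    Function struct {
        Name      string `json:"name"`
        Arguments string `json:"arguments"` // JSON-encoded argument object
    } `json:"function"`
}

// extractContextID returns the context_id of the first read_memory call, if any.
func extractContextID(calls []toolCall) (string, bool) {
    for _, c := range calls {
        if c.Function.Name != "read_memory" {
            continue
        }
        var args struct {
            ContextID string `json:"context_id"`
        }
        if err := json.Unmarshal([]byte(c.Function.Arguments), &args); err != nil || args.ContextID == "" {
            continue
        }
        return args.ContextID, true
    }
    return "", false
}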

4. Configuration Schema

Example Configuration:

provider:
  type: openai
  apiTokens:
    - "sk-xxx"
  contextCompression:
    enabled: true
    memoryService:
      redis:
        service_name: redis.static
        service_port: 6379
        username: default
        password: '123456'
        timeout: 1000
        database: 0
    compressionBytesThreshold: 1000  # Only compress if saved bytes exceed this value
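
One possible Go mapping of this block, with field and struct names as assumptions (the real definition belongs in provider/provider.go, per the checklist below):

type contextCompressionConfig struct {
    Enabled                   bool                `json:"enabled"`
    MemoryService             memoryServiceConfig `json:"memoryService"`
    CompressionBytesThreshold int                 `json:"compressionBytesThreshold"`
}

type memoryServiceConfig struct {
    Redis *redisConfig `json:"redis,omitempty"`
}

type redisConfig struct {
    ServiceName string `json:"service_name"`
    ServicePort int    `json:"service_port"`
    Username    string `json:"username"`
    Password    string `json:"password"`
    Timeout     int    `json:"timeout"` // milliseconds
    Database    int    `json:"database"`
}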

5. Memory Service Integration

Interface Design:

type MemoryService interface {
    // Store context and return unique ID
    SaveContext(ctx wrapper.HttpContext, content string) (string, error)
    
    // Retrieve context by ID
    ReadContext(ctx wrapper.HttpContext, contextId string) (string, error)
}

HTTP Client Implementation:

Use existing wrapper.DispatchHttpCall() for memory service communication, similar to the context fetching implementation in provider/context.go.
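
An illustrative Redis-backed implementation is sketched below. The set/get callbacks are stand-ins for the actual async client (built on wrapper.DispatchHttpCall or the wasm-go Redis wrapper), the ctx parameter from the interface is omitted for brevity, and the key prefix is arbitrary; the point is the unguessable ID generation (see Technical Considerations: Security).

import (
    "crypto/rand"
    "encoding/hex"
    "fmt"
)

// newContextID returns a 128-bit random identifier so stored contexts
// cannot be enumerated or guessed.
func newContextID() (string, error) {
    buf := make([]byte, 16)
    if _, err := rand.Read(buf); err != nil {
        return "", err
    }
    return hex.EncodeToString(buf), nil
}

type redisMemoryService struct {
    set func(key, value string) error    // stand-in for an async Redis SET
    get func(key string) (string, error) // stand-in for an async Redis GET
}

func (s *redisMemoryService) SaveContext(content string) (string, error) {
    id, err := newContextID()
    if err != nil {
        return "", err
    }
    if err := s.set("ai-proxy:ctx:"+id, content); err != nil {
        return "", err
    }
    return id, nil
}

func (s *redisMemoryService) ReadContext(contextID string) (string, error) {
    v, err := s.get("ai-proxy:ctx:" + contextID)
    if err != nil {
        return "", fmt.Errorf("context %s unavailable: %w", contextID, err)
    }
    return v, nil
}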

6. Cache Awareness (Threshold-Based Compression)

To avoid invalidating KV cache, implement threshold checking:

func shouldCompress(savedBytes, threshold int) bool {
    // Only compress when the byte savings exceed the configured
    // compressionBytesThreshold; otherwise keep the prompt intact
    return savedBytes > threshold
}

Default threshold: 1000 bytes (configurable via compressionBytesThreshold)

Rationale: rewriting earlier messages changes the prompt prefix and invalidates the provider's KV (prefix) cache, so compression is only worthwhile when the token savings outweigh the cost of recomputing that cache.

Implementation Checklist

Core Components

  • Define ContextCompression config structure in provider/provider.go
  • Implement MemoryService interface with HTTP client
  • Add compression logic in onHttpRequestBody
  • Add interception logic in response handlers (onStreamingResponseBody, onHttpResponseBody)
  • Implement memory tool definitions (save_context, read_memory)
  • Add threshold-based compression gating logic

Testing

  • Unit tests for compression/decompression logic
  • Integration tests with mock memory service
  • E2E tests with real agent scenarios

Documentation

  • Update README_EN.md with configuration examples
  • Add architecture diagram to documentation
  • Document memory service API requirements
  • Provide example memory service implementations

Technical Considerations

  1. Async Handling: Memory service calls should use async dispatch to avoid blocking the main request flow
  2. Error Handling: If the memory service fails, fall back gracefully to uncompressed mode (see the sketch after this list)
  3. Security: Ensure context IDs are unpredictable and access-controlled
  4. Observability: Add metrics for:
    • Compression ratio (tokens saved)
    • Memory service latency
    • Cache hit/miss rates
  5. Streaming Support: Handle both streaming and non-streaming responses
  6. Multi-Provider: Ensure compatibility across all supported LLM providers
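
As referenced in consideration 2, the fallback can be a thin wrapper: any compression failure forwards the original body untouched. Here compress is a stand-in for the pass sketched in section 2.

// safeCompress never fails the request: if compression errors out, the
// original, uncompressed body is forwarded upstream instead.
func safeCompress(body []byte, compress func([]byte) ([]byte, error)) []byte {
    out, err := compress(body)
    if err != nil {
        return body // memory service down: degrade to uncompressed mode
    }
    return out
}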

Benefits

  • Zero Agent Changes: Existing agents work without modification
  • Cost Reduction: Significant token savings on multi-turn conversations with tool calls
  • Performance: Reduced LLM processing time for shorter contexts
  • Scalability: Handles arbitrarily long conversation histories
  • Transparency: Entire compression/retrieval process is invisible to agents

References

Related Files

  • plugins/wasm-go/extensions/ai-proxy/main.go - Main request/response handling
  • plugins/wasm-go/extensions/ai-proxy/provider/provider.go - Provider configuration
  • plugins/wasm-go/extensions/ai-proxy/provider/context.go - Existing context handling reference
  • plugins/wasm-go/extensions/ai-proxy/provider/model.go - Request/response data structures
