Transparent HTTP proxy that compresses agentic conversations before forwarding to upstream APIs. Combines rule-based structural transforms (inspired by context-compactor) with LLMLingua-2 semantic compression.
```
Claude Code / Codex CLI
          |
          v
    Proxy (:4000)
      1. Zone split (frozen / middle / hot)
      2. Strip thinking blocks (global, except last)
      3. Compact tool results >500 chars (middle zone)
      4. Strip narration filler (middle zone)
      5. LLMLingua-2 on remaining text (middle zone)
          |
          v
Upstream API (Anthropic / OpenAI)
```
Conversations are split into three zones based on turn boundaries:
- Frozen prefix (first N turns) — never modified. Preserves Anthropic prompt caching (cache breakpoints need byte-stable prefixes).
- Hot window (last M turns) — never modified. The model needs recent context for coherent continuation.
- Middle zone — everything between. This is where all compression fires.
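The zone split above can be sketched as a simple list partition. This is an illustrative sketch, not the actual `pipeline.py` code; it treats each message as one turn for simplicity.

```python
def split_zones(messages, frozen_prefix_turns=2, hot_window_turns=4):
    """Partition messages into frozen / middle / hot zones.

    Only the middle zone is eligible for compression. Note that for very
    short conversations the slices can overlap, which is why a MIN_MESSAGES
    gate exists before compression fires at all.
    """
    frozen = messages[:frozen_prefix_turns]
    middle = messages[frozen_prefix_turns:len(messages) - hot_window_turns]
    hot = messages[len(messages) - hot_window_turns:]
    return frozen, middle, hot
```

With the defaults (`FROZEN_PREFIX_TURNS=2`, `HOT_WINDOW_TURNS=4`), a 10-message conversation keeps the first 2 and last 4 messages byte-stable and compresses only the 4 in between.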
- Strip thinking — Remove `thinking` blocks from all assistant messages except the last. These are 40-46% of agentic session tokens.
- Compact tool results — Replace large tool results in the middle zone with `[Compacted: N lines, M chars]` plus a first-line preview. Error results are preserved.
- Strip narration — Remove short filler text ("Let me...", "Sure...", "Great...") from assistant messages.
- LLMLingua-2 — Semantic token-level compression via XLM-RoBERTa-large. Scores and drops low-information tokens from remaining text blocks.
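The tool-result compaction rule can be sketched as follows. This is a hypothetical minimal version (the real `transforms.py` also skips error results and operates only on the middle zone):

```python
def compact_tool_result(text, threshold=500):
    """Replace an oversized tool result with a placeholder plus a
    first-line preview; small results pass through unchanged."""
    if len(text) <= threshold:
        return text
    lines = text.splitlines()
    preview = lines[0][:80] if lines else ""
    return f"[Compacted: {len(lines)} lines, {len(text)} chars] {preview}"
```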
```sh
cd experiments/prompt-proxy
uv run proxy.py
```

The model downloads on first run (~500MB). Subsequent starts load from the HuggingFace cache.
```sh
ANTHROPIC_BASE_URL=http://localhost:4000 claude
OPENAI_BASE_URL=http://localhost:4000 codex
```

| Env var | Default | Description |
|---|---|---|
| `PORT` | `4000` | Proxy listen port |
| `UPSTREAM_ANTHROPIC_URL` | `https://api.anthropic.com` | Anthropic API base URL |
| `UPSTREAM_OPENAI_URL` | `https://api.openai.com` | OpenAI API base URL |
| `COMPRESS` | `true` | Enable/disable LLMLingua compression |
| `LLMLINGUA_RATE` | `0.5` | Target compression rate (lower = more aggressive) |
| `LLMLINGUA_MIN_CHARS` | `200` | Minimum text block size for LLMLingua |
| `FROZEN_PREFIX_TURNS` | `2` | Turns protected at the start |
| `HOT_WINDOW_TURNS` | `4` | Turns protected at the end |
| `MIN_MESSAGES` | `6` | Minimum messages before compression kicks in |
| `TOOL_RESULT_COMPACT_THRESHOLD` | `500` | Tool result size threshold (chars) |
Every proxied response includes `x-compacted-*` headers:

- `x-compacted-original-tokens` / `x-compacted-compressed-tokens`
- `x-compacted-tokens-saved` / `x-compacted-reduction-pct`
- `x-compacted-transforms` — which transforms fired
- `x-compacted-structural-ms` / `x-compacted-llmlingua-ms`
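A client can inspect these headers to report savings per request. A small sketch, assuming the header values are plain decimal strings as listed above:

```python
def report_savings(headers):
    """Summarize the proxy's x-compacted-* response headers
    (illustrative helper, not part of the proxy itself)."""
    saved = int(headers.get("x-compacted-tokens-saved", 0))
    pct = headers.get("x-compacted-reduction-pct", "0")
    return f"saved {saved} tokens ({pct}%)"
```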
| Route | Description |
|---|---|
| `POST /v1/messages` | Anthropic proxy (Claude Code) |
| `POST /v1/chat/completions` | OpenAI proxy (Codex) |
| `GET /health` | Health check with config |
| `* /{path}` | Passthrough to Anthropic upstream |
```
prompt-proxy/
  proxy.py        — FastAPI proxy server, routing, upstream forwarding
  pipeline.py     — Compression orchestration (zone split + transforms + LLMLingua)
  transforms.py   — Pure rule-based transforms (strip_thinking, compact_tool_results, strip_narration)
  compressor.py   — LLMLingua-2 wrapper (model loading, compression API)
  pyproject.toml  — Dependencies
```
Auto-detects: MPS (Apple Silicon) > CUDA > CPU.
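The MPS > CUDA > CPU preference order reduces to a simple priority check. In practice the two flags would come from PyTorch's `torch.backends.mps.is_available()` and `torch.cuda.is_available()`; the pure function below is an illustrative sketch of the selection logic only.

```python
def pick_device(mps_available, cuda_available):
    """Return the preferred device string, preferring MPS (Apple Silicon),
    then CUDA, then CPU. Flags are injected so the logic is testable
    without torch installed."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"
```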