A curated list of techniques, tools, and research for reducing LLM token usage — with a focus on AI coding assistants like Claude Code, OpenAI Codex, and GitHub Copilot.
Every prompt and response costs tokens, and coding agents burn through them fast: large files, tool output, logs, and long sessions all inflate the context window. This list collects the drop-in tools, libraries, data formats, and papers that cut tokens while keeping answers intact.
- Surveys & Background
- Coding-Assistant Token Savers
- Prompt Compression Libraries
- Token-Efficient Data Formats
- Context & Memory Management
- Output Compression
- Research & Methods
- Star History
Start here for the lay of the land before picking a technique.
- Prompt Compression for Large Language Models: A Survey - Taxonomy of hard- and soft-prompt compression methods, mechanisms, and open problems.
Drop-in proxies, plugins, hooks, and MCP servers that cut tokens for Claude Code, Codex, Copilot, Cursor, and Aider.
- claude-rolling-context - Claude Code plugin that compresses old messages while keeping recent context verbatim.
- claude-shorthand - LLMLingua-2 prompt-compression hook for Claude Code.
- ClaudeShrink - Claude Code skill that shrinks large prompts and files with LLMLingua to save tokens.
- engram - Local-first context compression for AI coding tools, deduping redundant tokens across calls.
- entroly - Local proxy that compresses context for Claude Code, Codex, Cursor, and Aider.
- headroom - Compresses tool output, logs, files, and RAG chunks before they reach the LLM.
- llmtrim - Provider-agnostic Rust proxy that compresses input, output, and cache with no extra model calls.
- rtk - CLI proxy that cuts LLM token use 60-90% on common dev commands, single Rust binary.
- sigmap - Zero-dependency MCP server for AST-based code context reduction across 31 languages.
- token-optimizer-mcp - Claude Code MCP server reaching 95%+ token reduction through caching and optimization.
- token-reducer - Local-first Claude Code context compression using hybrid RAG and AST chunking.
- TokenTamer - Drop-in proxy that compresses bloated code context in real time to cut API costs.
- tokless - Unified CLI to install and update token-saving plugins for Claude Code, Codex, and OpenCode.
General-purpose SDKs you call directly to compress prompts in any LLM app.
- claw-compactor - 14-stage reversible, AST-aware pipeline for LLM token compression with zero inference cost.
- leanctx - Drop-in prompt-compression SDK for production LLM apps, built on LLMLingua-2.
- LLMLingua - Microsoft toolkit compressing prompts and KV-cache up to 20x with minimal quality loss.
- llmlingua-2-js - JavaScript/TypeScript implementation of LLMLingua-2 for browser and Node.
Compact, LLM-friendly encodings that pass the same data in fewer tokens than JSON.
- TOON - Token-Oriented Object Notation, a lossless JSON encoding that cuts tokens ~30-60% for uniform data.
- Tooner - MCP proxy that converts JSON tool responses to TOON before they reach the model.
Persist and retrieve only what matters, so sessions stay short instead of replaying everything.
- codex-agent-mem - Local-first MCP memory layer for Codex and Claude with compact, token-saving context packs.
- mnemosyne - Zero-dependency knowledge compression, ingestion, and hybrid retrieval engine.
- Zep - Context engineering platform that assembles relationship-aware context from a temporal knowledge graph.
Reduce generation tokens — the part you pay the most for — without losing the answer.
- caveman - Claude Code skill that rewrites output in terse "caveman speak" to cut ~65% of tokens.
- scrooge-mode - Output-compression skill for Claude Code and Codex measured on real session output tokens.
- squeez - Squeezes verbose LLM agent tool output down to only the relevant lines.
Foundational papers behind the tools above.
- Adapting Language Models to Compress Contexts - AutoCompressors that summarize long contexts into compact summary vectors.
- In-Context Autoencoder for Context Compression - ICAE encodes long context into a few memory slots for a frozen LLM.
- Learning to Compress Prompts with Gist Tokens - Gisting trains an LM to compress prompts into reusable "gist" tokens, up to 26x.
- LLMLingua - Coarse-to-fine prompt compression using a small LM to drop low-information tokens.
- LLMLingua-2 - Task-agnostic prompt compression via token classification distilled from GPT-4.
- LLoCO: Learning Long Contexts Offline - Offline context compression plus LoRA finetuning for efficient long-context inference.
- LongLLMLingua - Prompt compression that mitigates "lost in the middle" and boosts RAG with fewer tokens.
Contributions are welcome! Please read the contribution guidelines first. In short: one entry per pull request, one entry per line, keep descriptions concise and present tense (ending with a period), verify the link resolves, and place the entry alphabetically within its section.