Awesome LLM Token Reduction

A curated list of techniques, tools, and research for reducing LLM token usage — with a focus on AI coding assistants like Claude Code, OpenAI Codex, and GitHub Copilot.

Every prompt and response costs tokens, and coding agents burn through them fast: large files, tool output, logs, and long sessions all inflate the context window. This list collects the drop-in tools, libraries, data formats, and papers that cut tokens while keeping answers intact.

Surveys & Background
Coding-Assistant Token Savers
Prompt Compression Libraries
Token-Efficient Data Formats
Context & Memory Management
Output Compression
Research & Methods
Star History

Surveys & Background

Start here for the lay of the land before picking a technique.

Prompt Compression for Large Language Models: A Survey - Taxonomy of hard- and soft-prompt compression methods, mechanisms, and open problems.

Coding-Assistant Token Savers

Drop-in proxies, plugins, hooks, and MCP servers that cut tokens for Claude Code, Codex, Copilot, Cursor, and Aider.

claude-rolling-context - Claude Code plugin that compresses old messages while keeping recent context verbatim.
claude-shorthand - LLMLingua-2 prompt-compression hook for Claude Code.
ClaudeShrink - Claude Code skill that shrinks large prompts and files with LLMLingua to save tokens.
engram - Local-first context compression for AI coding tools, deduping redundant tokens across calls.
entroly - Local proxy that compresses context for Claude Code, Codex, Cursor, and Aider.
headroom - Compresses tool output, logs, files, and RAG chunks before they reach the LLM.
llmtrim - Provider-agnostic Rust proxy that compresses input, output, and cache with no extra model calls.
rtk - CLI proxy that cuts LLM token use 60-90% on common dev commands, single Rust binary.
sigmap - Zero-dependency MCP server for AST-based code context reduction across 31 languages.
token-optimizer-mcp - Claude Code MCP server reaching 95%+ token reduction through caching and optimization.
token-reducer - Local-first Claude Code context compression using hybrid RAG and AST chunking.
TokenTamer - Drop-in proxy that compresses bloated code context in real time to cut API costs.
tokless - Unified CLI to install and update token-saving plugins for Claude Code, Codex, and OpenCode.

Prompt Compression Libraries

General-purpose SDKs you call directly to compress prompts in any LLM app.

claw-compactor - 14-stage reversible, AST-aware pipeline for LLM token compression with zero inference cost.
leanctx - Drop-in prompt-compression SDK for production LLM apps, built on LLMLingua-2.
LLMLingua - Microsoft toolkit compressing prompts and KV-cache up to 20x with minimal quality loss.
llmlingua-2-js - JavaScript/TypeScript implementation of LLMLingua-2 for browser and Node.

Token-Efficient Data Formats

Compact, LLM-friendly encodings that pass the same data in fewer tokens than JSON.

TOON - Token-Oriented Object Notation, a lossless JSON encoding that cuts tokens ~30-60% for uniform data.
Tooner - MCP proxy that converts JSON tool responses to TOON before they reach the model.

Context & Memory Management

Persist and retrieve only what matters, so sessions stay short instead of replaying everything.

codex-agent-mem - Local-first MCP memory layer for Codex and Claude with compact, token-saving context packs.
mnemosyne - Zero-dependency knowledge compression, ingestion, and hybrid retrieval engine.
Zep - Context engineering platform that assembles relationship-aware context from a temporal knowledge graph.

Output Compression

Reduce generation tokens — the part you pay the most for — without losing the answer.

caveman - Claude Code skill that rewrites output in terse "caveman speak" to cut ~65% of tokens.
scrooge-mode - Output-compression skill for Claude Code and Codex measured on real session output tokens.
squeez - Squeezes verbose LLM agent tool output down to only the relevant lines.

Research & Methods

Foundational papers behind the tools above.

Adapting Language Models to Compress Contexts - AutoCompressors that summarize long contexts into compact summary vectors.
In-Context Autoencoder for Context Compression - ICAE encodes long context into a few memory slots for a frozen LLM.
Learning to Compress Prompts with Gist Tokens - Gisting trains an LM to compress prompts into reusable "gist" tokens, up to 26x.
LLMLingua - Coarse-to-fine prompt compression using a small LM to drop low-information tokens.
LLMLingua-2 - Task-agnostic prompt compression via token classification distilled from GPT-4.
LLoCO: Learning Long Contexts Offline - Offline context compression plus LoRA finetuning for efficient long-context inference.
LongLLMLingua - Prompt compression that mitigates "lost in the middle" and boosts RAG with fewer tokens.

Contributing

Contributions are welcome! Please read the contribution guidelines first. In short: one entry per pull request, one entry per line, keep descriptions concise and present tense (ending with a period), verify the link resolves, and place the entry alphabetically within its section.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Awesome LLM Token Reduction

Contents

Surveys & Background

Coding-Assistant Token Savers

Prompt Compression Libraries

Token-Efficient Data Formats

Context & Memory Management

Output Compression

Research & Methods

Contributing

Star History

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Awesome LLM Token Reduction

Contents

Surveys & Background

Coding-Assistant Token Savers

Prompt Compression Libraries

Token-Efficient Data Formats

Context & Memory Management

Output Compression

Research & Methods

Contributing

Star History