# chonkify

**Extractive document compression that actually preserves what matters.**

chonkify compresses long documents into tight, information-dense context, built for RAG pipelines, agent memory, and anywhere you need to fit more signal into fewer tokens. Its proprietary selection algorithm outperforms the LLMLingua family on the multi-document benchmarks below.

By [Thomas "Thom" Heinrich](mailto:th@thomheinrich.de) · [chonkyDB.com](https://chonkydb.com)

---

## Why chonkify

Most compression tools optimize for token reduction. chonkify optimizes for **information recovery**: the compressed output retains the facts, structure, and reasoning that downstream models actually need.

In head-to-head multi-document benchmarks against Microsoft's LLMLingua family:

| Budget | chonkify | LLMLingua | LLMLingua2 |
|---|---:|---:|---:|
| 1500 tokens | **0.4302** | 0.2713 | 0.1559 |
| 1000 tokens | **0.3312** | 0.1804 | 0.1211 |

That's **+69% composite information recovery** vs LLMLingua and **+175%** vs LLMLingua2 on average across both budgets, winning 9 of 10 document-budget cells in the test suite. Full methodology in [RESULTS_vs_LLMLingua.md](RESULTS_vs_LLMLingua.md).
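The headline percentages can be reproduced directly from the table; a minimal sketch, assuming the composite pools raw scores across both budgets (the linked methodology defines the exact aggregation):

```python
# Composite information-recovery scores from the table, keyed by token budget.
chonkify_scores = {1500: 0.4302, 1000: 0.3312}
llmlingua = {1500: 0.2713, 1000: 0.1804}
llmlingua2 = {1500: 0.1559, 1000: 0.1211}

def pooled_gain(ours, baseline):
    """Relative improvement after pooling scores across token budgets."""
    return sum(ours.values()) / sum(baseline.values()) - 1.0

print(f"vs LLMLingua:  +{pooled_gain(chonkify_scores, llmlingua):.0%}")   # +69%
print(f"vs LLMLingua2: +{pooled_gain(chonkify_scores, llmlingua2):.0%}")  # +175%
```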

## How It Works

chonkify embeds document content, scores passages by information density and diversity, and extracts the highest-value subset that fits your token budget. The selection core ships as compiled extension modules; the benchmarks speak for themselves.
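For intuition, the embed → score → select loop can be sketched as a greedy, budget-constrained selection. This is an illustrative MMR-style toy, not chonkify's proprietary selector, and all names in it are hypothetical:

```python
import math

def toy_select(passages, scores, sims, budget, lam=0.7):
    """Greedy sketch of density-plus-diversity selection under a token budget.

    passages: list of (text, token_count) tuples.
    scores:   standalone information-density score per passage.
    sims:     sims[i][j] = similarity between passages i and j.
    lam:      trade-off between density and redundancy.
    """
    chosen, used = [], 0
    candidates = set(range(len(passages)))
    while candidates:
        best, best_val = None, -math.inf
        for i in candidates:
            if used + passages[i][1] > budget:
                continue  # would blow the token budget
            # Penalize overlap with passages already selected.
            redundancy = max((sims[i][j] for j in chosen), default=0.0)
            val = lam * scores[i] - (1 - lam) * redundancy
            if val > best_val:
                best, best_val = i, val
        if best is None:
            break  # nothing else fits
        chosen.append(best)
        used += passages[best][1]
        candidates.remove(best)
    return [passages[i][0] for i in chosen]
```

In a real pipeline the similarity matrix comes from embeddings rather than being hand-written, and the trade-off parameter balances information density against redundancy with what is already selected.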

## Install

chonkify ships as a compiled, platform-specific Python 3.11 wheel.

```bash
# Linux x86_64
pip install ./chonkify-0.2.2-cp311-cp311-manylinux*_x86_64.whl

# macOS Apple Silicon
pip install ./chonkify-0.2.2-cp311-cp311-macosx*_arm64.whl

# macOS Intel
pip install ./chonkify-0.2.2-cp311-cp311-macosx*_x86_64.whl

# Windows
pip install .\chonkify-0.2.2-cp311-cp311-win_amd64.whl
```

For local CPU/GPU embeddings (no API calls), also install:

```bash
pip install sentence-transformers
```

Or use the optional extra: `pip install "chonkify[local]"` (quoted so shells like zsh do not glob the brackets).
## Quick Start

### CLI

```bash
chonkify compress ./paper.pdf \
  --target-tokens 1200 \
  --output ./paper_compressed.txt \
  --metadata-out ./paper_meta.json
```

Multiple documents in one pass:

```bash
chonkify compress ./brief.md ./appendix.pdf \
  --target-tokens 1400 \
  --output ./bundle.txt
```

Pipe from stdin:

```bash
cat ./notes.txt | chonkify compress - --target-tokens 900 --output -
```

### Python API

```python
from chonkify import compress_documents

# With additional control over embedding providers:
from chonkify import (
    LocalEmbeddingConfig,
    LocalSentenceTransformerEmbeddingProvider,
    OpenAIEmbeddingConfig,
    OpenAIEmbeddingProvider,
    compress_documents,
)
```

## Embedding Backends

### Azure OpenAI (default)

```bash
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="<secret>"
export AZURE_OPENAI_API_VERSION="2024-10-21"
export CHONKIFY_AZURE_EMBEDDING_DEPLOYMENT="<deployment-name>"
```

### OpenAI

```bash
export OPENAI_API_KEY="<secret>"
export CHONKIFY_OPENAI_EMBEDDING_MODEL="text-embedding-3-large"
```

```bash
chonkify compress ./paper.pdf --embedding-backend openai --target-tokens 1200
```

### OpenAI-Compatible Endpoints

For providers like Together, Fireworks, or self-hosted APIs:

```bash
export OPENAI_API_KEY="<key>"
export CHONKIFY_OPENAI_BASE_URL="https://<provider>/v1"
export CHONKIFY_OPENAI_EMBEDDING_MODEL="<model-id>"
```

```bash
chonkify compress ./paper.pdf --embedding-backend openai-compatible --target-tokens 1200
```

If your endpoint rejects the `dimensions` parameter, add `--openai-omit-dimensions-parameter`. chonkify still validates that the output is 768-dimensional.

### Local (SentenceTransformers)

Fully offline after the first model download. Default model: `sentence-transformers/all-mpnet-base-v2`.

```bash
chonkify compress ./paper.pdf \
  --embedding-backend local \
  --local-device cuda \
  --target-tokens 1200
```

Device options: `cpu`, `cuda`, `cuda:0`, `mps`.

Validated with `sentence-transformers 5.1.0` and `torch 2.8.0+cu128` on an NVIDIA RTX 3090. Cold-cache run: ~13s. Warm-cache run: ~6s. Model footprint: ~419 MB. With `HF_HUB_OFFLINE=1`, the local backend runs fully air-gapped once cached.
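Putting the pieces together, a fully air-gapped invocation (assuming the model is already in the local cache) might look like:

```bash
# Forbid Hugging Face Hub network access; rely on the cached model only.
export HF_HUB_OFFLINE=1

chonkify compress ./paper.pdf \
  --embedding-backend local \
  --local-device cpu \
  --target-tokens 1200
```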

## Output Metadata

The optional `--metadata-out` JSON includes:

- Original and compressed token counts
- Compression factor and token reduction percentage
- Selected source blocks with source IDs and ranks
- Embedding provider and selection strategy used

If you pass `--query`, it is preserved in the metadata for provenance tracking.
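As an illustration only, a metadata file could look like the following; the field names and values here are hypothetical, not the actual schema:

```json
{
  "original_tokens": 18452,
  "compressed_tokens": 1187,
  "compression_factor": 15.5,
  "token_reduction_pct": 93.6,
  "embedding_provider": "azure-openai",
  "selection_strategy": "...",
  "query": null,
  "selected_blocks": [
    {"source_id": "doc-0", "rank": 1}
  ]
}
```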

## License

chonkify is proprietary software. The current release is licensed for **evaluation, testing, and review only**, not for production use. See [LICENSE.md](LICENSE.md) for full terms.

For commercial licensing, production access, or integration partnerships:
**th@chonkydb.com**
## Benchmark Details

See [BENCHMARKS.md](BENCHMARKS.md) for the full multi-document comparison methodology and per-document results.