Commit 859576f: initial upload (9 files changed, 555 additions, 0 deletions)

BENCHMARKS.md

# chonkify Benchmark Context vs LLMLingua and LLMLingua2

This file summarizes the most recent multidocument comparison from the internal document-compression benchmark suite. It is intended to be self-contained inside the minimal `chonkify` handoff folder.

## Suite Scope

- Documents: `5`
- Budgets: `1500`, `1000`
- Comparison metric for the percentage deltas below: mean `composite_info_recovery`
- Best internal baseline on this suite: `cpc_mmr/key_sentence_mmr`

## Composite Recovery Comparison

| Budget | Internal best | LLMLingua | Internal delta vs LLMLingua | LLMLingua2 | Internal delta vs LLMLingua2 |
|---|---:|---:|---:|---:|---:|
| `1500` | `0.4302` | `0.2713` | `+58.61%` | `0.1559` | `+175.99%` |
| `1000` | `0.3312` | `0.1804` | `+83.54%` | `0.1211` | `+173.49%` |
| Mean across both budgets | `0.3807` | `0.2258` | `+68.57%` | `0.1385` | `+174.90%` |
## Win Pattern

- Budget `1500`: the internal `cpc_mmr/key_sentence_mmr` method wins `4/5` documents.
- Budget `1000`: the internal `cpc_mmr/key_sentence_mmr` method wins `3/5` documents.
- Across the `10` document-budget cells in this suite, internal methods win `9/10`.

## Important Caveat

The current benchmark recovery scorer used for this comparison still relies on proxy metrics for sentence, heading, and numeric-fact recovery. These numbers are therefore useful as relative comparison evidence, but they are not a fully ground-truth semantic measure.

## Positioning

`chonkify` now packages the actual internal winner path label from that comparison: `cpc_mmr/key_sentence_mmr`. The packaged runtime uses the same CPC/MMR selection logic and benchmark-style key-sentence preparation chain, while still keeping the benchmark scoring harness itself outside the product package.

LICENSE.md

# Chonkify Evaluation-Only Proprietary License

Copyright (c) Thomas "Thom" Heinrich 2026
Contact: th@chonkydb.com

All rights reserved.

## 1. Limited Grant

Subject to full compliance with this license, you are granted a revocable, non-exclusive, non-transferable, non-sublicensable right to install and use `chonkify` solely for internal, non-production, non-commercial evaluation, testing, and review purposes.

## 2. No Ownership Transfer

`chonkify`, including all source code, documentation, interfaces, algorithms, packaging, and related materials, is licensed and not sold. No ownership rights are transferred.

## 3. Strict Prohibitions

Without prior written permission from Thomas "Thom" Heinrich, you may not:

- use the software in production or for any revenue-generating activity;
- distribute, share, publish, sublicense, lease, lend, sell, or otherwise provide the software to any third party;
- modify, adapt, translate, create derivative works from, or merge the software into other software;
- reverse engineer, decompile, disassemble, or otherwise attempt to derive non-public implementation details, except where such restriction is prohibited by mandatory law;
- remove or obscure copyright, attribution, branding, or license notices;
- use the software to build or improve a competing product or service.

## 4. Compliance and Scope

Any use outside the narrow evaluation scope in Section 1 is unlicensed and prohibited. This includes hosted use, customer delivery, internal production pilots, and operational deployment.

## 5. Termination

This license terminates automatically and immediately if you breach any term. Upon termination, you must stop using the software and destroy all copies in your possession or control.

## 6. Warranty Disclaimer

THE SOFTWARE IS PROVIDED "AS IS" AND "AS AVAILABLE", WITHOUT WARRANTIES OF ANY KIND, WHETHER EXPRESS, IMPLIED, OR STATUTORY, INCLUDING WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, NON-INFRINGEMENT, OR PERFORMANCE.

## 7. Liability Limitation

TO THE MAXIMUM EXTENT PERMITTED BY LAW, THOMAS "THOM" HEINRICH SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, CONSEQUENTIAL, OR OTHER DAMAGES ARISING FROM OR RELATING TO THE SOFTWARE OR THIS LICENSE.

## 8. Reservation of Rights

All rights not expressly granted are reserved by Thomas "Thom" Heinrich.

## 9. Contact

Licensing, commercial access, and other permissions require prior written approval:

- Thomas "Thom" Heinrich
- th@chonkydb.com

README.md

# chonkify

**Extractive document compression that actually preserves what matters.**

chonkify compresses long documents into tight, information-dense context — built for RAG pipelines, agent memory, and anywhere you need to fit more signal into fewer tokens. It uses a proprietary algorithm that consistently outperforms existing compression methods.

By [Thomas "Thom" Heinrich](mailto:th@thomheinrich.de) · [chonkyDB.com](https://chonkydb.com)

![chonkify-logo](chonkify.png)

---

## Why chonkify

Most compression tools optimize for token reduction. chonkify optimizes for **information recovery** — the compressed output retains the facts, structure, and reasoning that downstream models actually need.

In head-to-head multidocument benchmarks against Microsoft's LLMLingua family (mean composite information recovery; higher is better):

| Budget | chonkify | LLMLingua | LLMLingua2 |
|---|---:|---:|---:|
| 1500 tokens | **0.4302** | 0.2713 | 0.1559 |
| 1000 tokens | **0.3312** | 0.1804 | 0.1211 |

That's **+69% composite information recovery** vs LLMLingua and **+175%** vs LLMLingua2 on average across both budgets, winning 9 out of 10 document-budget cells in the test suite. Full methodology in [BENCHMARKS.md](BENCHMARKS.md).

## How It Works

chonkify embeds document content, scores passages by information density and diversity, and extracts the highest-value subset under your token budget. The selection core ships as compiled extension modules — the benchmarks speak for themselves.
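The selection core itself is proprietary and ships compiled, so what follows is only a rough sketch of the general idea described above: greedy MMR-style selection under a token budget. The function names, the centroid-as-relevance scoring, and the λ weighting are illustrative assumptions, not chonkify's actual algorithm.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(passages, budget_tokens, lam=0.7):
    """Greedily pick passages under a token budget, trading off
    relevance (here: centrality to the document) against redundancy."""
    dim = len(passages[0]["vec"])
    centroid = [sum(p["vec"][i] for p in passages) / len(passages) for i in range(dim)]
    selected, used = [], 0
    candidates = list(passages)
    while candidates:
        def mmr(p):
            rel = cosine(p["vec"], centroid)
            red = max((cosine(p["vec"], s["vec"]) for s in selected), default=0.0)
            return lam * rel - (1.0 - lam) * red
        best = max(candidates, key=mmr)
        candidates.remove(best)
        if used + best["tokens"] <= budget_tokens:
            selected.append(best)
            used += best["tokens"]
    return selected

passages = [
    {"id": "a", "vec": [1.0, 0.0], "tokens": 400},
    {"id": "b", "vec": [0.9, 0.1], "tokens": 400},  # near-duplicate of "a"
    {"id": "c", "vec": [0.0, 1.0], "tokens": 400},
]
picked = [p["id"] for p in mmr_select(passages, budget_tokens=800)]
print(picked)  # → ['b', 'c']
```

Once one copy of the near-duplicate pair is selected, the redundancy penalty pushes the second copy below the dissimilar passage, which is the density-vs-diversity trade-off in miniature.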
## Install

chonkify ships as a compiled, platform-specific Python 3.11 wheel.

```bash
# Linux x86_64
pip install ./chonkify-0.2.2-cp311-cp311-manylinux*_x86_64.whl

# macOS Apple Silicon
pip install ./chonkify-0.2.2-cp311-cp311-macosx*_arm64.whl

# macOS Intel
pip install ./chonkify-0.2.2-cp311-cp311-macosx*_x86_64.whl

# Windows
pip install .\chonkify-0.2.2-cp311-cp311-win_amd64.whl
```

For local CPU/GPU embeddings (no API calls), also install:

```bash
pip install sentence-transformers
```

Or use the optional extra: `pip install chonkify[local]`

## Quick Start

### CLI

```bash
chonkify compress ./paper.pdf \
  --target-tokens 1200 \
  --output ./paper_compressed.txt \
  --metadata-out ./paper_meta.json
```

Multiple documents in one pass:

```bash
chonkify compress ./brief.md ./appendix.pdf \
  --target-tokens 1400 \
  --output ./bundle.txt
```

Pipe from stdin:

```bash
cat ./notes.txt | chonkify compress - --target-tokens 900 --output -
```

### Python API

```python
from chonkify import compress_documents

# With additional control over embedding providers:
from chonkify import (
    LocalEmbeddingConfig,
    LocalSentenceTransformerEmbeddingProvider,
    OpenAIEmbeddingConfig,
    OpenAIEmbeddingProvider,
    compress_documents,
)
```
## Embedding Backends

### Azure OpenAI (default)

```bash
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="<secret>"
export AZURE_OPENAI_API_VERSION="2024-10-21"
export CHONKIFY_AZURE_EMBEDDING_DEPLOYMENT="<deployment-name>"
```

### OpenAI

```bash
export OPENAI_API_KEY="<secret>"
export CHONKIFY_OPENAI_EMBEDDING_MODEL="text-embedding-3-large"
```

```bash
chonkify compress ./paper.pdf --embedding-backend openai --target-tokens 1200
```

### OpenAI-Compatible Endpoints

For providers like Together, Fireworks, or self-hosted APIs:

```bash
export OPENAI_API_KEY="<key>"
export CHONKIFY_OPENAI_BASE_URL="https://<provider>/v1"
export CHONKIFY_OPENAI_EMBEDDING_MODEL="<model-id>"
```

```bash
chonkify compress ./paper.pdf --embedding-backend openai-compatible --target-tokens 1200
```

If your endpoint rejects the `dimensions` parameter, add `--openai-omit-dimensions-parameter`. chonkify still validates 768-dimensional output.

### Local (SentenceTransformers)

Fully offline after first model download. Default model: `sentence-transformers/all-mpnet-base-v2`.

```bash
chonkify compress ./paper.pdf \
  --embedding-backend local \
  --local-device cuda \
  --target-tokens 1200
```

Device options: `cpu`, `cuda`, `cuda:0`, `mps`.

Validated with `sentence-transformers 5.1.0` and `torch 2.8.0+cu128` on an NVIDIA RTX 3090. Cold-cache run: ~13s. Warm-cache run: ~6s. Model footprint: ~419 MB. With `HF_HUB_OFFLINE=1`, the local backend runs fully air-gapped once cached.
## Output Metadata

The optional `--metadata-out` JSON includes:

- Original and compressed token counts
- Compression factor and token reduction percentage
- Selected source blocks with source IDs and ranks
- Embedding provider and selection strategy used

If you pass `--query`, it is preserved in metadata for provenance tracking.
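For illustration only — the exact field names are not documented in this handoff, so the keys below are assumptions, not chonkify's actual schema — a metadata file covering the bullets above might look roughly like:

```json
{
  "original_tokens": 9842,
  "compressed_tokens": 1187,
  "compression_factor": 8.29,
  "token_reduction_pct": 87.9,
  "embedding_provider": "local:sentence-transformers/all-mpnet-base-v2",
  "selection_strategy": "cpc_mmr/key_sentence_mmr",
  "query": null,
  "selected_blocks": [
    {"source_id": "paper.pdf", "rank": 1, "tokens": 210}
  ]
}
```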
## License

chonkify is proprietary software. The current release is licensed for **evaluation, testing, and review only** — not for production use. See [LICENSE.md](LICENSE.md) for full terms.

For commercial licensing, production access, or integration partnerships:
**th@chonkydb.com**

## Benchmark Details

See [BENCHMARKS.md](BENCHMARKS.md) for the full multidocument comparison methodology and per-document results.