# chonkify Benchmark Snapshot vs LLMLingua and LLMLingua2
This handoff packages the current non-PDF release evidence for `chonkify`.
## Suite A: General `txt/md` Compression (`20` cases)
Interpretation: `LLMLingua` v1 can keep more raw facts on the very smallest texts only by violating the requested token budget and, on aggregate, expanding the input. `chonkify` is the budget-valid line on this corridor.
Interpretation: on fact-heavy corpora the current `chonkify` release is both smaller and higher quality than both `LLMLingua` variants.
## Combined Token Savings
`chonkify` now packages the internal winner path from that comparison: `cpc_mmr/key_sentence_mmr`. The packaged runtime uses the same CPC/MMR selection logic and the same benchmark-style key-sentence preparation chain, while keeping the benchmark scoring harness itself outside the product package.
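The packaged selection core ships compiled, but the general maximal marginal relevance (MMR) pattern behind a `key_sentence_mmr`-style path can be sketched. This is a minimal illustration over NumPy arrays; the function name, signature, and `lam` trade-off parameter are this sketch's own assumptions, not chonkify's API:

```python
import numpy as np

def mmr_select(embeddings: np.ndarray, relevance: np.ndarray,
               k: int, lam: float = 0.7) -> list[int]:
    """Greedy MMR: balance per-sentence relevance against redundancy
    with already-selected sentences, returning selected indices."""
    # Normalize rows so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected: list[int] = []
    candidates = list(range(len(emb)))
    while candidates and len(selected) < k:
        if not selected:
            best = max(candidates, key=lambda i: relevance[i])
        else:
            chosen = emb[selected]
            def score(i: int) -> float:
                redundancy = float(np.max(chosen @ emb[i]))
                return lam * relevance[i] - (1 - lam) * redundancy
            best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0.5`, a near-duplicate of an already-selected sentence loses to a less relevant but novel one, which is the behavior an extractive compressor wants under a tight budget.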
These two non-PDF benchmark corridors are the headline public evidence for this handoff and reflect the current `0.3.0` release line under hard budget constraints.
---

# README.md

**Extractive document compression that actually preserves what matters.**
chonkify compresses long documents into tight, information-dense context for RAG pipelines, agent memory, and any workflow where token budget matters as much as factual recovery. This release focuses on strong factual recovery under hard token budgets across general `txt`/`md` and fact-heavy document workloads.
Today, the clearest validated fit is content-dense non-PDF text: quantitative research digests, structured engineering notes, and reasoning traces where downstream models need exact facts more than fluent paraphrase. It remains a general-purpose document compressor, but this is the workload family where the current release is strongest.
By [Thomas "Thom" Heinrich](mailto:th@thomheinrich.de) · [chonkyDB.com](https://chonkydb.com)
---
## Why chonkify
Most compression tools optimize for token reduction. chonkify optimizes for **information recovery** — the compressed output retains the facts, structure, and reasoning that downstream models actually need.
On the current release corridors against Microsoft's LLMLingua family:
| Suite | chonkify | LLMLingua | LLMLingua2 |
|---|---:|---:|---:|
| general `txt/md` (`20` cases), `fact_recall_mean` | **0.8833** | 1.0000 | 0.8667 |
| general `txt/md`, `budget_ok_rate` | **1.0000** | 0.0000 | 0.3500 |
Across both suites combined, `chonkify` currently saves **75.20%** of source tokens, versus **62.95%** for `LLMLingua` and **62.76%** for `LLMLingua2`. Full methodology and caveats are in [BENCHMARKS.md](BENCHMARKS.md).
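The two aggregate styles above (a per-case rate and a pooled ratio) are straightforward to compute. A sketch with illustrative dict keys that are this example's own, not the benchmark harness's actual schema:

```python
def budget_ok_rate(cases: list[dict]) -> float:
    """Fraction of cases whose compressed output stayed within its token budget."""
    ok = sum(1 for c in cases if c["compressed_tokens"] <= c["budget"])
    return ok / len(cases)

def combined_token_savings(cases: list[dict]) -> float:
    """Pooled savings: 1 - (total compressed tokens / total original tokens)."""
    original = sum(c["original_tokens"] for c in cases)
    compressed = sum(c["compressed_tokens"] for c in cases)
    return 1.0 - compressed / original

# Made-up example numbers, not the benchmark data:
cases = [
    {"original_tokens": 4000, "compressed_tokens": 900, "budget": 1000},
    {"original_tokens": 2000, "compressed_tokens": 1100, "budget": 1000},
]
# budget_ok_rate(cases) -> 0.5; combined_token_savings(cases) -> ~0.667
```

Note that the pooled ratio weights long documents more heavily than a mean of per-case ratios would, which is why a single combined savings figure can be quoted across suites.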
## How It Works
chonkify builds source-faithful document units, scores them through a strict `768`-dimensional embedding interface, and returns a compact output that respects your token budget. The performance-sensitive selection core ships as compiled extension modules.
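To illustrate what a strict fixed-width embedding interface implies for integrators, here is a toy backend sketch. The `EmbeddingBackend` protocol and `embed` signature are assumptions for this example, not chonkify's published interface:

```python
from typing import Protocol

import numpy as np

EMBED_DIM = 768  # the fixed embedding width named in the text above

class EmbeddingBackend(Protocol):
    def embed(self, texts: list[str]) -> np.ndarray:
        """Return a (len(texts), EMBED_DIM) float array, one row per text."""
        ...

class HashingBackend:
    """Toy offline backend: deterministic hashed bag-of-words vectors,
    unit-normalized so downstream cosine scoring works."""
    def embed(self, texts: list[str]) -> np.ndarray:
        out = np.zeros((len(texts), EMBED_DIM), dtype=np.float32)
        for row, text in enumerate(texts):
            for token in text.lower().split():
                out[row, hash(token) % EMBED_DIM] += 1.0
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        return out / np.maximum(norms, 1e-9)
```

A real backend would call a local model or embedding API instead of hashing; the point is only that every provider must emit the same fixed `768`-wide, row-per-text shape.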
## Install
This refreshed handoff includes the current native `cp311` wheel matrix for the supported desktop/server targets. These four wheels were produced by the native GitHub Actions matrix run `23559149680`, and the Linux manylinux artifact was revalidated afterwards with a fresh-venv `ci/wheel_smoke.py` install smoke test.
For local CPU/GPU embeddings (no API calls), also install:
```bash
...
```

```python
from chonkify import (
    ...
)
```
Minimal example:
```python
from chonkify import compress_documents
result = compress_documents(
    ["Quarterly revenue rose 18%. Operating margin expanded to 27%. Guidance remains unchanged."],
    target_tokens=24,
)
print(result.compressed_text)
print(result.compressed_tokens)
```
## Embedding Backends
### Azure OpenAI (default)

The optional `--metadata-out` JSON includes:
- Original and compressed token counts
- Compression factor and token reduction percentage
- Selected source blocks with source IDs and ranks
- Embedding provider used
If you pass `--query`, it is preserved in metadata for provenance tracking.

For commercial licensing, production access, or integration partnerships:
## Benchmark Details
See [BENCHMARKS.md](BENCHMARKS.md) for the current release benchmark methodology and numbers.