
Commit 279ff9b

New version v1.1.0
1 parent dbb7120 commit 279ff9b

6 files changed

Lines changed: 60 additions & 39 deletions

BENCHMARKS.md

Lines changed: 23 additions & 20 deletions
@@ -1,32 +1,35 @@
-# chonkify Benchmark Context vs LLMLingua and LLMLingua2
+# chonkify Benchmark Snapshot vs LLMLingua and LLMLingua2

-This file summarizes the most recent multidocument comparison from the internal document-compression benchmark suite. It is intended to be self-contained inside the minimal `chonkify` handoff folder.
+This handoff packages the current non-PDF release evidence for `chonkify`.

-## Suite Scope
+## Suite A: General `txt/md` Compression (`20` cases)

-- Documents: `5`
-- Budgets: `1500`, `1000`
-- Comparison metric for the percentage deltas below: mean `composite_info_recovery`
-- Best internal baseline on this suite: `cpc_mmr/key_sentence_mmr`
+| Method | fact_recall_mean | exact_success_rate | budget_ok_rate | mean_budget_overrun_tokens | weighted token savings |
+|---|---:|---:|---:|---:|---:|
+| `chonkify` | `0.8833` | `0.6500` | `1.0000` | `0.00` | `15.02%` |
+| `LLMLingua` | `1.0000` | `1.0000` | `0.0000` | `26.80` | `-77.40%` |
+| `LLMLingua2` | `0.8667` | `0.6500` | `0.3500` | `4.45` | `0.00%` |

-## Composite Recovery Comparison
+Interpretation: `LLMLingua` v1 keeps more raw facts on the very smallest texts only by violating the requested token budget and, in aggregate, expanding the input. `chonkify` is the only budget-valid method on this corridor.

-| Budget | Internal best | LLMLingua | Internal delta vs LLMLingua | LLMLingua2 | Internal delta vs LLMLingua2 |
-|---|---:|---:|---:|---:|---:|
-| `1500` | `0.4302` | `0.2713` | `+58.61%` | `0.1559` | `+175.99%` |
-| `1000` | `0.3312` | `0.1804` | `+83.54%` | `0.1211` | `+173.49%` |
-| mean across both budgets | `0.3807` | `0.2258` | `+68.57%` | `0.1385` | `+174.90%` |
+## Suite B: Fact-Heavy Quant Research + Reasoning Traces (`22` cases)

-## Win Pattern
+| Method | fact_recall_mean | exact_success_rate | budget_ok_rate | mean_budget_overrun_tokens | weighted token savings |
+|---|---:|---:|---:|---:|---:|
+| `chonkify` | `0.5606` | `0.2727` | `1.0000` | `0.00` | `78.40%` |
+| `LLMLingua` | `0.1061` | `0.0000` | `0.2727` | `30.86` | `70.41%` |
+| `LLMLingua2` | `0.1212` | `0.0000` | `0.1364` | `54.55` | `66.10%` |

-- Budget `1500`: the internal `cpc_mmr/key_sentence_mmr` method wins `4/5` documents.
-- Budget `1000`: the internal `cpc_mmr/key_sentence_mmr` method wins `3/5` documents.
-- Across the `10` document-budget cells in this suite, internal methods win `9/10`.
+Interpretation: on fact-heavy corpora the current `chonkify` release is both smaller and higher quality than either `LLMLingua` variant.

-## Important Caveat
+## Combined Token Savings

-The current benchmark recovery scorer used for this comparison still relies on proxy metrics for sentence, heading, and numeric-fact recovery. These numbers are therefore useful as relative comparison evidence, but they are not a fully ground-truth semantic measure.
+| Method | source tokens | compressed tokens | weighted token savings |
+|---|---:|---:|---:|
+| `chonkify` | `12802` | `3175` | `75.20%` |
+| `LLMLingua` | `12802` | `4743` | `62.95%` |
+| `LLMLingua2` | `12802` | `4767` | `62.76%` |
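
The "weighted token savings" column follows directly from the token counts in the table: savings is `1 - compressed / source`, pooled over all cases. A minimal editorial check (not part of the package):

```python
# Reproduce the "weighted token savings" column from raw token counts.
def weighted_token_savings(source_tokens: int, compressed_tokens: int) -> float:
    """Percentage of source tokens removed by compression."""
    return 100.0 * (1.0 - compressed_tokens / source_tokens)

# Combined-table figures: 12802 pooled source tokens per method.
for method, compressed in [("chonkify", 3175), ("LLMLingua", 4743), ("LLMLingua2", 4767)]:
    print(f"{method}: {weighted_token_savings(12802, compressed):.2f}%")
```

This reproduces the 75.20% / 62.95% / 62.76% figures above to two decimal places.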

## Positioning

-`chonkify` now packages the actual internal winner path label from that comparison: `cpc_mmr/key_sentence_mmr`. The packaged runtime uses the same CPC/MMR selection logic and benchmark-style key-sentence preparation chain, while still keeping the benchmark scoring harness itself outside the product package.
+These two non-PDF benchmark corridors are the headline public evidence for this handoff and reflect the current `0.3.0` release line under hard budget constraints.

README.md

Lines changed: 37 additions & 19 deletions
@@ -2,49 +2,53 @@

**Extractive document compression that actually preserves what matters.**

-chonkify compresses long documents into tight, information-dense context — built for RAG pipelines, agent memory, and anywhere you need to fit more signal into fewer tokens. It uses a proprietary algorithm that consistently outperforms existing compression methods.
+chonkify compresses long documents into tight, information-dense context for RAG pipelines, agent memory, and any workflow where token budget matters as much as factual recovery. This release focuses on strong factual recovery under hard token budgets across general `txt`/`md` and fact-heavy document workloads.

-By [Thomas "Thom" Heinrich](mailto:th@thomheinrich.de) · [chonkyDB.com](https://chonkydb.com)
+Today, the clearest validated fit is content-dense non-PDF text: quantitative research digests, structured engineering notes, and reasoning traces where downstream models need exact facts more than fluent paraphrase. It remains a general-purpose document compressor, but this is the workload family where the current release is strongest.

-![chonkify-logo](chonkify-logo.png)
+By [Thomas "Thom" Heinrich](mailto:th@thomheinrich.de) · [chonkyDB.com](https://chonkydb.com)

---

## Why chonkify

Most compression tools optimize for token reduction. chonkify optimizes for **information recovery** — the compressed output retains the facts, structure, and reasoning that downstream models actually need.
-In head-to-head multidocument benchmarks against Microsoft's LLMLingua family:
+On the current release corridors against Microsoft's LLMLingua family:

-| Budget | chonkify | LLMLingua | LLMLingua2 |
+| Suite | chonkify | LLMLingua | LLMLingua2 |
|---|---:|---:|---:|
-| 1500 tokens | **0.4302** | 0.2713 | 0.1559 |
-| 1000 tokens | **0.3312** | 0.1804 | 0.1211 |
+| general `txt/md` (`20` cases), `fact_recall_mean` | **0.8833** | 1.0000 | 0.8667 |
+| general `txt/md`, `budget_ok_rate` | **1.0000** | 0.0000 | 0.3500 |
+| fact-heavy quant/reasoning (`22` cases), `fact_recall_mean` | **0.5606** | 0.1061 | 0.1212 |
+| fact-heavy quant/reasoning, `budget_ok_rate` | **1.0000** | 0.2727 | 0.1364 |

-That's **+69% composite information recovery** vs LLMLingua and **+175%** vs LLMLingua2 on average across both budgets, winning 9 out of 10 document-budget cells in the test suite. Full methodology in [BENCHMARKS.md](BENCHMARKS.md).
+Across both suites combined, `chonkify` currently saves **75.20%** of source tokens, versus **62.95%** for `LLMLingua` and **62.76%** for `LLMLingua2`. Full methodology and caveats are in [BENCHMARKS.md](BENCHMARKS.md).

## How It Works

-chonkify embeds document content, scores passages by information density and diversity, and extracts the highest-value subset under your token budget. The selection core ships as compiled extension modules — the benchmarks speak for themselves.
+chonkify builds source-faithful document units, scores them through a strict `768`-dimensional embedding interface, and returns a compact output that respects your token budget. The performance-sensitive implementation ships as compiled extension modules.
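
The density-plus-diversity idea behind this kind of extractive selection can be illustrated with a greedy MMR-style loop. This is an editorial sketch under stated assumptions (toy 2-D vectors, whitespace token counting, a `lam` relevance weight), not the compiled chonkify core:

```python
import math

def cosine(a, b):
    # Plain cosine similarity over two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(units, vectors, doc_vector, token_budget, lam=0.5):
    """Greedily pick units that are relevant to the whole document but
    not redundant with what is already selected, under a hard budget."""
    def n_tokens(u):
        return len(u.split())  # toy whitespace tokenizer (assumption)
    selected, remaining, used = [], list(range(len(units))), 0
    while remaining:
        def score(i):
            relevance = cosine(vectors[i], doc_vector)
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        remaining.remove(best)
        if used + n_tokens(units[best]) <= token_budget:
            selected.append(best)
            used += n_tokens(units[best])
    return [units[i] for i in sorted(selected)]

units = [
    "Revenue rose 18 percent.",
    "Revenue was up 18 percent.",
    "Operating margin reached 27 percent.",
]
vecs = [[1.0, 0.0], [0.98, 0.02], [0.0, 1.0]]  # stand-ins for real embeddings
doc = [0.6, 0.4]                               # whole-document vector
print(mmr_select(units, vecs, doc, token_budget=10))
```

With these toy inputs the loop keeps one revenue sentence and the margin sentence; the near-duplicate revenue phrasing is penalized as redundant and then no longer fits the budget.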

## Install

-chonkify ships as a compiled, platform-specific Python 3.11 wheel.
+This refreshed handoff includes the current native `cp311` wheel matrix for the supported desktop/server targets:

```bash
# Linux x86_64
-pip install ./chonkify-0.2.2-cp311-cp311-manylinux*_x86_64.whl
+pip install ./chonkify-0.3.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl

-# macOS Apple Silicon
-pip install ./chonkify-0.2.2-cp311-cp311-macosx*_arm64.whl
+# Windows amd64
+py -3.11 -m pip install .\chonkify-0.3.0-cp311-cp311-win_amd64.whl

-# macOS Intel
-pip install ./chonkify-0.2.2-cp311-cp311-macosx*_x86_64.whl
+# macOS arm64
+python3.11 -m pip install ./chonkify-0.3.0-cp311-cp311-macosx_11_0_arm64.whl

-# Windows
-pip install .\chonkify-0.2.2-cp311-cp311-win_amd64.whl
+# macOS x86_64
+python3.11 -m pip install ./chonkify-0.3.0-cp311-cp311-macosx_10_9_x86_64.whl
```

+These four wheels were produced by the native GitHub Actions matrix run `23559149680`, and the Linux manylinux artifact was revalidated afterwards with a fresh-venv `ci/wheel_smoke.py` install smoke test.
+

For local CPU/GPU embeddings (no API calls), also install:

```bash
@@ -93,6 +97,20 @@ from chonkify import (
)
```

+Minimal example:
+
+```python
+from chonkify import compress_documents
+
+result = compress_documents(
+    ["Quarterly revenue rose 18%. Operating margin expanded to 27%. Guidance remains unchanged."],
+    target_tokens=24,
+)
+
+print(result.compressed_text)
+print(result.compressed_tokens)
+```
+
## Embedding Backends

### Azure OpenAI (default)
@@ -153,7 +171,7 @@ The optional `--metadata-out` JSON includes:
- Original and compressed token counts
- Compression factor and token reduction percentage
- Selected source blocks with source IDs and ranks
-- Embedding provider and selection strategy used
+- Embedding provider used

If you pass `--query`, it is preserved in metadata for provenance tracking.
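
The two ratio fields follow from the raw token counts by simple arithmetic. A hedged sketch of that relationship (the dict keys here are illustrative, not the actual `--metadata-out` schema):

```python
def compression_stats(original_tokens: int, compressed_tokens: int) -> dict:
    # Key names are illustrative; consult your generated metadata file
    # for the real schema.
    return {
        "compression_factor": original_tokens / compressed_tokens,
        "token_reduction_pct": 100.0 * (1.0 - compressed_tokens / original_tokens),
    }

stats = compression_stats(12802, 3175)  # combined totals from BENCHMARKS.md
print(f"{stats['compression_factor']:.2f}x smaller, {stats['token_reduction_pct']:.2f}% fewer tokens")
```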

@@ -166,4 +184,4 @@ For commercial licensing, production access, or integration partnerships:

## Benchmark Details

-See [BENCHMARKS.md] for the full multidocument comparison methodology and per-document results.
+See [BENCHMARKS.md](BENCHMARKS.md) for the current release benchmark methodology and numbers.
4 binary files changed (229 KB, 209 KB, one of unlisted size, 195 KB); binary content not shown.
