# chonkify Benchmark Snapshot vs LLMLingua and LLMLingua2
This handoff packages the current non-PDF release evidence for `chonkify`.
## Suite A: General `txt/md` Compression (`20` cases)
Interpretation: `LLMLingua` v1 can keep more raw facts on the very smallest texts only by violating the requested token budget and, on aggregate, expanding the input. `chonkify` is the budget-valid line on this corridor.
Interpretation: on fact-heavy corpora the current `chonkify` release is both smaller and higher quality than both `LLMLingua` variants.
## Combined Token Savings
`chonkify` now packages the internal winner path from that comparison: `cpc_mmr/key_sentence_mmr`. The packaged runtime uses the same CPC/MMR selection logic and the same benchmark-style key-sentence preparation chain, while keeping the benchmark scoring harness itself outside the product package.
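The packaged selection core ships compiled, but the general maximal marginal relevance (MMR) pattern behind a `key_sentence_mmr`-style path can be sketched. This is a minimal illustration over NumPy arrays; the function name, signature, and `lam` trade-off parameter are this sketch's own assumptions, not chonkify's API:

```python
import numpy as np

def mmr_select(embeddings: np.ndarray, relevance: np.ndarray,
               k: int, lam: float = 0.7) -> list[int]:
    """Greedy MMR: balance per-sentence relevance against redundancy
    with already-selected sentences, returning selected indices."""
    # Normalize rows so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected: list[int] = []
    candidates = list(range(len(emb)))
    while candidates and len(selected) < k:
        if not selected:
            best = max(candidates, key=lambda i: relevance[i])
        else:
            chosen = emb[selected]
            def score(i: int) -> float:
                redundancy = float(np.max(chosen @ emb[i]))
                return lam * relevance[i] - (1 - lam) * redundancy
            best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0.5`, a near-duplicate of an already-selected sentence loses to a less relevant but novel one, which is the behavior an extractive compressor wants under a tight budget.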
These two non-PDF benchmark corridors are the headline public evidence for this handoff and reflect the current `0.3.0` release line under hard budget constraints.
---

# README.md

**Extractive document compression that actually preserves what matters.**
chonkify compresses long documents into tight, information-dense context for RAG pipelines, agent memory, and any workflow where token budget matters as much as factual recovery. This release focuses on strong factual recovery under hard token budgets across general `txt`/`md` and fact-heavy document workloads.
Today, the clearest validated fit is content-dense non-PDF text: quantitative research digests, structured engineering notes, and reasoning traces where downstream models need exact facts more than fluent paraphrase. It remains a general-purpose document compressor, but this is the workload family where the current release is strongest.
By [Thomas "Thom" Heinrich](mailto:th@thomheinrich.de) · [chonkyDB.com](https://chonkydb.com)
---
## Why chonkify
Most compression tools optimize for token reduction. chonkify optimizes for **information recovery** — the compressed output retains the facts, structure, and reasoning that downstream models actually need.
On the current release corridors against Microsoft's LLMLingua family:
| Suite | chonkify | LLMLingua | LLMLingua2 |
|---|---:|---:|---:|
| general `txt/md` (`20` cases), `fact_recall_mean` | **0.8833** | 1.0000 | 0.8667 |
| general `txt/md`, `budget_ok_rate` | **1.0000** | 0.0000 | 0.3500 |
Across both suites combined, `chonkify` currently saves **75.20%** of source tokens, versus **62.95%** for `LLMLingua` and **62.76%** for `LLMLingua2`. Full methodology and caveats are in [BENCHMARKS.md](BENCHMARKS.md).
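The two aggregate styles above (a per-case rate and a pooled ratio) are straightforward to compute. A sketch with illustrative dict keys that are this example's own, not the benchmark harness's actual schema:

```python
def budget_ok_rate(cases: list[dict]) -> float:
    """Fraction of cases whose compressed output stayed within its token budget."""
    ok = sum(1 for c in cases if c["compressed_tokens"] <= c["budget"])
    return ok / len(cases)

def combined_token_savings(cases: list[dict]) -> float:
    """Pooled savings: 1 - (total compressed tokens / total original tokens)."""
    original = sum(c["original_tokens"] for c in cases)
    compressed = sum(c["compressed_tokens"] for c in cases)
    return 1.0 - compressed / original

# Made-up example numbers, not the benchmark data:
cases = [
    {"original_tokens": 4000, "compressed_tokens": 900, "budget": 1000},
    {"original_tokens": 2000, "compressed_tokens": 1100, "budget": 1000},
]
# budget_ok_rate(cases) -> 0.5; combined_token_savings(cases) -> ~0.667
```

Note that the pooled ratio weights long documents more heavily than a mean of per-case ratios would, which is why a single combined savings figure can be quoted across suites.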
## How It Works
chonkify builds source-faithful document units, scores them through a strict `768`-dimensional embedding interface, and returns a compact output that respects your token budget. The performance-sensitive selection core ships as compiled extension modules.
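To illustrate what a strict fixed-width embedding interface implies for integrators, here is a toy backend sketch. The `EmbeddingBackend` protocol and `embed` signature are assumptions for this example, not chonkify's published interface:

```python
from typing import Protocol

import numpy as np

EMBED_DIM = 768  # the fixed embedding width named in the text above

class EmbeddingBackend(Protocol):
    def embed(self, texts: list[str]) -> np.ndarray:
        """Return a (len(texts), EMBED_DIM) float array, one row per text."""
        ...

class HashingBackend:
    """Toy offline backend: deterministic hashed bag-of-words vectors,
    unit-normalized so downstream cosine scoring works."""
    def embed(self, texts: list[str]) -> np.ndarray:
        out = np.zeros((len(texts), EMBED_DIM), dtype=np.float32)
        for row, text in enumerate(texts):
            for token in text.lower().split():
                out[row, hash(token) % EMBED_DIM] += 1.0
        norms = np.linalg.norm(out, axis=1, keepdims=True)
        return out / np.maximum(norms, 1e-9)
```

A real backend would call a local model or embedding API instead of hashing; the point is only that every provider must emit the same fixed `768`-wide, row-per-text shape.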
## Install
This refreshed handoff includes the current native `cp311` wheel matrix for the supported desktop/server targets. These four wheels were produced by the native GitHub Actions matrix run `23559149680`, and the Linux manylinux artifact was revalidated afterwards with a fresh-venv `ci/wheel_smoke.py` install smoke test.
For local CPU/GPU embeddings (no API calls), also install:
```bash
...
```

```python
from chonkify import (
    ...
)
```
Minimal example:
```python
from chonkify import compress_documents
result = compress_documents(
    ["Quarterly revenue rose 18%. Operating margin expanded to 27%. Guidance remains unchanged."],
    target_tokens=24,
)
print(result.compressed_text)
print(result.compressed_tokens)
```
## Embedding Backends
### Azure OpenAI (default)

The optional `--metadata-out` JSON includes:
- Original and compressed token counts
- Compression factor and token reduction percentage
- Selected source blocks with source IDs and ranks
- Embedding provider used
If you pass `--query`, it is preserved in metadata for provenance tracking.

For commercial licensing, production access, or integration partnerships:
## Benchmark Details
See [BENCHMARKS.md](BENCHMARKS.md) for the current release benchmark methodology and numbers.