Commit 859576f: initial upload (9 files changed, 555 additions, 0 deletions)

BENCHMARKS.md

# chonkify Benchmark Context vs LLMLingua and LLMLingua2

This file summarizes the most recent multidocument comparison from the internal document-compression benchmark suite. It is intended to be self-contained inside the minimal `chonkify` handoff folder.

## Suite Scope

- Documents: `5`
- Budgets: `1500`, `1000`
- Comparison metric for the percentage deltas below: mean `composite_info_recovery`
- Best internal baseline on this suite: `cpc_mmr/key_sentence_mmr`

## Composite Recovery Comparison

| Budget | Internal best | LLMLingua | Internal delta vs LLMLingua | LLMLingua2 | Internal delta vs LLMLingua2 |
|---|---:|---:|---:|---:|---:|
| `1500` | `0.4302` | `0.2713` | `+58.61%` | `0.1559` | `+175.99%` |
| `1000` | `0.3312` | `0.1804` | `+83.54%` | `0.1211` | `+173.49%` |
| Mean across both budgets | `0.3807` | `0.2258` | `+68.57%` | `0.1385` | `+174.90%` |
## Win Pattern

- Budget `1500`: the internal `cpc_mmr/key_sentence_mmr` method wins `4/5` documents.
- Budget `1000`: the internal `cpc_mmr/key_sentence_mmr` method wins `3/5` documents.
- Across the `10` document-budget cells in this suite, internal methods win `9/10`.

## Important Caveat

The current benchmark recovery scorer used for this comparison still relies on proxy metrics for sentence, heading, and numeric-fact recovery. These numbers are therefore useful as relative comparison evidence, but they are not a fully ground-truth semantic measure.

## Positioning

`chonkify` now packages the actual internal winner path label from that comparison: `cpc_mmr/key_sentence_mmr`. The packaged runtime uses the same CPC/MMR selection logic and benchmark-style key-sentence preparation chain, while still keeping the benchmark scoring harness itself outside the product package.

LICENSE.md

# Chonkify Evaluation-Only Proprietary License

Copyright (c) Thomas "Thom" Heinrich 2026
Contact: th@chonkydb.com

All rights reserved.

## 1. Limited Grant

Subject to full compliance with this license, you are granted a revocable, non-exclusive, non-transferable, non-sublicensable right to install and use `chonkify` solely for internal, non-production, non-commercial evaluation, testing, and review purposes.

## 2. No Ownership Transfer

`chonkify`, including all source code, documentation, interfaces, algorithms, packaging, and related materials, is licensed and not sold. No ownership rights are transferred.

## 3. Strict Prohibitions

Without prior written permission from Thomas "Thom" Heinrich, you may not:

- use the software in production or for any revenue-generating activity;
- distribute, share, publish, sublicense, lease, lend, sell, or otherwise provide the software to any third party;
- modify, adapt, translate, create derivative works from, or merge the software into other software;
- reverse engineer, decompile, disassemble, or otherwise attempt to derive non-public implementation details, except where such restriction is prohibited by mandatory law;
- remove or obscure copyright, attribution, branding, or license notices;
- use the software to build or improve a competing product or service.

## 4. Compliance and Scope

Any use outside the narrow evaluation scope in Section 1 is unlicensed and prohibited. This includes hosted use, customer delivery, internal production pilots, and operational deployment.

## 5. Termination

This license terminates automatically and immediately if you breach any term. Upon termination, you must stop using the software and destroy all copies in your possession or control.

## 6. Warranty Disclaimer

THE SOFTWARE IS PROVIDED "AS IS" AND "AS AVAILABLE", WITHOUT WARRANTIES OF ANY KIND, WHETHER EXPRESS, IMPLIED, OR STATUTORY, INCLUDING WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE, NON-INFRINGEMENT, OR PERFORMANCE.

## 7. Liability Limitation

TO THE MAXIMUM EXTENT PERMITTED BY LAW, THOMAS "THOM" HEINRICH SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, CONSEQUENTIAL, OR OTHER DAMAGES ARISING FROM OR RELATING TO THE SOFTWARE OR THIS LICENSE.

## 8. Reservation of Rights

All rights not expressly granted are reserved by Thomas "Thom" Heinrich.

## 9. Contact

Licensing, commercial access, and other permissions require prior written approval:

- Thomas "Thom" Heinrich
- th@chonkydb.com

README.md

# chonkify

**Extractive document compression that actually preserves what matters.**

chonkify compresses long documents into tight, information-dense context — built for RAG pipelines, agent memory, and anywhere you need to fit more signal into fewer tokens. It uses a proprietary algorithm that consistently outperforms existing compression methods.

By [Thomas "Thom" Heinrich](mailto:th@thomheinrich.de) · [chonkyDB.com](https://chonkydb.com)

![chonkify-logo](chonkify.png)

---

## Why chonkify

Most compression tools optimize for token reduction. chonkify optimizes for **information recovery** — the compressed output retains the facts, structure, and reasoning that downstream models actually need.

In head-to-head multidocument benchmarks against Microsoft's LLMLingua family (mean composite information recovery; higher is better):

| Budget | chonkify | LLMLingua | LLMLingua2 |
|---|---:|---:|---:|
| 1500 tokens | **0.4302** | 0.2713 | 0.1559 |
| 1000 tokens | **0.3312** | 0.1804 | 0.1211 |

That's **+69% composite information recovery** vs LLMLingua and **+175%** vs LLMLingua2 on average across both budgets, winning 9 out of 10 document-budget cells in the test suite. Full methodology in [BENCHMARKS.md](BENCHMARKS.md).

## How It Works

chonkify embeds document content, scores passages by information density and diversity, and extracts the highest-value subset under your token budget. The selection core ships as compiled extension modules — the benchmarks speak for themselves.
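The selection core itself is proprietary and ships compiled, so what follows is only a rough sketch of the general idea described above: greedy MMR-style selection under a token budget. The function names, the centroid-as-relevance scoring, and the λ weighting are illustrative assumptions, not chonkify's actual algorithm.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(passages, budget_tokens, lam=0.7):
    """Greedily pick passages under a token budget, trading off
    relevance (here: centrality to the document) against redundancy."""
    dim = len(passages[0]["vec"])
    centroid = [sum(p["vec"][i] for p in passages) / len(passages) for i in range(dim)]
    selected, used = [], 0
    candidates = list(passages)
    while candidates:
        def mmr(p):
            rel = cosine(p["vec"], centroid)
            red = max((cosine(p["vec"], s["vec"]) for s in selected), default=0.0)
            return lam * rel - (1.0 - lam) * red
        best = max(candidates, key=mmr)
        candidates.remove(best)
        if used + best["tokens"] <= budget_tokens:
            selected.append(best)
            used += best["tokens"]
    return selected

passages = [
    {"id": "a", "vec": [1.0, 0.0], "tokens": 400},
    {"id": "b", "vec": [0.9, 0.1], "tokens": 400},  # near-duplicate of "a"
    {"id": "c", "vec": [0.0, 1.0], "tokens": 400},
]
picked = [p["id"] for p in mmr_select(passages, budget_tokens=800)]
print(picked)  # → ['b', 'c']
```

Once one copy of the near-duplicate pair is selected, the redundancy penalty pushes the second copy below the dissimilar passage, which is the density-vs-diversity trade-off in miniature.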
## Install

chonkify ships as a compiled, platform-specific Python 3.11 wheel.

```bash
# Linux x86_64
pip install ./chonkify-0.2.2-cp311-cp311-manylinux*_x86_64.whl

# macOS Apple Silicon
pip install ./chonkify-0.2.2-cp311-cp311-macosx*_arm64.whl

# macOS Intel
pip install ./chonkify-0.2.2-cp311-cp311-macosx*_x86_64.whl

# Windows
pip install .\chonkify-0.2.2-cp311-cp311-win_amd64.whl
```

For local CPU/GPU embeddings (no API calls), also install:

```bash
pip install sentence-transformers
```

Or use the optional extra: `pip install chonkify[local]`

## Quick Start

### CLI

```bash
chonkify compress ./paper.pdf \
  --target-tokens 1200 \
  --output ./paper_compressed.txt \
  --metadata-out ./paper_meta.json
```

Multiple documents in one pass:

```bash
chonkify compress ./brief.md ./appendix.pdf \
  --target-tokens 1400 \
  --output ./bundle.txt
```

Pipe from stdin:

```bash
cat ./notes.txt | chonkify compress - --target-tokens 900 --output -
```

### Python API

```python
from chonkify import compress_documents

# With additional control over embedding providers:
from chonkify import (
    LocalEmbeddingConfig,
    LocalSentenceTransformerEmbeddingProvider,
    OpenAIEmbeddingConfig,
    OpenAIEmbeddingProvider,
    compress_documents,
)
```
## Embedding Backends

### Azure OpenAI (default)

```bash
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="<secret>"
export AZURE_OPENAI_API_VERSION="2024-10-21"
export CHONKIFY_AZURE_EMBEDDING_DEPLOYMENT="<deployment-name>"
```

### OpenAI

```bash
export OPENAI_API_KEY="<secret>"
export CHONKIFY_OPENAI_EMBEDDING_MODEL="text-embedding-3-large"
```

```bash
chonkify compress ./paper.pdf --embedding-backend openai --target-tokens 1200
```

### OpenAI-Compatible Endpoints

For providers like Together, Fireworks, or self-hosted APIs:

```bash
export OPENAI_API_KEY="<key>"
export CHONKIFY_OPENAI_BASE_URL="https://<provider>/v1"
export CHONKIFY_OPENAI_EMBEDDING_MODEL="<model-id>"
```

```bash
chonkify compress ./paper.pdf --embedding-backend openai-compatible --target-tokens 1200
```

If your endpoint rejects the `dimensions` parameter, add `--openai-omit-dimensions-parameter`. chonkify still validates 768-dimensional output.

### Local (SentenceTransformers)

Fully offline after first model download. Default model: `sentence-transformers/all-mpnet-base-v2`.

```bash
chonkify compress ./paper.pdf \
  --embedding-backend local \
  --local-device cuda \
  --target-tokens 1200
```

Device options: `cpu`, `cuda`, `cuda:0`, `mps`.

Validated with `sentence-transformers 5.1.0` and `torch 2.8.0+cu128` on an NVIDIA RTX 3090. Cold-cache run: ~13s. Warm-cache run: ~6s. Model footprint: ~419 MB. With `HF_HUB_OFFLINE=1`, the local backend runs fully air-gapped once cached.
## Output Metadata

The optional `--metadata-out` JSON includes:

- Original and compressed token counts
- Compression factor and token reduction percentage
- Selected source blocks with source IDs and ranks
- Embedding provider and selection strategy used

If you pass `--query`, it is preserved in metadata for provenance tracking.
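For illustration only — the exact field names are not documented in this handoff, so the keys below are assumptions, not chonkify's actual schema — a metadata file covering the bullets above might look roughly like:

```json
{
  "original_tokens": 9842,
  "compressed_tokens": 1187,
  "compression_factor": 8.29,
  "token_reduction_pct": 87.9,
  "embedding_provider": "local:sentence-transformers/all-mpnet-base-v2",
  "selection_strategy": "cpc_mmr/key_sentence_mmr",
  "query": null,
  "selected_blocks": [
    {"source_id": "paper.pdf", "rank": 1, "tokens": 210}
  ]
}
```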
## License

chonkify is proprietary software. The current release is licensed for **evaluation, testing, and review only** — not for production use. See [LICENSE.md](LICENSE.md) for full terms.

For commercial licensing, production access, or integration partnerships:
**th@chonkydb.com**

## Benchmark Details

See [BENCHMARKS.md](BENCHMARKS.md) for the full multidocument comparison methodology and per-document results.