Proof measures whether personal context actually improves LLM response quality and whether that quality holds after compression.
For each query in a benchmark file, Proof runs three context injection strategies and scores each response with Claude acting as the judge:
| Strategy | What's injected |
|---|---|
| none | No context; bare system prompt |
| full | Entire context file injected verbatim |
| compressed | Context compressed to ~30% of its size by Claude Haiku, then injected |

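The three strategies can be sketched as one function that builds the system prompt. This is a minimal illustration, not the code in `src/strategies.py`: `build_system_prompt`, `BASE_PROMPT`, and the `compress` callable are hypothetical names, and the compressor is passed in so the sketch runs without an API call (in Proof that step is performed by Claude Haiku).

```python
from typing import Callable

# Hypothetical base prompt; the wording Proof actually uses may differ.
BASE_PROMPT = "You are a helpful assistant."


def build_system_prompt(
    strategy: str,
    context: str,
    compress: Callable[[str], str],
) -> str:
    """Return the system prompt for one context injection strategy."""
    if strategy == "none":
        # Bare system prompt, no personal context.
        return BASE_PROMPT
    if strategy == "full":
        # Entire context file injected verbatim.
        return f"{BASE_PROMPT}\n\nUser context:\n{context}"
    if strategy == "compressed":
        # Context is shortened first, then injected.
        return f"{BASE_PROMPT}\n\nUser context:\n{compress(context)}"
    raise ValueError(f"unknown strategy: {strategy!r}")


def truncate_to_30_percent(text: str) -> str:
    """Placeholder compressor: keeps the first ~30% of the characters.

    Proof uses an LLM for this step; truncation only stands in for it here.
    """
    return text[: max(1, len(text) * 3 // 10)]
```

Passing the compressor as an argument keeps the strategy logic testable offline; a real run would swap in a function that calls the model.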
Each response is scored 1–5 on three dimensions:
- Personalization — did the response draw on specific user details?
- Specificity — are recommendations concrete and actionable?
- Helpfulness — does it directly address the query?
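A sketch of how the three dimension scores might be combined. It assumes the overall score is the unweighted mean of the three dimensions, which is consistent with the reported results (for example, (1.00 + 2.43 + 3.43) / 3 ≈ 2.29 for `none`) but is not confirmed by the source. `JudgeScore` and `strategy_average` are hypothetical names, not taken from `src/judge.py`.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass(frozen=True)
class JudgeScore:
    """One judged response; each dimension is an integer from 1 to 5."""

    personalization: int
    specificity: int
    helpfulness: int

    def __post_init__(self) -> None:
        # Reject scores outside the 1-5 scale.
        for name in ("personalization", "specificity", "helpfulness"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    @property
    def overall(self) -> float:
        # Assumption: overall is the unweighted mean of the three dimensions.
        return mean((self.personalization, self.specificity, self.helpfulness))


def strategy_average(scores: List[JudgeScore]) -> float:
    """Average the overall score across all queries for one strategy."""
    return mean(s.overall for s in scores)
```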
The summary shows the quality delta across strategies alongside time to first token (TTFT) and token cost, answering the core question: does compression preserve quality?

    git clone https://github.com/ThonyAnt/Pacific-Coding-Challenge.git
    cd Pacific-Coding-Challenge
    pip install -r requirements.txt

    cp .env.example .env
    # open .env and paste your Anthropic API key

Browser UI (recommended):

    python -m streamlit run app.py

CLI:

    python eval.py fixtures/example_context.md fixtures/example_benchmark.yaml

    Arguments:
      context_file             Personal context file (.md or .txt)
      benchmark_file           Benchmark queries (.yaml)

    Options:
      --strategies             Comma-separated: none,full,compressed [default: none,full,compressed]
      --model                  Claude model for responses [default: claude-haiku-4-5-20251001]
      --max-tokens             Max tokens per response [default: 400]
      --output                 Save full results to JSON
      --show-responses         Print full response text per query
      --threshold              CI gate: exit 1 if best strategy score < threshold
      --baseline               Compare against saved JSON baseline, exit 1 on regression
      --regression-tolerance   Max allowed score drop vs baseline [default: 0.25]

Proof ships with a GitHub Actions workflow that runs on every PR touching context logic:

    # Save a baseline after a good run
    python eval.py fixtures/ci_context.md fixtures/ci_benchmark.yaml --output baseline.json

    # Gate on score threshold + regression check
    python eval.py fixtures/ci_context.md fixtures/ci_benchmark.yaml \
      --threshold 3.5 \
      --baseline baseline.json

The workflow (.github/workflows/eval.yml) caches the baseline between runs and uploads results as artifacts. See the workflow file for full configuration.
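
The checks behind `--threshold`, `--baseline`, and `--regression-tolerance` can be sketched as one function that returns an exit code. This is an illustration under assumptions, not the logic in `eval.py`: the name `gate`, the score mappings keyed by strategy, and the choice to compare each strategy against its own baseline score are all hypothetical.

```python
from typing import Mapping, Optional


def gate(
    current: Mapping[str, float],
    threshold: Optional[float] = None,
    baseline: Optional[Mapping[str, float]] = None,
    regression_tolerance: float = 0.25,
) -> int:
    """Return a process exit code: 0 to pass the CI gate, 1 to fail it."""
    # Threshold check: the best-scoring strategy must reach the threshold.
    if threshold is not None and max(current.values()) < threshold:
        return 1

    # Regression check (assumed per strategy): fail if any score dropped
    # by more than the tolerance relative to the saved baseline.
    if baseline is not None:
        for strategy, old_score in baseline.items():
            new_score = current.get(strategy)
            if new_score is not None and old_score - new_score > regression_tolerance:
                return 1

    return 0
```

A caller would pass the result to `sys.exit` so the CI job fails on a non-zero code.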

A benchmark file is a YAML list of queries:

    queries:
      - id: q1
        category: career
        query: "What projects should I focus on for my internship applications?"
      - id: q2
        category: technical
        query: "What should I learn next as an engineer?"

Scores from a 7-query run against a real personal context file:

| Strategy | Overall | Personalization | Specificity | Helpfulness | Avg TTFT |
|---|---|---|---|---|---|
| none | 2.29 | 1.00 | 2.43 | 3.43 | 585ms |
| full | 4.38 | 4.71 | 3.86 | 4.57 | 938ms |
| compressed | 4.57 | 4.86 | 4.43 | 4.43 | 575ms |
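
Compressed context outscored full context overall (4.57 vs. 4.38) while using 34% fewer tokens and reaching first token faster (575 ms vs. 938 ms), though full context scored slightly higher on helpfulness (4.57 vs. 4.43). On this small 7-query run, that suggests compression concentrates the useful signal rather than losing it.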
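
The deltas behind that comparison can be recomputed from the table. The sketch below copies the overall scores and average TTFT values from the results above; the `results` layout and the `delta_vs` helper are hypothetical and only illustrate the arithmetic.

```python
# Figures copied from the results table above.
results = {
    "none": {"overall": 2.29, "ttft_ms": 585},
    "full": {"overall": 4.38, "ttft_ms": 938},
    "compressed": {"overall": 4.57, "ttft_ms": 575},
}


def delta_vs(reference: str, strategy: str) -> dict:
    """Compare one strategy against a reference strategy."""
    ref, cur = results[reference], results[strategy]
    return {
        # Absolute change in overall score (positive means better).
        "overall_delta": round(cur["overall"] - ref["overall"], 2),
        # Relative change in TTFT (negative means faster).
        "ttft_change_pct": round(
            100 * (cur["ttft_ms"] - ref["ttft_ms"]) / ref["ttft_ms"], 1
        ),
    }


print(delta_vs("full", "compressed"))
# {'overall_delta': 0.19, 'ttft_change_pct': -38.7}
```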

    .
    ├── eval.py                      # CLI entry point
    ├── app.py                       # Streamlit browser UI
    ├── src/
    │   ├── strategies.py            # Context strategies (none / full / compressed)
    │   ├── judge.py                 # Claude-as-judge scoring
    │   └── report.py                # Rich terminal output
    ├── fixtures/
    │   ├── example_context.md       # Sample personal context
    │   ├── example_benchmark.yaml   # Sample benchmark queries
    │   ├── ci_context.md            # CI test fixture
    │   └── ci_benchmark.yaml
    ├── docs/
    │   └── onepager.pdf             # Project write-up
    ├── .github/workflows/eval.yml   # CI/CD integration
    ├── .env.example
    └── requirements.txt