Proof

Proof measures whether personal context actually improves LLM response quality and whether that quality holds after compression.

What it does

For each query in a benchmark file, Proof runs three context injection strategies and scores each response with Claude acting as the judge:

| Strategy | What's injected |
|---|---|
| `none` | No context (bare system prompt) |
| `full` | Entire context file injected verbatim |
| `compressed` | Context compressed to ~30% size by Claude Haiku, then injected |
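
The three strategies differ only in what ends up in the system prompt. The following is a minimal sketch of that dispatch, not the code in `src/strategies.py`: the function names, the `BASE_PROMPT` text, and the prompt layout are all assumptions. The compressor is passed in as a callable so the sketch runs offline; in Proof that role is played by a Claude Haiku call.

```python
from typing import Callable

# Hypothetical bare system prompt; the real wording is not given in this README.
BASE_PROMPT = "You are a helpful assistant."


def build_system_prompt(
    strategy: str,
    context: str,
    compress: Callable[[str], str],
) -> str:
    """Return the system prompt for one of the three strategies."""
    if strategy == "none":
        # Bare system prompt, no personal context.
        return BASE_PROMPT
    if strategy == "full":
        # Entire context file injected verbatim.
        return f"{BASE_PROMPT}\n\nUser context:\n{context}"
    if strategy == "compressed":
        # Context is shortened first, then injected the same way.
        return f"{BASE_PROMPT}\n\nUser context:\n{compress(context)}"
    raise ValueError(f"unknown strategy: {strategy!r}")


def truncate_to_30_percent(text: str) -> str:
    """Offline stand-in for the LLM compressor: keep the first ~30% of characters."""
    return text[: max(1, int(len(text) * 0.3))]
```

Naive truncation is used here only to keep the example self-contained; it is not a substitute for summarization by a model.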

Each response is scored 1–5 on three dimensions:

Personalization — did the response draw on specific user details?
Specificity — are recommendations concrete and actionable?
Helpfulness — does it directly address the query?
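
As a rough illustration of how three 1–5 scores can be combined into a single number, here is a sketch that validates the range and takes the unweighted mean. The averages in the Real results table are consistent with an unweighted mean, but that is an inference from the numbers, not something confirmed from `src/judge.py`; the function and key names are hypothetical.

```python
from statistics import mean

# Dimension names taken from the list above; the dict keys are an assumption.
DIMENSIONS = ("personalization", "specificity", "helpfulness")


def validate_scores(scores: dict) -> dict:
    """Check that every dimension is present and within the 1-5 range."""
    for dim in DIMENSIONS:
        value = scores[dim]
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} score {value} is outside 1-5")
    return scores


def overall(scores: dict) -> float:
    """Unweighted mean of the three dimension scores (assumed aggregation)."""
    validate_scores(scores)
    return mean(scores[dim] for dim in DIMENSIONS)
```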

The summary shows the quality delta across strategies alongside time to first token (TTFT) and token cost, answering the core question: does compression preserve quality?

Quickstart

1. Clone and install

git clone https://github.com/ThonyAnt/Pacific-Coding-Challenge.git
cd Pacific-Coding-Challenge
pip install -r requirements.txt

2. Add your API key

cp .env.example .env
# open .env and paste your Anthropic API key

3. Run

Browser UI (recommended):

python -m streamlit run app.py

CLI:

python eval.py fixtures/example_context.md fixtures/example_benchmark.yaml

CLI options

Arguments:
  context_file            Personal context file (.md or .txt)
  benchmark_file          Benchmark queries (.yaml)

Options:
  --strategies            Comma-separated: none,full,compressed   [default: none,full,compressed]
  --model                 Claude model for responses              [default: claude-haiku-4-5-20251001]
  --max-tokens            Max tokens per response                 [default: 400]
  --output                Save full results to JSON
  --show-responses        Print full response text per query
  --threshold             CI gate: exit 1 if best strategy score < threshold
  --baseline              Compare against saved JSON baseline, exit 1 on regression
  --regression-tolerance  Max allowed score drop vs baseline      [default: 0.25]
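
For readers who want to see how this interface maps onto code, the following `argparse` sketch declares the same arguments and defaults. It mirrors the help text above and is not copied from `eval.py`; the value types (`int`, `float`) and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented CLI surface; types are assumed from the help text."""
    parser = argparse.ArgumentParser(prog="eval.py")
    parser.add_argument("context_file", help="Personal context file (.md or .txt)")
    parser.add_argument("benchmark_file", help="Benchmark queries (.yaml)")
    parser.add_argument("--strategies", default="none,full,compressed",
                        help="Comma-separated: none,full,compressed")
    parser.add_argument("--model", default="claude-haiku-4-5-20251001",
                        help="Claude model for responses")
    parser.add_argument("--max-tokens", type=int, default=400,
                        help="Max tokens per response")
    parser.add_argument("--output", help="Save full results to JSON")
    parser.add_argument("--show-responses", action="store_true",
                        help="Print full response text per query")
    parser.add_argument("--threshold", type=float,
                        help="CI gate: exit 1 if best strategy score < threshold")
    parser.add_argument("--baseline",
                        help="Compare against saved JSON baseline")
    parser.add_argument("--regression-tolerance", type=float, default=0.25,
                        help="Max allowed score drop vs baseline")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # The comma-separated flag becomes a list of strategy names.
    print(args.strategies.split(","))
```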

CI/CD integration

Proof ships with a GitHub Actions workflow that runs on every PR touching context logic:

# Save a baseline after a good run
python eval.py fixtures/ci_context.md fixtures/ci_benchmark.yaml --output baseline.json

# Gate on score threshold + regression check
python eval.py fixtures/ci_context.md fixtures/ci_benchmark.yaml \
  --threshold 3.5 \
  --baseline baseline.json

The workflow (.github/workflows/eval.yml) caches the baseline between runs and uploads results as artifacts. See the workflow file for full configuration.
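
The two gates described by `--threshold` and `--baseline` can be sketched as a single function that returns an exit code. This is an illustration under stated assumptions: it compares one number (the best strategy's overall score) against both the threshold and the baseline, whereas the real tool may compare per strategy or per dimension. The `gate` function and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def gate(
    best_score: float,
    threshold: Optional[float] = None,
    baseline_score: Optional[float] = None,
    tolerance: float = 0.25,
) -> Tuple[int, List[str]]:
    """Return (exit_code, failure_messages) for a CI run."""
    failures = []
    # Threshold gate: fail if the best strategy scores below the floor.
    if threshold is not None and best_score < threshold:
        failures.append(f"score {best_score:.2f} is below threshold {threshold:.2f}")
    # Regression gate: fail if the score dropped by more than the tolerance.
    if baseline_score is not None and baseline_score - best_score > tolerance:
        failures.append(
            f"score dropped {baseline_score - best_score:.2f} vs baseline "
            f"(tolerance {tolerance:.2f})"
        )
    return (1 if failures else 0, failures)
```

In a script, the exit code would typically be passed to `sys.exit` so that the CI job fails when either gate trips.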

Benchmark format

queries:
  - id: q1
    category: career
    query: "What projects should I focus on for my internship applications?"
  - id: q2
    category: technical
    query: "What should I learn next as an engineer?"
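
A benchmark file in this shape can be checked before any API calls are made. The sketch below validates an already-parsed mapping (for example, the result of `yaml.safe_load` from PyYAML). It assumes that `id`, `category`, and `query` are all required and that ids must be unique; neither rule is stated in this README, and `validate_benchmark` is a hypothetical helper.

```python
REQUIRED_KEYS = ("id", "category", "query")  # assumed to be mandatory


def validate_benchmark(data: dict) -> list:
    """Return the list of queries, raising ValueError on malformed input."""
    queries = data.get("queries")
    if not isinstance(queries, list) or not queries:
        raise ValueError("benchmark must contain a non-empty 'queries' list")
    seen_ids = set()
    for index, entry in enumerate(queries):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"query #{index} is missing keys: {missing}")
        if entry["id"] in seen_ids:
            raise ValueError(f"duplicate query id: {entry['id']!r}")
        seen_ids.add(entry["id"])
    return queries


# The same structure as the YAML example above, written as a Python dict.
example = {
    "queries": [
        {"id": "q1", "category": "career",
         "query": "What projects should I focus on for my internship applications?"},
        {"id": "q2", "category": "technical",
         "query": "What should I learn next as an engineer?"},
    ]
}
```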

Real results

Scores from a 7-query run against a real personal context file:

| Strategy | Overall | Personalization | Specificity | Helpfulness | Avg TTFT |
|---|---|---|---|---|---|
| none | 2.29 | 1.00 | 2.43 | 3.43 | 585ms |
| full | 4.38 | 4.71 | 3.86 | 4.57 | 938ms |
| compressed | 4.57 | 4.86 | 4.43 | 4.43 | 575ms |

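Compressed context outscored full context overall (4.57 vs 4.38) while using 34% fewer tokens and cutting average TTFT from 938ms to 575ms, although full context kept a small edge on Helpfulness (4.57 vs 4.43). On this 7-query sample, that suggests compression concentrates the useful signal rather than losing it.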
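
To make the deltas explicit, this short sketch recomputes them from the Overall and Avg TTFT columns of the table above. The numbers are copied from the table; the `RESULTS` layout and the `delta` helper are illustrative only and do not reflect Proof's JSON output format.

```python
# Values copied from the results table above.
RESULTS = {
    "none": {"overall": 2.29, "ttft_ms": 585},
    "full": {"overall": 4.38, "ttft_ms": 938},
    "compressed": {"overall": 4.57, "ttft_ms": 575},
}


def delta(strategy: str, reference: str, key: str) -> float:
    """Difference strategy - reference for one metric, rounded to 2 decimals."""
    return round(RESULTS[strategy][key] - RESULTS[reference][key], 2)


print(delta("full", "none", "overall"))        # 2.09: gain from adding context
print(delta("compressed", "full", "overall"))  # 0.19: gain from compressing it
print(delta("compressed", "full", "ttft_ms"))  # -363: milliseconds saved
```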

File structure

.
├── eval.py                        # CLI entry point
├── app.py                         # Streamlit browser UI
├── src/
│   ├── strategies.py              # Context strategies (none / full / compressed)
│   ├── judge.py                   # Claude-as-judge scoring
│   └── report.py                  # Rich terminal output
├── fixtures/
│   ├── example_context.md         # Sample personal context
│   ├── example_benchmark.yaml     # Sample benchmark queries
│   ├── ci_context.md              # CI test fixture
│   └── ci_benchmark.yaml
├── docs/
│   └── onepager.pdf               # Project write-up
├── .github/workflows/eval.yml     # CI/CD integration
├── .env.example
└── requirements.txt
