ThonyAnt/Pacific-Coding-Challenge

Proof

Proof measures whether personal context actually improves LLM response quality and whether that quality holds after compression.

What it does

For each query in a benchmark file, Proof runs three context injection strategies and scores each response using Claude as a judge:

Strategy     What's injected
none         No context; bare system prompt
full         Entire context file injected verbatim
compressed   Context compressed to ~30% size by Claude Haiku, then injected
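
The three strategies can be sketched as a single prompt-building function. This is a minimal illustration, not the code in src/strategies.py: BASE_PROMPT, build_system_prompt, and truncate_compressor are hypothetical names, and the truncating compressor is only a stand-in for the Claude Haiku compression call.

```python
from typing import Callable

# Hypothetical base prompt; the real one in src/strategies.py may differ.
BASE_PROMPT = "You are a helpful assistant."


def truncate_compressor(context: str, ratio: float = 0.3) -> str:
    """Stand-in compressor that keeps the first `ratio` of the characters.

    Proof asks Claude Haiku to compress the context; this placeholder only
    mimics the ~30% size target so the sketch runs without an API key.
    """
    return context[: max(1, int(len(context) * ratio))]


def build_system_prompt(
    strategy: str,
    context: str,
    compress: Callable[[str], str] = truncate_compressor,
) -> str:
    """Return the system prompt for one of: none, full, compressed."""
    if strategy == "none":
        return BASE_PROMPT
    if strategy == "full":
        return f"{BASE_PROMPT}\n\nUser context:\n{context}"
    if strategy == "compressed":
        return f"{BASE_PROMPT}\n\nUser context:\n{compress(context)}"
    raise ValueError(f"unknown strategy: {strategy}")
```

Passing the compressor in as a callable keeps the strategy logic testable offline; a real run would swap in a function that calls the model.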

Each response is scored 1–5 on three dimensions:

  • Personalization — did the response draw on specific user details?
  • Specificity — are recommendations concrete and actionable?
  • Helpfulness — does it directly address the query?
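
As a rough sketch of how a judge reply could be turned into scores, the snippet below assumes the judge returns a JSON object with one integer per dimension. That reply format and the name parse_judge_reply are assumptions, not taken from src/judge.py. The overall score is computed as the mean of the three dimensions, which matches the numbers reported under Real results but is not confirmed by the source.

```python
import json
from statistics import mean

DIMENSIONS = ("personalization", "specificity", "helpfulness")


def parse_judge_reply(reply: str) -> dict:
    """Validate a judge reply (assumed JSON) and add an overall score."""
    raw = json.loads(reply)
    scores = {}
    for dim in DIMENSIONS:
        value = raw[dim]
        # Reject booleans, non-integers, and values outside the 1-5 scale.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    # Assumption: overall is the unweighted mean of the three dimensions.
    scores["overall"] = round(mean(scores.values()), 2)
    return scores
```

Validating the range up front keeps a malformed judge reply from silently skewing the averages.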

The summary reports the quality delta between strategies alongside time to first token (TTFT) and token cost, which answers the core question: does compression preserve quality?
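
TTFT can be measured by timing the arrival of the first non-empty chunk of a streamed response. The sketch below works on any iterable of text chunks. measure_ttft and fake_stream are hypothetical names, and the fake stream stands in for a real streaming API call.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(chunks: Iterable[str]) -> Tuple[float, str]:
    """Consume a stream of text chunks; return (ttft_ms, full_text).

    The timer starts when iteration begins. For a real API call, start it
    just before the request is sent so network latency is included.
    """
    start = time.perf_counter()
    ttft_ms = None
    parts = []
    for chunk in chunks:
        if ttft_ms is None and chunk:
            ttft_ms = (time.perf_counter() - start) * 1000
        parts.append(chunk)
    if ttft_ms is None:
        raise ValueError("stream produced no text")
    return ttft_ms, "".join(parts)


def fake_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Simulated model stream: waits, then yields two chunks."""
    time.sleep(delay_s)
    yield "Hello"
    yield ", world"
```

Calling measure_ttft(fake_stream()) returns a TTFT of roughly the simulated delay plus the assembled text.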

Quickstart

1. Clone and install

git clone https://github.com/ThonyAnt/Pacific-Coding-Challenge.git
cd Pacific-Coding-Challenge
pip install -r requirements.txt

2. Add your API key

cp .env.example .env
# open .env and paste your Anthropic API key

3. Run

Browser UI (recommended):

python -m streamlit run app.py

CLI:

python eval.py fixtures/example_context.md fixtures/example_benchmark.yaml

CLI options

Arguments:
  context_file    Personal context file (.md or .txt)
  benchmark_file  Benchmark queries (.yaml)

Options:
  --strategies            Comma-separated: none,full,compressed   [default: none,full,compressed]
  --model                 Claude model for responses              [default: claude-haiku-4-5-20251001]
  --max-tokens            Max tokens per response                 [default: 400]
  --output                Save full results to JSON
  --show-responses        Print full response text per query
  --threshold             CI gate: exit 1 if best strategy score < threshold
  --baseline              Compare against saved JSON baseline, exit 1 on regression
  --regression-tolerance  Max allowed score drop vs baseline      [default: 0.25]
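
For readers wiring Proof into their own scripts, the options above map naturally onto an argparse parser. This is a reconstruction from the help text only. The actual eval.py may use a different CLI library, and the argument types here (int, float, boolean flag) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented options (types are assumed)."""
    p = argparse.ArgumentParser(prog="eval.py")
    p.add_argument("context_file", help="Personal context file (.md or .txt)")
    p.add_argument("benchmark_file", help="Benchmark queries (.yaml)")
    p.add_argument("--strategies", default="none,full,compressed",
                   help="Comma-separated: none,full,compressed")
    p.add_argument("--model", default="claude-haiku-4-5-20251001",
                   help="Claude model for responses")
    p.add_argument("--max-tokens", type=int, default=400,
                   help="Max tokens per response")
    p.add_argument("--output", help="Save full results to JSON")
    p.add_argument("--show-responses", action="store_true",
                   help="Print full response text per query")
    p.add_argument("--threshold", type=float,
                   help="CI gate: exit 1 if best strategy score < threshold")
    p.add_argument("--baseline",
                   help="Compare against saved JSON baseline")
    p.add_argument("--regression-tolerance", type=float, default=0.25,
                   help="Max allowed score drop vs baseline")
    return p
```

For example, build_parser().parse_args(["ctx.md", "bench.yaml", "--threshold", "3.5"]) yields a namespace with threshold set to 3.5 and every other option at its documented default.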

CI/CD integration

Proof ships with a GitHub Actions workflow that runs on every PR touching context logic:

# Save a baseline after a good run
python eval.py fixtures/ci_context.md fixtures/ci_benchmark.yaml --output baseline.json

# Gate on score threshold + regression check
python eval.py fixtures/ci_context.md fixtures/ci_benchmark.yaml \
  --threshold 3.5 \
  --baseline baseline.json

The workflow (.github/workflows/eval.yml) caches the baseline between runs and uploads results as artifacts. See the workflow file for full configuration.
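
The gate behaviour described by --threshold, --baseline, and --regression-tolerance can be sketched as a pure function over per-strategy overall scores. The function name gate and the per-strategy comparison are assumptions. The real check may compare different fields of the saved JSON.

```python
from typing import Dict, List, Optional


def gate(
    current: Dict[str, float],
    threshold: Optional[float] = None,
    baseline: Optional[Dict[str, float]] = None,
    tolerance: float = 0.25,
) -> List[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if threshold is not None:
        best = max(current.values())
        if best < threshold:
            failures.append(f"best score {best:.2f} is below threshold {threshold:.2f}")
    if baseline is not None:
        for strategy, old in baseline.items():
            new = current.get(strategy)
            # Assumption: a regression is a drop larger than the tolerance.
            if new is not None and old - new > tolerance:
                failures.append(
                    f"{strategy} dropped {old - new:.2f} (allowed {tolerance:.2f})"
                )
    return failures
```

A CLI wrapper would then exit with status 1 whenever the returned list is non-empty, which is what makes the CI job fail.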

Benchmark format

queries:
  - id: q1
    category: career
    query: "What projects should I focus on for my internship applications?"
  - id: q2
    category: technical
    query: "What should I learn next as an engineer?"
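
A small validator can catch malformed benchmark files before any API calls are made. The sketch below assumes that id, category, and query are all required and that ids must be unique. Neither rule is stated by the source. load_benchmark assumes PyYAML is installed.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("id", "category", "query")


def validate_benchmark(data: Any) -> List[Dict[str, str]]:
    """Check the parsed YAML structure and return the list of queries."""
    if not isinstance(data, dict):
        raise ValueError("benchmark file must be a mapping with a 'queries' key")
    queries = data.get("queries")
    if not isinstance(queries, list) or not queries:
        raise ValueError("benchmark must contain a non-empty 'queries' list")
    seen = set()
    for index, entry in enumerate(queries):
        if not isinstance(entry, dict):
            raise ValueError(f"query #{index} is not a mapping")
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"query #{index} is missing keys: {missing}")
        if entry["id"] in seen:
            raise ValueError(f"duplicate query id: {entry['id']}")
        seen.add(entry["id"])
    return queries


def load_benchmark(path: str) -> List[Dict[str, str]]:
    """Read a benchmark file from disk (requires PyYAML)."""
    import yaml  # imported lazily so validate_benchmark works without it

    with open(path, encoding="utf-8") as handle:
        return validate_benchmark(yaml.safe_load(handle))
```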

Real results

Scores from a 7-query run against a real personal context file:

Strategy     Overall  Personalization  Specificity  Helpfulness  Avg TTFT
none         2.29     1.00             2.43         3.43         585ms
full         4.38     4.71             3.86         4.57         938ms
compressed   4.57     4.86             4.43         4.43         575ms

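Compressed context outscored full context overall (4.57 vs. 4.38) while using 34% fewer tokens and responding faster (575ms vs. 938ms average TTFT), although full context scored slightly higher on helpfulness (4.57 vs. 4.43). This suggests compression concentrates the signal rather than losing it, though a 7-query run is a small sample.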
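
The deltas behind this comparison can be recomputed directly from the table. The snippet below hard-codes the reported scores and checks that each Overall value equals the rounded mean of the three dimensions. That matches these numbers, although the source does not state how Overall is computed. The helper names are illustrative.

```python
# Scores copied from the results table above.
RESULTS = {
    "none": {"personalization": 1.00, "specificity": 2.43, "helpfulness": 3.43, "overall": 2.29},
    "full": {"personalization": 4.71, "specificity": 3.86, "helpfulness": 4.57, "overall": 4.38},
    "compressed": {"personalization": 4.86, "specificity": 4.43, "helpfulness": 4.43, "overall": 4.57},
}


def overall_from_dimensions(row: dict) -> float:
    """Mean of the three dimension scores, rounded to two decimals."""
    total = row["personalization"] + row["specificity"] + row["helpfulness"]
    return round(total / 3, 2)


def delta(a: str, b: str, metric: str = "overall") -> float:
    """Score of strategy `a` minus strategy `b` on the given metric."""
    return round(RESULTS[a][metric] - RESULTS[b][metric], 2)


if __name__ == "__main__":
    for name, row in RESULTS.items():
        print(name, overall_from_dimensions(row), row["overall"])
    print("full vs none:", delta("full", "none"))
    print("compressed vs full:", delta("compressed", "full"))
    print("compressed vs full (helpfulness):", delta("compressed", "full", "helpfulness"))
```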

File structure

.
├── eval.py                        # CLI entry point
├── app.py                         # Streamlit browser UI
├── src/
│   ├── strategies.py              # Context strategies (none / full / compressed)
│   ├── judge.py                   # Claude-as-judge scoring
│   └── report.py                  # Rich terminal output
├── fixtures/
│   ├── example_context.md         # Sample personal context
│   ├── example_benchmark.yaml     # Sample benchmark queries
│   ├── ci_context.md              # CI test fixture
│   └── ci_benchmark.yaml
├── docs/
│   └── onepager.pdf               # Project write-up
├── .github/workflows/eval.yml     # CI/CD integration
├── .env.example
└── requirements.txt
