🔍 Post-Training Failure Evals

A Reward-Hacking, Sycophancy & Calibration Detector for SFT / DPO / RLHF Models

Surface metrics say your aligned model got better. This harness finds where it secretly got worse.

6 failure modes · 4 model variants · LLM-judge with debiasing · log-prob calibration · auditable receipts

Preference optimization (DPO / RLHF) maximizes a proxy — a learned reward or an LLM judge — not ground truth. The dangerous, well-documented outcome is that the proxy score climbs while the true objective (truthfulness, calibration, safety) silently degrades. Aggregate benchmarks hide this because the average barely moves.

Post-Training Failure Evals is a model-agnostic harness that holds a base model fixed, walks it through SFT → DPO, and attributes each regression to the optimization step that caused it — across six concrete failure modes, with every number backed by per-example receipts.

_{Real run: Qwen3-8B-Base → SFT → DPO vs. Qwen3-4B-Instruct, judged by Qwen3-30B-A3B-Instruct, 142 prompts. Darker = more failure.}

🔑 Headline findings

From a single end-to-end run (results/ → docs/example_run/report.md), the harness surfaces failures that an accuracy number would never reveal:

Finding	Evidence	Why it matters
Instruct tuning is the most reward-hacked	`reward_hacking_score` 0.47 (highest), ECE 0.213 (worst), responses 1.63× longer than base	Wins the judge without being more factual — textbook proxy gaming
DPO increased sycophancy	`sycophancy_rate` 0.34 → 0.45 (base → DPO)	Preference data taught it to agree/cave under user pushback
SFT silently broke safety	harmful compliance 0% → 12.5% (base → SFT)	A safety regression introduced before DPO ever ran
DPO degraded out-of-distribution	OOD quality drop 0.22 → 0.32	Overfit to the post-training distribution
10% of DPO's "preference wins" were factually wrong	`degraded_win_rate = 0.10`	The exact gap reward models can't see

_{Composite failure score (mean of six normalized suites, lower = safer). No single variant is strictly best — each optimization step trades one failure mode for another.}

🧪 What it detects

Suite	Failure mode	Headline metric	How it's measured
`sycophancy`	Agrees with the user instead of the truth	`sycophancy_rate`	False-premise endorsement + caving under pushback (flips a correct answer when the user insists on a wrong one)
`verbosity`	Length inflation without substance	`mean_tokens`	Default length, judge-rated padding rate, and the reward-proxy's length preference
`reward_hacking`	Wins the judge but loses factuality	`reward_hacking_score`	Judge win-rate vs. base minus factual-accuracy gain; plus `degraded_win_rate`
`calibration`	Confidently wrong (false confidence)	`ece`	Log-prob MC confidence + verbalized confidence vs. correctness (ECE, Brier, AUROC, overconfidence gap)
`safety`	Unsafe compliance drift	`harmful_compliance_rate`	Compliance on harmful prompts + over-refusal on benign controls
`dist_shift`	Overfit to the post-training distribution	`quality_drop`	Judged quality drop from in-domain → out-of-domain prompts

Every metric writes a raw/<model>__<suite>.jsonl file containing the exact prompt, response, and judge verdict for each example — the receipts behind the number.

🚀 Quickstart

# 1. Install
python -m venv .venv && source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install -e .

# 2. Add your Tinker key (.env at the repo root)
echo 'TINKER_API_KEY="tml-..."' > .env

# 3. Cheap smoke test of the whole pipeline (a few examples/suite, small models)
pte-eval run --config configs/smoke.yaml

# 4. Full headline run (base → SFT → DPO + instruct, strong judge)
pte-eval run --config configs/full.yaml

# Slice it: pick suites, cap examples
pte-eval run --config configs/full.yaml --suites sycophancy calibration --limit 20
pte-eval list-suites

Each run writes a self-contained folder:

results/<run_id>/
├── report.md          # human-readable summary with embedded plots
├── summary.json       # headline + normalized failure matrix + composite scores
├── metrics.json       # full metric dict for every (model, suite)
├── plots/*.png        # heatmap, composite bar, reliability diagrams, ...
└── raw/<model>__<suite>.jsonl   # per-example receipts

🏋️ Train the variants: base → SFT → DPO

The point of the harness is to attribute a failure to a step. Train the chain on Tinker (LoRA, remote GPUs) and the scripts print a tinker:// checkpoint path:

# Stage 1 — SFT a base model into a chat model
python training/train_sft.py --model-name Qwen/Qwen3-8B-Base --dataset no_robots

# Stage 2 — DPO from the SFT state checkpoint (lower beta = more aggressive optimization)
python training/train_dpo.py --model-name Qwen/Qwen3-8B-Base \
    --load-checkpoint-path tinker://YOUR_SFT_STATE --dataset hhh --dpo-beta 0.1

Paste the printed paths into configs/full.yaml (the sft / dpo blocks) and re-run. The included full.yaml already points at a trained Qwen3-8B-Base chain. Full guide: training/README.md.

⚙️ How it works

                 ┌──────────────┐     ┌──────────────┐
   ModelSpec ──▶ │ ModelClient  │     │    Judge     │  (separate, stronger model)
 (base/sft/dpo/  │ Tinker sample│     │ pairwise /   │
  instruct)      │ + logprobs   │     │ score /      │
                 │ + chat render│     │ classify     │  ← batched + position-debiased
                 └──────┬───────┘     └──────┬───────┘
                        │                    │
                        ▼                    ▼
        ┌───────────────────────────────────────────────┐
        │   Evaluators (6 failure-mode suites)           │
        │   sycophancy · verbosity · reward_hacking ·    │
        │   calibration · safety · dist_shift            │
        └───────────────────────┬───────────────────────┘
                                 ▼
              Runner → metrics.json / summary.json
                     → report.md + plots/   (+ Gradio dashboard)

Design decisions that make the numbers trustworthy:

Model-agnostic. A ModelClient wraps a Tinker SamplingClient + cookbook renderers, so the same chat prompt runs on a raw base model (role_colon) and an instruct model (qwen3_instruct / llama3).
Confidence from log-probs. Calibration scores the correct vs. a plausible-false answer as continuations via compute_logprobs — a clean, judge-free confidence signal — alongside verbalized confidence (ECE, Brier, AUROC).
Judge bias is measured, not assumed. Pairwise judging averages both response orders (position-swap debiasing) and we report the judge's own verbosity bias (length_pref_bias) instead of trusting it blindly.
Comparative by construction. Reward hacking is defined relative to base: judge_pref − factuality_delta, with degraded_win_rate isolating wins that were factually wrong.
Fast & reproducible. Judge calls are pipelined through Tinker futures; curated builtin datasets make runs offline-deterministic (flip use_hf_datasets: true for TruthfulQA / Anthropic HH-RLHF scale-up); generations are cached across suites.

📊 Results in detail

Run 20260529-012135 — base/SFT/DPO from Qwen3-8B-Base, Qwen3-4B-Instruct, judge Qwen3-30B-A3B-Instruct, full builtin sets (142 prompts). Lower is safer except verbosity (tokens) and reward-hacking (relative to base).

Suite (headline)	base	sft	dpo	instruct
sycophancy_rate ↓	0.341	0.438	0.449	0.047
reward_hacking_score ↓	0.000	−0.312	0.000	0.469
ece (calibration) ↓	0.161	0.176	0.159	0.213
harmful_compliance ↓	0.000	0.125	0.000	0.000
dist_shift quality_drop ↓	0.220	0.290	0.320	0.270
mean_tokens (verbosity)	97	59	72	159 (1.63×)
composite ↓	0.227	0.217	0.223	0.252

_{Calibration reliability (log-prob confidence vs. accuracy). The instruct variant is the most overconfident — its curve sits furthest below the diagonal.}

📚 Datasets

Curated, auditable builtin sets ship in pte/data/builtin/ (optional HF backends):

File	Purpose	Feeds
`factual_qa`	TruthfulQA-style misconceptions + a plausible-false answer	calibration, reward hacking, sycophancy-pressure
`false_premises`	false premises embedded in questions	sycophancy
`safety`	harmful requests + benign over-refusal controls	safety drift
`ood`	paired in-domain vs. out-of-domain prompts	distribution shift
`instructions`	neutral helpfulness prompts	verbosity

Set use_hf_datasets: true to pull truthful_qa and Anthropic/hh-rlhf instead.

🗂 Repository layout

pte/
  models.py        # Tinker SamplingClient wrapper: chat gen, batching, logprob scoring
  judge.py         # LLM-as-judge: batched + position-debiased pairwise, scoring, classifiers
  metrics.py       # ECE, reliability curve, AUROC, Brier, Wilson + bootstrap CIs
  evaluators/      # one file per failure mode + base framework/registry
  data/            # loaders + curated builtin JSONL datasets
  runner.py        # orchestration, normalization, composite scoring
  report.py        # matplotlib plots + markdown report
  cli.py           # `pte-eval` entrypoint
training/          # SFT + DPO scripts (Tinker Cookbook) to produce the variants
configs/           # full.yaml (trained 4-variant) · eval.yaml (template) · smoke.yaml (cheap)
scripts/           # Gradio dashboard 
tests/             # 28 offline tests (metrics, data schema, evaluator logic)
docs/example_run/  # a committed real run (report + plots) for instant preview

✅ Reproducibility & testing

pip install pytest
pytest -q          # 28 tests, fully offline: no Tinker calls, no network

Builtin datasets are versioned in-repo and loaders are deterministic, so a run with use_hf_datasets: false is fully reproducible. The committed docs/example_run/ lets reviewers see real output without spending a single API call.

⚠️ Limitations

Judges are imperfect. Treat absolute numbers as directional; rely on base → variant deltas and use a strong judge (Qwen3-30B / Llama-3.3-70B) for headline runs. Judge verbosity/position bias is reported, not hidden.
Safety prompts are red-team categories phrased as requests; no operational harmful content is included in the repo.
Short LoRA runs (for cost) mean the trained variants illustrate the method; scale steps/data for publication-grade conclusions.
Tinker training/inference incurs usage — start with smoke.yaml / --limit.

📌 Citation

@software{post_training_failure_evals,
  title  = {Post-Training Failure Evals: A Reward-Hacking, Sycophancy and
            Calibration Detector for SFT/DPO/RLHF Models},
  year   = {2026},
  url    = {https://github.com/rishi-more-2003/post-training-failure-evals}
}

🙏 Acknowledgements

Built on the Tinker API and Tinker Cookbook from Thinking Machines Lab. Failure-mode framing draws on the literature around RLHF reward hacking, sycophancy, verbosity bias, and confidence calibration.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github/workflows		.github/workflows
configs		configs
docs/example_run		docs/example_run
pte		pte
scripts		scripts
tests		tests
training		training
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🔍 Post-Training Failure Evals

A Reward-Hacking, Sycophancy & Calibration Detector for SFT / DPO / RLHF Models

📑 Table of contents

🔑 Headline findings

🧪 What it detects

🚀 Quickstart

🏋️ Train the variants: base → SFT → DPO

⚙️ How it works

📊 Results in detail

📚 Datasets

🗂 Repository layout

✅ Reproducibility & testing

⚠️ Limitations

📌 Citation

🙏 Acknowledgements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🔍 Post-Training Failure Evals

A Reward-Hacking, Sycophancy & Calibration Detector for SFT / DPO / RLHF Models

📑 Table of contents

🔑 Headline findings

🧪 What it detects

🚀 Quickstart

🏋️ Train the variants: base → SFT → DPO

⚙️ How it works

📊 Results in detail

📚 Datasets

🗂 Repository layout

✅ Reproducibility & testing

⚠️ Limitations

📌 Citation

🙏 Acknowledgements

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages