🤖 Part of #5005.
Description
Add paired PPL/gap evals that test whether Marin models equivalent content across surface forms instead of only memorized text forms.
Initial sources:
Score both unconditional BPB over variants and conditional likelihood of a target variant given the source variant.
Definition of Done
- Add at least one paraphrase source and one translation source.
- Define a stable text linearization for paired examples.
- Report variant-level deltas, not only aggregate BPB.
- Keep held-out splits separate from any data-mixture iteration loop.
🤖 Part of #5005.
Description
Add paired PPL/gap evals that test whether Marin models equivalent content across surface forms instead of only memorized text forms.
Initial sources:
Score both unconditional BPB over variants and conditional likelihood of a target variant given the source variant.
Definition of Done