[evals] Add paraphrase and translation robustness PPL evals

🤖 Part of #5005.

## Description
Add paired PPL/gap evals that test whether Marin can model equivalent content across different surface forms, rather than only the text forms it has memorized.

Initial sources:
- PAWS / PAWS-X: https://github.com/google-research-datasets/paws
- ParaNMT-50M: https://aclanthology.org/P18-1042/
- ParaSCI: https://arxiv.org/abs/2101.08382
- FLORES-200: https://huggingface.co/datasets/facebook/flores
- Project CodeNet for code translation-style pairs: https://github.com/IBM/Project_CodeNet

Score both the unconditional bits-per-byte (BPB) of each variant and the conditional likelihood of a target variant given its source variant.
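
As a rough illustration of the two scores, here is a minimal sketch. It assumes per-token natural-log probabilities are already available from whatever scoring path the eval harness uses; the function names and the boolean `target_mask` are hypothetical and not an existing Marin API.

```python
import math
from typing import Sequence


def bits_per_byte(token_logprobs: Sequence[float], text: str) -> float:
    """BPB = -(sum of natural-log token probs) / (ln 2 * UTF-8 byte count)."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("cannot compute BPB of empty text")
    return -sum(token_logprobs) / (math.log(2) * n_bytes)


def conditional_bits_per_byte(
    token_logprobs: Sequence[float],
    target_mask: Sequence[bool],
    target_text: str,
) -> float:
    """BPB of the target variant only, scored with the source variant as context.

    `token_logprobs` covers the whole linearized source+target sequence;
    `target_mask` (hypothetical) marks which tokens belong to the target.
    """
    if len(token_logprobs) != len(target_mask):
        raise ValueError("token_logprobs and target_mask must have equal length")
    target_logprobs = [lp for lp, keep in zip(token_logprobs, target_mask) if keep]
    return bits_per_byte(target_logprobs, target_text)


def paired_gap(bpb_reference: float, bpb_variant: float) -> float:
    """Positive gap means the variant is harder for the model than the reference."""
    return bpb_variant - bpb_reference
```

Normalizing by bytes rather than tokens keeps the scores comparable across variants that tokenize to different lengths, which matters for translation pairs in particular.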

### Definition of Done
- Add at least one paraphrase source and one translation source.
- Define a stable text linearization for paired examples.
- Report variant-level deltas, not only aggregate BPB.
- Keep held-out splits separate from any data-mixture iteration loop.
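
One possible shape for the linearization and the variant-level report, as a sketch only. The `PairedExample` fields, the blank-line separator, and the `"source"` reference key are all assumptions made for illustration; the actual template is what this issue asks to define.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PairedExample:
    # Hypothetical record layout for one source/target pair.
    pair_id: str
    source: str
    target: str
    relation: str  # e.g. "paraphrase" or "translation"


def linearize(ex: PairedExample, sep: str = "\n\n") -> Tuple[str, int]:
    """Return the linearized text and the character offset where the target starts.

    Stripping whitespace and using a fixed separator keeps the output stable
    across reruns, so scores stay comparable between model checkpoints.
    """
    prefix = f"{ex.source.strip()}{sep}"
    return prefix + ex.target.strip(), len(prefix)


def variant_deltas(
    scores: Dict[str, Dict[str, float]], reference: str = "source"
) -> Dict[str, Dict[str, float]]:
    """For each pair, report each variant's BPB minus the reference variant's BPB."""
    deltas: Dict[str, Dict[str, float]] = {}
    for pair_id, by_variant in scores.items():
        base = by_variant[reference]
        deltas[pair_id] = {
            name: bpb - base for name, bpb in by_variant.items() if name != reference
        }
    return deltas
```

The returned offset lets a scorer mask out source tokens when computing the conditional likelihood, and the per-pair deltas can be aggregated afterwards without losing the variant-level view.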