[evals] Add paraphrase and translation robustness PPL evals #5096

@dlwh

Description

🤖 Part of #5005.

Add paired PPL/gap evals that test whether Marin models capture equivalent content across surface forms, rather than only memorized text forms.

Initial sources:

Score both unconditional BPB over variants and conditional likelihood of a target variant given the source variant.
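A minimal sketch of the two scores, assuming per-token natural-log probabilities are already available from the model. The function names and the convention of slicing off the source tokens are hypothetical illustrations, not the Marin eval API.

```python
import math
from typing import Sequence


def bits_per_byte(token_logprobs: Sequence[float], text: str) -> float:
    """Negative log-likelihood in bits, normalized by the UTF-8 byte length of `text`.

    Assumes `token_logprobs` are natural-log probabilities of the tokens of `text`.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_nats = -sum(token_logprobs)
    return total_nats / (math.log(2) * n_bytes)


def conditional_bits_per_byte(
    all_logprobs: Sequence[float], n_source_tokens: int, target_text: str
) -> float:
    """BPB of the target variant given the source variant.

    Assumes `all_logprobs` was computed on source + target in one pass, so the
    first `n_source_tokens` entries belong to the source and are skipped.
    """
    return bits_per_byte(all_logprobs[n_source_tokens:], target_text)


def conditioning_gap(unconditional_bpb: float, conditional_bpb: float) -> float:
    """How many bits per byte the source variant saves when predicting the target."""
    return unconditional_bpb - conditional_bpb
```

Normalizing by bytes rather than tokens keeps the numbers comparable across tokenizers and across languages in the translation case.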

Definition of Done

  • Add at least one paraphrase source and one translation source.
  • Define a stable text linearization for paired examples.
  • Report variant-level deltas, not only aggregate BPB.
  • Keep held-out splits separate from any data-mixture iteration loop.
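One possible shape for the linearization and the variant-level report, as a sketch only. `PairedExample`, `SEPARATOR`, the variant labels, and the row format are all assumptions made for illustration; the actual schema is still to be decided in this issue.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

SEPARATOR = "\n\n"  # assumption: a fixed separator keeps the linearization stable


@dataclass(frozen=True)
class PairedExample:
    pair_id: str
    source: str
    target: str
    variant: str  # hypothetical label, e.g. "paraphrase" or "translation"


def linearize(ex: PairedExample) -> Tuple[str, str]:
    """Return (prefix, continuation); only the continuation is scored conditionally."""
    prefix = ex.source.strip() + SEPARATOR
    return prefix, ex.target.strip()


def variant_deltas(rows: Iterable[Tuple[str, float, float]]) -> Dict[str, float]:
    """Mean (target_bpb - source_bpb) per variant.

    `rows` holds (variant, source_bpb, target_bpb) triples, one per paired example.
    """
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for variant, source_bpb, target_bpb in rows:
        sums[variant] += target_bpb - source_bpb
        counts[variant] += 1
    return {variant: sums[variant] / counts[variant] for variant in sums}
```

Reporting the delta per variant, rather than one pooled BPB, makes it visible when a model does well on the source form but degrades on a paraphrase or translation of the same content.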
