
[evals] Add capped paired robustness PPL slices #5120

Open
dlwh wants to merge 1 commit into main from agent/20260423-fix-5096

@dlwh (Member) commented Apr 23, 2026

Add opt-in paired robustness PPL wiring for PAWS and FLORES with explicit held-out splits and capped sample sizes. Materialize stable source, target, and target_given_source text views into raw eval shards, then include them in a dedicated perplexity-gap experiment without touching default validation bundles. Add synthetic coverage for split selection, filtering, and linearization/path registration.
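
To make the three text views concrete, here is a minimal sketch of how one pair could be linearized and how a cap could be applied. It is not the code in this PR: `linearize_pair`, `capped_pairs`, the newline separator, and the empty-side filter are hypothetical, and rendering `target_given_source` as a plain concatenation is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator


def linearize_pair(source: str, target: str, separator: str = "\n") -> dict[str, str]:
    """Render one (source, target) pair as three text views.

    The view names come from the PR description. The concatenated form of
    target_given_source and the separator are assumptions for illustration.
    """
    return {
        "source": source,
        "target": target,
        "target_given_source": f"{source}{separator}{target}",
    }


def capped_pairs(pairs: Iterable[tuple[str, str]], max_pairs: int) -> Iterator[tuple[str, str]]:
    """Yield at most max_pairs pairs, skipping pairs with an empty side.

    The filter rule is hypothetical; the cap is applied after filtering so
    that dropped pairs do not count against the limit.
    """
    valid = ((s, t) for s, t in pairs if s.strip() and t.strip())
    return islice(valid, max_pairs)
```

Applying the cap after filtering keeps the slice size stable when some rows are dropped, which matters if the capped slices are meant to be compared across runs.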

Fixes #5096

@claude (Bot, Contributor) commented Apr 23, 2026

Claude finished @dlwh's task in 4m 6s.


Code review

No issues found. Checked for bugs and CLAUDE.md / AGENTS.md compliance.

Scope covered:

  • Import resolution (fray.v2.ResourceConfig, marin.evaluation.perplexity_gap.{RawTextEvaluationDataset, raw_text_dataset}, marin.execution.executor.{ExecutorStep, this_output_path, InputName}, marin.execution.remote.remote, zephyr.write_jsonl_file, experiments.defaults.default_raw_validation_sets) — all resolve.
  • API usage: ResourceConfig.with_cpu(cpu=…, ram=…, disk=…), ResourceConfig.with_tpu("v5p-8", regions=[…]), remote(fn, resources=…, pip_dependency_groups=…), ExecutorStep.cd(name) returning InputName, write_jsonl_file(records, path) returning {"path", "count"}, field(default_factory=this_output_path) — all match current signatures.
  • Logical consistency: raw_step_key uniqueness across the four slice entries holds (family/name/split), _flush_view_shard's buffer reset mutates the shared dict correctly, zero-pair ValueError guard is reached after the read loop, and the HuggingFace streaming path is closed out by _flush_view_shard at end-of-iteration.
  • Test expectations align with actual object shapes: InputName.name format ("{view}/shard-*.jsonl.gz") matches raw_step.cd(posixpath.join(view.value, "shard-*.jsonl.gz")), dataset keys match dataset_key(view), and tag tuples include split:*, family:*, view:*.
  • CLAUDE.md / AGENTS.md compliance: top-level imports, StrEnum for PairedRobustnessFamily/PairedTextView, dataclasses over dicts, logging over print, raise ValueError on invariants in __post_init__ and missing fields, no backward-compat shims, no *_utils.py, top-level constants for caps/shard sizes.
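
The buffer-flush and zero-pair checks described above can be pictured with a small sketch. This is an assumption-laden illustration, not the PR's implementation: `write_views`, `shard_glob`, and the `write_shard` callback are hypothetical stand-ins, and only the enum name `PairedTextView` and the `"{view}/shard-*.jsonl.gz"` pattern are taken from the review.

```python
import posixpath
from enum import Enum
from typing import Callable, Iterable


class PairedTextView(str, Enum):
    # The review says the PR uses StrEnum; (str, Enum) is used here only so
    # the sketch also runs on Python < 3.11. Member values are assumed.
    SOURCE = "source"
    TARGET = "target"
    TARGET_GIVEN_SOURCE = "target_given_source"


def shard_glob(view: PairedTextView) -> str:
    """Glob for one view's shards, e.g. 'source/shard-*.jsonl.gz'."""
    return posixpath.join(view.value, "shard-*.jsonl.gz")


def write_views(
    pairs: Iterable[tuple[str, str]],
    shard_size: int,
    write_shard: Callable[[PairedTextView, int, list[dict]], None],
) -> int:
    """Buffer records per view and flush them in fixed-size shards."""
    buffers: dict[PairedTextView, list[dict]] = {view: [] for view in PairedTextView}
    shard_index = 0
    num_pairs = 0

    def flush() -> None:
        nonlocal shard_index
        for view, records in buffers.items():
            if records:
                write_shard(view, shard_index, list(records))
                records.clear()  # reset in place so the shared dict keeps its lists
        shard_index += 1

    for source, target in pairs:
        buffers[PairedTextView.SOURCE].append({"text": source})
        buffers[PairedTextView.TARGET].append({"text": target})
        buffers[PairedTextView.TARGET_GIVEN_SOURCE].append({"text": f"{source}\n{target}"})
        num_pairs += 1
        if num_pairs % shard_size == 0:
            flush()

    if num_pairs % shard_size != 0:
        flush()  # close out the final partial shard at end of iteration
    if num_pairs == 0:
        raise ValueError("no valid pairs were read")  # guard runs after the read loop
    return num_pairs
```

Clearing each list in place, rather than rebinding a new dict, is what lets a helper reset buffers that the caller still holds a reference to.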

Labels

agent-generated Created by automation/agent

Development

Successfully merging this pull request may close these issues.

[evals] Add paraphrase and translation robustness PPL evals
