[evals] Add long-tail PPL gap reruns by dlwh · Pull Request #5075 · marin-community/marin

dlwh · 2026-04-22T20:04:39Z

Add a long-tail PPL registry, runnable public HF-backed slices, and durable Marin-vs-baseline perplexity-gap runners. Raw HF eval datasets now preserve non-validation splits so SVG Stack and VerilogEval can be evaluated without cache materialization.

Part of #5005

dlwh · 2026-04-22T20:04:49Z

🤖 Specification for the long-tail PPL rerun PR.

Problem: the perplexity-gap runner could only read the hardcoded validation split from raw HF datasets, which made runnable public eval slices like SVG Stack val and VerilogEval test silently unusable without cache materialization. The checkpoint-confidence epic also needed a reusable registry for long-tail symbolic/raw artifact evals instead of one-off scripts.

Approach: RawTextEvaluationDataset now carries a split field, _to_dataset_component passes that split into HfDatasetSourceConfig, and Levanter's raw document iterator asks each DatasetComponent for its configured split. File-backed sources still require validation so existing path-based evals keep the old contract. The new long-tail registry records the full first-pass target map from #5005/#5056-#5062, while long_tail_ppl_runnable.py exposes the currently runnable HF-backed SVG Stack and VerilogEval slices.

Key code:

source = component.source.get_shard_source(component.split)

return DatasetComponent(source=source, format=dataset_format, tags=list(config.tags), split=config.split)
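A minimal sketch of the split threading described under Approach, for readers without the diff open. The classes below are simplified stand-ins, not the real RawTextEvaluationDataset, DatasetComponent, or HfDatasetSourceConfig; the field names `hf_dataset_id` and `input_path`, the `hf://` URI form, and the example dataset id are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RawEvalDataset:
    """Stand-in for RawTextEvaluationDataset (hypothetical field names)."""
    hf_dataset_id: Optional[str] = None
    input_path: Optional[str] = None
    split: str = "validation"
    tags: tuple = ()


@dataclass(frozen=True)
class Component:
    """Stand-in for DatasetComponent: it carries its own configured split."""
    source_uri: str
    split: str
    tags: list


def to_component(config: RawEvalDataset) -> Component:
    """Build a component, preserving the split for HF-backed sources only."""
    if config.hf_dataset_id is not None:
        source_uri = f"hf://{config.hf_dataset_id}"
    elif config.input_path is not None:
        # File-backed sources keep the old validation-only contract.
        if config.split != "validation":
            raise ValueError(
                f"file-backed sources only support 'validation', got {config.split!r}"
            )
        source_uri = config.input_path
    else:
        raise ValueError("either hf_dataset_id or input_path is required")
    return Component(source_uri=source_uri, split=config.split, tags=list(config.tags))


def shard_source_for(component: Component) -> str:
    """Ask the component for its split instead of hardcoding 'validation'."""
    return f"{component.source_uri}#{component.split}"


# Usage: an HF-backed slice evaluated on its test split (hypothetical dataset id).
hf_slice = to_component(RawEvalDataset(hf_dataset_id="example-org/example-eval", split="test"))
print(shard_source_for(hf_slice))
```

The point of the design is that the split travels with the component, so the document iterator never needs to know which datasets lack a validation split.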

Tests: tests/evals/test_long_tail_ppl.py checks registry coverage, deterministic raw paths/tags, runnable HF source ids/text keys/splits, smoke wrapper shape, and split preservation/rejection behavior. I also ran the long-tail runner in dry-run mode and uv run pytest tests/evals/test_long_tail_ppl.py tests/execution/test_executor.py -q.

claude · 2026-04-22T20:04:54Z

Claude finished @dlwh's task in 7m 4s.

Code review

Reviewed the PR for bugs and AGENTS.md compliance. The core change, threading a split field from RawTextEvaluationDataset through _to_dataset_component → DatasetComponent → iter_raw_text_documents's get_shard_source(component.split) call, looks correctly wired, and file-backed sources keep the old validation-only contract via an explicit ValueError in lib/marin/src/marin/evaluation/perplexity_gap.py.

Findings

1 minor: tautological assertion in tests/evals/test_long_tail_ppl.py:103 — posted inline (#discussion_r3126641354).

No bugs, import errors, unresolved references, or dependency-direction violations found.

False positives considered and dismissed

- default_model_perplexity_gap naming ("No default_* wrappers that obscure underlying mechanisms."): the function is pre-existing and not introduced by this PR.
- Magic number 32 at lib/levanter/src/levanter/analysis/perplexity_gap.py:238: pre-existing; this PR only changes line 434 (the get_shard_source(component.split) call).
- assert len(RUNNABLE_LONG_TAIL_PPL_SLICES) == 4 at tests/evals/test_long_tail_ppl.py:79: paired with an explicit set-equality check on the following lines that already fully pins the registry contents, so it is a redundant-but-defensible sanity guard rather than a standalone tautology.
• Branch: codex/long-tail-ppl-gap-reruns

claude · 2026-04-22T20:11:50Z

+
+
+def test_smoke_wrapper_uses_tiny_doc_cap_and_runnable_bundle():
+    assert MAX_DOCS_PER_DATASET == 32


🤖 This line is a tautological test: MAX_DOCS_PER_DATASET is imported on L22 straight from the smoke module where it is literally MAX_DOCS_PER_DATASET = 32 (exp_model_perplexity_gap_long_tail_smoke.py#L10-L12), so the assertion only fires if someone edits that literal — it doesn't verify any behavior. The wiring check on L105 (STEP.config.max_docs_per_dataset == 32) already covers that the constant is threaded into the step.

This is the pattern explicitly called out in AGENTS.md under Testing ("Do not write tautological tests: tests must fail if behavior is wrong, not just if implementation changes") and LLM-Generated Code Pitfalls ("Tautological tests (type exists, constant has value)"). Suggest deleting this line.
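To make the distinction concrete, here is a small sketch contrasting the tautological assertion with a test that fails when behavior is wrong. `StepConfig`, `build_step`, and `take_docs` are hypothetical helpers written for this illustration; they are not the PR's actual smoke module or step API.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, List

MAX_DOCS_PER_DATASET = 32  # stand-in for the smoke module's constant


@dataclass(frozen=True)
class StepConfig:
    """Hypothetical step config carrying the per-dataset document cap."""
    max_docs_per_dataset: int


def build_step() -> StepConfig:
    """Hypothetical builder that threads the constant into the step."""
    return StepConfig(max_docs_per_dataset=MAX_DOCS_PER_DATASET)


def take_docs(docs: Iterable[str], config: StepConfig) -> List[str]:
    """Read at most config.max_docs_per_dataset documents."""
    return list(islice(docs, config.max_docs_per_dataset))


def test_constant_has_value():
    # Tautological: restates the literal, so it only fails if the literal is edited.
    assert MAX_DOCS_PER_DATASET == 32


def test_step_caps_documents():
    # Behavioral: fails if the cap is not threaded into the step or not enforced.
    docs = (f"doc-{i}" for i in range(100))
    assert len(take_docs(docs, build_step())) == MAX_DOCS_PER_DATASET


def test_cap_is_enforced_for_any_value():
    # Behavioral and independent of the constant: checks truncation itself.
    assert take_docs(iter(["a", "b", "c", "d"]), StepConfig(max_docs_per_dataset=2)) == ["a", "b"]
```

Only the last two tests would catch a regression in how the cap is wired or applied; the first passes regardless of what the step does.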

🤖 Removed the imported constant assertion in 91f43b7. The smoke test now checks the runnable bundle and the step config doc cap directly.

dlwh · 2026-04-22T20:35:40Z

🤖 Updated after review: removed test_long_tail_registry_covers_expected_families_and_anchors, test_runnable_registry_is_separate_and_hf_backed, and test_smoke_wrapper_uses_tiny_doc_cap_and_runnable_bundle. The remaining tests cover nontrivial path/tag rendering, markdown rendering, filtering, and raw-dataset split behavior.

dlwh · 2026-04-22T20:50:58Z

🤖 Removed test_family_filter_returns_only_requested_family as well. The remaining long-tail eval tests now avoid registry self-checks and focus on rendered dataset contracts plus raw dataset split behavior.

dlwh · 2026-04-22T21:08:15Z

🤖 Removed the long-tail smoke experiment entrypoint and added an issue-linked module docstring to the remaining runnable long-tail gap experiment.

Adds HF-backed runnable slices for the GAME_MUSIC family on top of the long-tail PPL runnable registry introduced in #5075:

- lichess_pgn_2013_06: Icannos/lichess_games @ config 2013-06, text column carries full PGN (headers, movetext, NAGs, comments, result); CC0.
- irishman_abc: sander-wood/irishman validation split, column is the literal "abc notation" string; MIT / public domain.
- melodyhub_abc_input: sander-wood/melodyhub validation split, ABC body with task-tag prefixes so the gap report can separate plain-ABC PPL from ABC-with-structural-prefix PPL.

Also adds a focused pilot gap-report experiment (exp_model_perplexity_gap_symbolic_notation_pilot.py) running Marin-8B against Llama-3.1-8B, Qwen3-8B-Base, and Gemma-2-9b on just the game/music slices, with max_docs_per_dataset=2048 so each slice's compressed-byte volume stays roughly Paloma-sized (PGN/ABC docs are much shorter than an average Paloma doc).

Humdrum/Kern has no clean HF mirror and remains as a stub in the long_tail_ppl.py registry for a later download-step mirror.

Follow-up tokenizer-axis comparison filed as #5079.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
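One way to sanity-check a document cap such as max_docs_per_dataset=2048 against a byte budget is to measure the compressed size of the capped slice. The sketch below is only an illustration: per-document gzip is an assumed proxy, and `compressed_bytes_under_cap` is a hypothetical helper, not how the Marin gap report computes compressed bytes.

```python
import gzip
from itertools import islice
from typing import Iterable, Tuple


def compressed_bytes_under_cap(docs: Iterable[str], max_docs: int) -> Tuple[int, int]:
    """Return (documents_used, total_gzip_bytes) for the first max_docs documents."""
    used = 0
    total = 0
    for doc in islice(docs, max_docs):
        used += 1
        # Assumption: gzip each document separately as a rough volume proxy.
        total += len(gzip.compress(doc.encode("utf-8")))
    return used, total


# Usage: short notation-like documents need a larger cap to reach a given volume.
short_docs = (f"X:{i}\nK:D\n|:DEF GAB:|\n" for i in range(5000))
used, total = compressed_bytes_under_cap(short_docs, max_docs=2048)
print(used, total)
```

Comparing the totals across slices shows why short PGN or ABC documents call for a higher document cap than longer prose documents.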

Add long-tail PPL gap rerun registry

dad6f61

dlwh added the agent-generated Created by automation/agent label Apr 22, 2026

claude Bot reviewed Apr 22, 2026


dlwh force-pushed the codex/long-tail-ppl-gap-reruns branch from 91f43b7 to f50cf8d April 22, 2026 20:35

Remove tautological long-tail eval assertions

a494858

dlwh force-pushed the codex/long-tail-ppl-gap-reruns branch from f50cf8d to a494858 April 22, 2026 20:50

Remove long-tail smoke gap experiment

a3764c9

dlwh mentioned this pull request Apr 22, 2026

[evals] Add game and music notation PPL slices #5062

Open

This was referenced Apr 22, 2026

[evals] Tokenizer comparison for whitespace-sensitive long-tail PPL slices #5079

Open

[evals] Add game / music runnable PPL slices (issue #5062) #5080

Open

dlwh merged commit 9c8fcad into main Apr 22, 2026
37 checks passed

dlwh deleted the codex/long-tail-ppl-gap-reruns branch April 22, 2026 21:30

dlwh merged 3 commits into main from codex/long-tail-ppl-gap-reruns
