[evals] Add long-tail PPL gap reruns#5075

Merged
dlwh merged 3 commits into main from
codex/long-tail-ppl-gap-reruns
Apr 22, 2026

@dlwh dlwh commented Apr 22, 2026

Add a long-tail PPL registry, runnable public HF-backed slices, and durable Marin-vs-baseline perplexity-gap runners. Raw HF eval datasets now preserve non-validation splits so SVG Stack and VerilogEval can be evaluated without cache materialization.

Part of #5005

@dlwh dlwh added the agent-generated label (Created by automation/agent) Apr 22, 2026

dlwh commented Apr 22, 2026

🤖 Specification for the long-tail PPL rerun PR.

Problem: the perplexity-gap runner could only read the hardcoded validation split from raw HF datasets, which made runnable public eval slices like SVG Stack val and VerilogEval test silently unusable without cache materialization. The checkpoint-confidence epic also needed a reusable registry for long-tail symbolic/raw artifact evals instead of one-off scripts.

Approach: RawTextEvaluationDataset now carries a split field, _to_dataset_component passes that split into HfDatasetSourceConfig, and Levanter's raw document iterator asks each DatasetComponent for its configured split. File-backed sources still require validation so existing path-based evals keep the old contract. The new long-tail registry records the full first-pass target map from #5005/#5056-#5062, while long_tail_ppl_runnable.py exposes the currently runnable HF-backed SVG Stack and VerilogEval slices.

Key code:

source = component.source.get_shard_source(component.split)
return DatasetComponent(source=source, format=dataset_format, tags=list(config.tags), split=config.split)
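
To illustrate the split threading described above, here is a minimal, self-contained sketch. It does not use the real Marin or Levanter classes: `HfSource`, `FileSource`, `RawDataset`, `Component`, `to_component`, and `iter_raw_docs` are hypothetical stand-ins, and the dataset id is made up. The sketch assumes only what the spec states: the dataset config carries a split, the component preserves it, the iterator asks for that split, and file-backed sources reject anything other than validation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HfSource:
    """Hypothetical stand-in for an HF-backed source; maps split name to documents."""
    dataset_id: str
    splits: dict[str, list[str]]

    def get_shard_source(self, split: str) -> list[str]:
        if split not in self.splits:
            raise KeyError(f"{self.dataset_id} has no split {split!r}")
        return self.splits[split]


@dataclass(frozen=True)
class FileSource:
    """Hypothetical stand-in for a path-backed source (validation only)."""
    docs: list[str]

    def get_shard_source(self, split: str) -> list[str]:
        return self.docs


@dataclass(frozen=True)
class RawDataset:
    """Stand-in for the eval dataset config; split defaults to the old behavior."""
    source: HfSource | FileSource
    split: str = "validation"
    tags: tuple[str, ...] = ()


@dataclass(frozen=True)
class Component:
    source: HfSource | FileSource
    split: str
    tags: list[str]


def to_component(config: RawDataset) -> Component:
    # File-backed sources keep the old contract: validation only.
    if isinstance(config.source, FileSource) and config.split != "validation":
        raise ValueError(
            f"file-backed sources only support 'validation', got {config.split!r}"
        )
    return Component(source=config.source, split=config.split, tags=list(config.tags))


def iter_raw_docs(component: Component):
    # Ask the component for its configured split instead of hardcoding "validation".
    yield from component.source.get_shard_source(component.split)


# Usage: a test-only dataset is readable once the split is preserved.
verilog = RawDataset(
    HfSource("example/verilog-eval", {"test": ["module m; endmodule"]}),
    split="test",
)
docs = list(iter_raw_docs(to_component(verilog)))
```

With a hardcoded validation split, the lookup for this test-only dataset would fail or yield nothing. Carrying the split on the component moves that decision to the dataset definition.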

Tests: tests/evals/test_long_tail_ppl.py checks registry coverage, deterministic raw paths/tags, runnable HF source ids/text keys/splits, smoke wrapper shape, and split preservation/rejection behavior. I also ran the long-tail runner in dry-run mode and uv run pytest tests/evals/test_long_tail_ppl.py tests/execution/test_executor.py -q.


claude Bot commented Apr 22, 2026

Claude finished @dlwh's task in 7m 4s.


Code review

Reviewed the PR for bugs and AGENTS.md compliance. The core change, threading a split field from RawTextEvaluationDataset through _to_dataset_component → DatasetComponent → iter_raw_text_documents's get_shard_source(component.split) call, looks correctly wired, and file-backed sources keep the old validation-only contract via an explicit ValueError in lib/marin/src/marin/evaluation/perplexity_gap.py.

Findings

  • 1 minor: tautological assertion in tests/evals/test_long_tail_ppl.py:103 — posted inline (#discussion_r3126641354).

No bugs, import errors, unresolved references, or dependency-direction violations found.

False positives considered and dismissed

  • default_model_perplexity_gap naming ("No default_* wrappers that obscure underlying mechanisms.") — the function is pre-existing and not introduced by this PR.
  • Magic number 32 at lib/levanter/src/levanter/analysis/perplexity_gap.py:238 — pre-existing; this PR only changes line 434 (the get_shard_source(component.split) call).
  • assert len(RUNNABLE_LONG_TAIL_PPL_SLICES) == 4 at tests/evals/test_long_tail_ppl.py:79 — paired with an explicit set-equality check on the following lines that already fully pins the registry contents, so it is a redundant-but-defensible sanity guard rather than a standalone tautology.

Branch: codex/long-tail-ppl-gap-reruns

Comment thread tests/evals/test_long_tail_ppl.py Outdated


def test_smoke_wrapper_uses_tiny_doc_cap_and_runnable_bundle():
    assert MAX_DOCS_PER_DATASET == 32

🤖 This line is a tautological test: MAX_DOCS_PER_DATASET is imported on L22 straight from the smoke module where it is literally MAX_DOCS_PER_DATASET = 32 (exp_model_perplexity_gap_long_tail_smoke.py#L10-L12), so the assertion only fires if someone edits that literal — it doesn't verify any behavior. The wiring check on L105 (STEP.config.max_docs_per_dataset == 32) already covers that the constant is threaded into the step.

This is the pattern explicitly called out in AGENTS.md under Testing ("Do not write tautological tests: tests must fail if behavior is wrong, not just if implementation changes") and LLM-Generated Code Pitfalls ("Tautological tests (type exists, constant has value)"). Suggest deleting this line.
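
As a rough illustration of the distinction drawn above, the sketch below contrasts a constant-restating assertion with a test that exercises behavior. Every name here (`StepConfig`, `build_step`, `take_capped`) is hypothetical and invented for the example. None of it is the PR's actual smoke module or step API.

```python
from dataclasses import dataclass

MAX_DOCS_PER_DATASET = 32  # hypothetical module-level constant


@dataclass(frozen=True)
class StepConfig:
    max_docs_per_dataset: int


def build_step(max_docs: int = MAX_DOCS_PER_DATASET) -> StepConfig:
    """Hypothetical builder that threads the cap into a step config."""
    return StepConfig(max_docs_per_dataset=max_docs)


def take_capped(docs, config: StepConfig) -> list:
    """Return at most config.max_docs_per_dataset documents, in order."""
    return list(docs)[: config.max_docs_per_dataset]


def test_constant_has_value():
    # Tautological: this only restates the literal above. It fails when
    # someone edits the number, not when the cap stops being applied.
    assert MAX_DOCS_PER_DATASET == 32


def test_cap_limits_documents_read():
    # Behavioral: fails if the cap is ignored or applied incorrectly.
    config = build_step()
    docs = [f"doc-{i}" for i in range(100)]
    assert len(take_capped(docs, config)) == config.max_docs_per_dataset
    # Inputs shorter than the cap pass through unchanged.
    assert take_capped(docs[:5], config) == docs[:5]
```

The second test keeps passing if the cap value is deliberately retuned, and breaks only if the capping itself regresses.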

🤖 Removed the imported constant assertion in 91f43b7. The smoke test now checks the runnable bundle and the step config doc cap directly.

@dlwh dlwh force-pushed the codex/long-tail-ppl-gap-reruns branch from 91f43b7 to f50cf8d Compare April 22, 2026 20:35

dlwh commented Apr 22, 2026

🤖 Updated after review: removed test_long_tail_registry_covers_expected_families_and_anchors, test_runnable_registry_is_separate_and_hf_backed, and test_smoke_wrapper_uses_tiny_doc_cap_and_runnable_bundle. The remaining tests cover nontrivial path/tag rendering, markdown rendering, filtering, and raw-dataset split behavior.

@dlwh dlwh force-pushed the codex/long-tail-ppl-gap-reruns branch from f50cf8d to a494858 Compare April 22, 2026 20:50

dlwh commented Apr 22, 2026

🤖 Removed test_family_filter_returns_only_requested_family as well. The remaining long-tail eval tests now avoid registry self-checks and focus on rendered dataset contracts plus raw dataset split behavior.


dlwh commented Apr 22, 2026

🤖 Removed the long-tail smoke experiment entrypoint and added an issue-linked module docstring to the remaining runnable long-tail gap experiment.

@dlwh dlwh merged commit 9c8fcad into main Apr 22, 2026
37 checks passed
@dlwh dlwh deleted the codex/long-tail-ppl-gap-reruns branch April 22, 2026 21:30
claude Bot pushed a commit that referenced this pull request Apr 22, 2026
Adds HF-backed runnable slices for the GAME_MUSIC family on top of the
long-tail PPL runnable registry introduced in #5075:

- lichess_pgn_2013_06: Icannos/lichess_games @ config 2013-06, text column
  carries full PGN (headers, movetext, NAGs, comments, result); CC0.
- irishman_abc: sander-wood/irishman validation split, column is the
  literal "abc notation" string; MIT / public domain.
- melodyhub_abc_input: sander-wood/melodyhub validation split, ABC body
  with task-tag prefixes so the gap report can separate
  plain-ABC PPL from ABC-with-structural-prefix PPL.

Also adds a focused pilot gap-report experiment
(exp_model_perplexity_gap_symbolic_notation_pilot.py) running Marin-8B
against Llama-3.1-8B, Qwen3-8B-Base, and Gemma-2-9b on just the
game/music slices, with max_docs_per_dataset=2048 so each slice's
compressed-byte volume stays roughly Paloma-sized (PGN/ABC docs are
much shorter than an average Paloma doc).

Humdrum/Kern has no clean HF mirror and remains as a stub in the
long_tail_ppl.py registry for a later download-step mirror.

Follow-up tokenizer-axis comparison filed as #5079.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
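
The commit message's sizing argument (a document cap chosen so compressed-byte volume stays comparable across slices) can be sanity-checked with a small helper. This is a sketch under stated assumptions: `capped_compressed_bytes` is a hypothetical name, zlib stands in for whatever compressor the gap report actually uses, and documents are compressed one at a time, which may not match the report's accounting.

```python
import zlib
from itertools import islice
from typing import Iterable


def capped_compressed_bytes(docs: Iterable[str], max_docs: int) -> tuple[int, int]:
    """Return (docs_used, total zlib-compressed bytes) for the first max_docs docs.

    Assumption: zlib is only a stand-in compressor for illustration.
    """
    used = 0
    total = 0
    for doc in islice(docs, max_docs):
        used += 1
        total += len(zlib.compress(doc.encode("utf-8")))
    return used, total


# Usage: short notation-style documents contribute few bytes each, so a
# larger document cap is needed to reach a given byte budget.
sample_docs = ["1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 1/2-1/2"] * 5
docs_used, total_bytes = capped_compressed_bytes(sample_docs, max_docs=3)
```

Running the same helper over a slice with much longer documents and comparing `total_bytes` at a fixed cap shows how far apart the byte volumes are.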