[evals] Add game / music runnable PPL slices (issue #5062) by claude[bot] · Pull Request #5080 · marin-community/marin

claude · 2026-04-22T21:20:23Z

Summary

Adds HF-backed runnable PPL slices for the GAME_MUSIC family from #5062, plus a focused pilot gap-report experiment. Builds on the long-tail PPL runnable registry from #5075 (merged into main).

New runnable slices (experiments/evals/long_tail_ppl_runnable.py)

| registry key | source | license | notes |
| --- | --- | --- | --- |
| `long_tail_ppl_runnable/game_music/lichess_pgn_2013_06` | Icannos/lichess_games config `2013-06` | CC0 | `text` column = full PGN (headers, movetext, NAGs, comments, result) |
| `long_tail_ppl_runnable/game_music/irishman_abc` | sander-wood/irishman `validation` | MIT / public domain | column is literal `abc notation` (with space); full ABC headers, bar lines, repeats, chord symbols, decorations |
| `long_tail_ppl_runnable/game_music/melodyhub_abc_input` | sander-wood/melodyhub `validation` | MIT | ABC with task-tag prefixes; second ABC flavour for structure vs. body PPL |

All three are HF-backed, so they follow the #5075 no-bulk-mirror pattern. Humdrum/Kern has no clean HF mirror and stays as a stub in long_tail_ppl.py for a future download-step mirror.
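As a rough illustration of what the table encodes, each slice could be described by a small record like the one below. This is a sketch only: `HfSliceSpec`, its field names, and the `SLICES` dict are hypothetical and are not the types used in long_tail_ppl_runnable.py. Only two entries are shown because the table does not name the melodyhub text column, and the Lichess split is left unset because it is not stated above.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical record type for illustration; not the registry's real class.
@dataclass(frozen=True)
class HfSliceSpec:
    hf_dataset: str                 # Hugging Face dataset id
    text_column: str                # column holding the raw document text
    license: str
    config: Optional[str] = None    # HF config name, if any
    split: Optional[str] = None     # HF split name, if stated


PREFIX = "long_tail_ppl_runnable/game_music"

SLICES = {
    f"{PREFIX}/lichess_pgn_2013_06": HfSliceSpec(
        hf_dataset="Icannos/lichess_games",
        text_column="text",
        license="CC0",
        config="2013-06",
    ),
    f"{PREFIX}/irishman_abc": HfSliceSpec(
        hf_dataset="sander-wood/irishman",
        text_column="abc notation",  # literal column name, space included
        license="MIT / public domain",
        split="validation",
    ),
}
```

Keeping the column name as data makes the `abc notation` quirk explicit instead of burying it in loader code.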

Pilot gap report (experiments/exp_model_perplexity_gap_symbolic_notation_pilot.py)

Marin-8B vs three references, scoped to the GAME_MUSIC family only (so SVG/Verilog do not drown out the symbolic-notation signal):

meta-llama/Llama-3.1-8B
Qwen/Qwen3-8B-Base
google/gemma-2-9b (distinct SentencePiece tokenizer — explicit whitespace handling)
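
Because these models use different tokenizers, per-token losses are not directly comparable. A common tokenizer-independent normalization is bits per byte. The sketch below is a generic illustration of that conversion, not the metric code used by the gap report; the function name and the example loss values are made up.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a document
    into bits per UTF-8 byte, which does not depend on the token count."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return total_nll_nats / (math.log(2) * num_bytes)


# Illustrative numbers only: the same text scored by two hypothetical models.
doc = "1. e4 e5 2. Nf3 Nc6"
n_bytes = len(doc.encode("utf-8"))
gap = bits_per_byte(30.0, n_bytes) - bits_per_byte(24.0, n_bytes)
```

A positive `gap` here means the first model spends more bits per byte on the same text, regardless of how either tokenizer segmented it.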

max_docs_per_dataset=2048 (up from the 256 used in exp_model_perplexity_gap_long_tail_runnable.py) to keep per-slice compressed-byte volume roughly Paloma-sized, per dlwh's note that PGN/ABC docs are shorter than average Paloma docs.
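
The reasoning behind the higher cap is simple arithmetic: for a fixed byte budget, the number of documents needed scales inversely with mean document size. The sketch below uses hypothetical sizes chosen only to show the relationship. The actual mean document sizes for Paloma, PGN, and ABC are not given here, and compression is ignored.

```python
import math


def docs_for_byte_budget(target_bytes: int, mean_doc_bytes: float) -> int:
    """Smallest document count whose expected total size reaches target_bytes."""
    if mean_doc_bytes <= 0:
        raise ValueError("mean_doc_bytes must be positive")
    return math.ceil(target_bytes / mean_doc_bytes)


# Hypothetical sizes, for illustration only (not measured values).
reference_mean_bytes = 4000   # assumed mean size of a longer reference doc
short_mean_bytes = 500        # assumed mean size of a short PGN/ABC doc
budget = 256 * reference_mean_bytes

needed = docs_for_byte_budget(budget, short_mean_bytes)  # 2048 under these assumptions
```

With documents assumed to be eight times shorter, the cap has to grow eightfold to keep the same byte volume.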

The DoD question ('do gaps concentrate in metadata headers / symbolic sequences / comments / numeric annotations?') is answered by GapReportBuilder's existing per-slice byte-bucket rollup — no post-hoc analysis here, just wiring.
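
To make the bucket question concrete, the sketch below splits a PGN document into rough byte buckets (headers, comments, numeric annotations, movetext) and counts the bytes in each. It is a toy classifier written for illustration, with hypothetical bucket names and regexes; it is not GapReportBuilder's rollup, which is not shown here.

```python
import re
from collections import Counter

# Toy patterns, tried in order; anything unmatched counts as movetext.
TOKEN_RE = re.compile(
    r"(?P<header>^\[[^\]\n]*\]$)"   # tag-pair line such as [Event "..."]
    r"|(?P<comment>\{[^}]*\})"      # brace comment
    r"|(?P<numeric>\$\d+|\d+\.+)",  # NAGs ($1) and move numbers (12. or 12...)
    re.MULTILINE,
)


def pgn_byte_buckets(pgn: str) -> Counter:
    """Count UTF-8 bytes per rough PGN bucket."""
    counts: Counter = Counter()
    pos = 0
    for m in TOKEN_RE.finditer(pgn):
        # Text between matches is treated as movetext (moves, spaces, result).
        counts["movetext"] += len(pgn[pos:m.start()].encode("utf-8"))
        counts[m.lastgroup] += len(m.group().encode("utf-8"))
        pos = m.end()
    counts["movetext"] += len(pgn[pos:].encode("utf-8"))
    return counts


sample = '[Event "Rated Blitz"]\n1. e4 {best} e5 $1 1-0\n'
buckets = pgn_byte_buckets(sample)
```

A real rollup would sum per-byte loss differences inside each bucket rather than raw byte counts, but the partitioning step looks similar.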

Follow-ups

Filed #5079 ([evals] Tokenizer comparison for whitespace-sensitive long-tail PPL slices) for the tokenizer-axis sweep over whitespace-sensitive slices (kern, ABC, PGN, Verilog).

Test plan

./infra/pre-commit.py --all-files --fix clean
uv run python -m pytest tests/evals/test_long_tail_ppl.py -v — 6 passed (4 existing + 2 new)
uv run python -c 'import experiments.exp_model_perplexity_gap_symbolic_notation_pilot' — pilot module loads, 3 game/music datasets registered, 3 MARIN_VS_* ExecutorSteps built
Not yet run: execute the pilot on a v5p-8 (requires TPU; out of scope for CI)

Closes #5062 once the pilot runs and the gap-report bucket breakdown is posted back to the issue.

Generated with Claude Code

dlwh · 2026-04-22T21:43:40Z

@claude retarget to main since 5075 landed

claude · 2026-04-22T21:44:22Z

Claude finished @dlwh's task in 4m 23s

Retarget PR to main — done

Verified PR already targeted main (base = main, unchanged)
Confirmed [evals] Add long-tail PPL gap reruns #5075 merged into main at 9c8fcad2a
Rebased branch onto origin/main: dropped the 3 commits that came from #5075 (now squashed on main), kept only the game/music work as a single commit c8887516f
Re-ran ./infra/pre-commit.py --all-files --fix — clean
Re-ran uv run pytest tests/evals/test_long_tail_ppl.py -v — 6 passed (4 existing + 2 new)
Re-verified import experiments.exp_model_perplexity_gap_symbolic_notation_pilot — 3 GAME_MUSIC datasets registered, 3 MARIN_VS_* ExecutorSteps (Llama/Qwen3/Gemma-2) built
Updated PR body to drop the stale "retarget to main if [evals] Add long-tail PPL gap reruns #5075 merges first" note
Force-pushed with --force-with-lease

Result

The PR base was already main: GitHub automatically retargets a dependent PR to the merged PR's base branch once the merged head branch is deleted. The real work here was rebasing the branch onto the now-updated main so the diff is clean. Post-rebase:

origin/main..HEAD → 1 commit, 3 files, +233 lines (was showing +992/-3 pre-rebase because [evals] Add long-tail PPL gap reruns #5075's content was double-counted before it landed on main).
Touched: experiments/evals/long_tail_ppl_runnable.py (+43, game/music slices), experiments/exp_model_perplexity_gap_symbolic_notation_pilot.py (+144, new pilot), tests/evals/test_long_tail_ppl.py (+46, 2 new tests).

How I tested

./infra/pre-commit.py --all-files --fix — all hooks green (Ruff, Black, license headers, Pyrefly, etc.)
uv run pytest tests/evals/test_long_tail_ppl.py -v -o "addopts=-m 'not tpu_ci and not slow and not integration'" — 6 passed in 7.55s. (Had to override addopts because the runner image doesn't ship --session-timeout support; the test set itself is unchanged.)
uv run python -c 'import experiments.exp_model_perplexity_gap_symbolic_notation_pilot as m; print(len(m.DATASETS), m.MARIN_VS_LLAMA.name, m.MARIN_VS_QWEN3.name, m.MARIN_VS_GEMMA2.name)' — loads cleanly, 3 datasets, 3 steps.
• Branch: claude/issue-5062-20260422-2112

Adds HF-backed runnable slices for the GAME_MUSIC family on top of the long-tail PPL runnable registry introduced in #5075:

- lichess_pgn_2013_06: Icannos/lichess_games @ config 2013-06, text column carries full PGN (headers, movetext, NAGs, comments, result); CC0.
- irishman_abc: sander-wood/irishman validation split, column is the literal "abc notation" string; MIT / public domain.
- melodyhub_abc_input: sander-wood/melodyhub validation split, ABC body with task-tag prefixes so the gap report can separate plain-ABC PPL from ABC-with-structural-prefix PPL.

Also adds a focused pilot gap-report experiment (exp_model_perplexity_gap_symbolic_notation_pilot.py) running Marin-8B against Llama-3.1-8B, Qwen3-8B-Base, and Gemma-2-9b on just the game/music slices, with max_docs_per_dataset=2048 so each slice's compressed-byte volume stays roughly Paloma-sized (PGN/ABC docs are much shorter than an average Paloma doc).

Humdrum/Kern has no clean HF mirror and remains as a stub in the long_tail_ppl.py registry for a later download-step mirror.

Follow-up tokenizer-axis comparison filed as #5079.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>

claude Bot added the agent-generated Created by automation/agent label Apr 22, 2026

claude Bot mentioned this pull request Apr 22, 2026

[evals] Add game and music notation PPL slices #5062

Open

Base automatically changed from codex/long-tail-ppl-gap-reruns to main April 22, 2026 21:30

claude Bot force-pushed the claude/issue-5062-20260422-2112 branch from be597f7 to c888751 Compare April 22, 2026 21:47

claude[bot] wants to merge 1 commit into main from claude/issue-5062-20260422-2112
