[evals] Add game / music runnable PPL slices (issue #5062) #5080

Open
claude[bot] wants to merge 1 commit into main from claude/issue-5062-20260422-2112

claude Bot commented Apr 22, 2026

Summary

Adds HF-backed runnable PPL slices for the GAME_MUSIC family from #5062, plus a focused pilot gap-report experiment. Builds on the long-tail PPL runnable registry from #5075 (merged into main).

New runnable slices (experiments/evals/long_tail_ppl_runnable.py)

| registry key | source | license | notes |
| --- | --- | --- | --- |
| long_tail_ppl_runnable/game_music/lichess_pgn_2013_06 | Icannos/lichess_games, config 2013-06 | CC0 | text column = full PGN (headers, movetext, NAGs, comments, result) |
| long_tail_ppl_runnable/game_music/irishman_abc | sander-wood/irishman, validation | MIT / public domain | column is literal abc notation (with space); full ABC headers, bar lines, repeats, chord symbols, decorations |
| long_tail_ppl_runnable/game_music/melodyhub_abc_input | sander-wood/melodyhub, validation | MIT | ABC with task-tag prefixes; second ABC flavour for structure vs. body PPL |

All three are HF-backed, so they follow the #5075 no-bulk-mirror pattern. Humdrum/Kern has no clean HF mirror and stays as a stub in long_tail_ppl.py for a future download-step mirror.
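
For readers who want the table in code form, here is a minimal sketch of how the three slices could be described. `HfSliceSpec` and `GAME_MUSIC_SLICES` are hypothetical names for illustration, not the types used in long_tail_ppl_runnable.py, and fields the PR text does not state are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HfSliceSpec:
    """Hypothetical description of one HF-backed slice (illustration only)."""

    hf_dataset_id: str
    license: str
    config: Optional[str] = None       # HF config name, if the table gives one
    split: Optional[str] = None        # HF split name, if the table gives one
    text_column: Optional[str] = None  # column holding the raw document text


PREFIX = "long_tail_ppl_runnable/game_music"

# Values are copied from the table above; None means "not stated in the PR".
GAME_MUSIC_SLICES = {
    f"{PREFIX}/lichess_pgn_2013_06": HfSliceSpec(
        "Icannos/lichess_games", "CC0", config="2013-06", text_column="text"
    ),
    f"{PREFIX}/irishman_abc": HfSliceSpec(
        "sander-wood/irishman",
        "MIT / public domain",
        split="validation",
        text_column="abc notation",  # literal column name, with the space
    ),
    f"{PREFIX}/melodyhub_abc_input": HfSliceSpec(
        "sander-wood/melodyhub", "MIT", split="validation"
    ),
}
```

Keeping only dataset coordinates in the registry, and no copied data, is what the no-bulk-mirror pattern amounts to in this sketch.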

Pilot gap report (experiments/exp_model_perplexity_gap_symbolic_notation_pilot.py)

Marin-8B vs three references, scoped to the GAME_MUSIC family only (so SVG/Verilog do not drown out the symbolic-notation signal):

  • meta-llama/Llama-3.1-8B
  • Qwen/Qwen3-8B-Base
  • google/gemma-2-9b (distinct SentencePiece tokenizer — explicit whitespace handling)
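
Because the four models use different tokenizers, per-token perplexity is not directly comparable across them; a byte-normalized loss is. The sketch below shows one common normalization, bits per byte, and is an assumption about the general approach rather than a copy of what GapReportBuilder computes. The function names are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per UTF-8 byte."""
    if total_utf8_bytes <= 0:
        raise ValueError("total_utf8_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_utf8_bytes)


def bpb_gap(nll_model_a: float, nll_model_b: float, total_utf8_bytes: int) -> float:
    """Gap in bits per byte between two models scored on the same raw text.

    Positive means model A spends more bits on the text than model B.
    """
    return bits_per_byte(nll_model_a, total_utf8_bytes) - bits_per_byte(
        nll_model_b, total_utf8_bytes
    )
```

Both models must be scored on the same underlying bytes for the gap to be meaningful; the token counts may differ freely.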

max_docs_per_dataset=2048 (up from the 256 used in exp_model_perplexity_gap_long_tail_runnable.py) to keep per-slice compressed-byte volume roughly Paloma-sized, per dlwh's note that PGN/ABC docs are shorter than average Paloma docs.
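
A quick way to sanity-check a document cap is to measure the compressed size of the capped slice. The sketch below is illustrative only: the PR does not say which compressor or aggregation defines "compressed-byte volume", so zlib with per-document compression is an assumption, and `compressed_bytes` is a hypothetical helper.

```python
import zlib
from typing import Iterable


def compressed_bytes(docs: Iterable[str], max_docs: int) -> int:
    """Sum of zlib-compressed sizes of the first max_docs documents.

    Assumption: each document is compressed on its own; a different
    compressor or concatenating first would give different totals.
    """
    total = 0
    for index, doc in enumerate(docs):
        if index >= max_docs:
            break
        total += len(zlib.compress(doc.encode("utf-8")))
    return total
```

Comparing `compressed_bytes(slice_docs, 256)` with `compressed_bytes(slice_docs, 2048)` on a slice of short PGN or ABC documents shows how much volume the larger cap adds.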

The DoD question ('do gaps concentrate in metadata headers / symbolic sequences / comments / numeric annotations?') is answered by GapReportBuilder's existing per-slice byte-bucket rollup — no post-hoc analysis here, just wiring.
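
To make the bucket idea concrete for PGN, here is a rough sketch that assigns every byte of a game to one of the four categories in the DoD question. It is not GapReportBuilder's implementation; the bucket names, the regular expressions, and `bucket_pgn_bytes` are all assumptions chosen for illustration.

```python
import re
from collections import Counter

DEFAULT_BUCKET = "symbolic_sequence"  # movetext, results, whitespace

# Applied in order; later patterns overwrite earlier ones, so headers win
# over comments, and comments win over numeric annotations.
_PATTERNS = [
    ("numeric_annotation", re.compile(r"\$\d+|\d+\.(?:\.\.)?")),  # NAGs, move numbers
    ("comment", re.compile(r"\{[^}]*\}")),                        # {brace comments}
    ("metadata_header", re.compile(r"^\[[^\]\n]*\]$", re.MULTILINE)),  # [Tag "value"]
]


def bucket_pgn_bytes(pgn: str) -> Counter:
    """Count UTF-8 bytes of a PGN string per hypothetical bucket."""
    labels = [DEFAULT_BUCKET] * len(pgn)
    for name, pattern in _PATTERNS:
        for match in pattern.finditer(pgn):
            span = match.end() - match.start()
            labels[match.start():match.end()] = [name] * span
    counts: Counter = Counter()
    for char, label in zip(pgn, labels):
        counts[label] += len(char.encode("utf-8"))
    return counts
```

Summing per-byte losses within each bucket, for each model, would then show where the gap concentrates.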

Follow-ups

  • Tokenizer-axis comparison filed as #5079.

Test plan

  • ./infra/pre-commit.py --all-files --fix clean
  • uv run python -m pytest tests/evals/test_long_tail_ppl.py -v — 6 passed (4 existing + 2 new)
  • uv run python -c 'import experiments.exp_model_perplexity_gap_symbolic_notation_pilot' — pilot module loads, 3 game/music datasets registered, 3 MARIN_VS_* ExecutorSteps built
  • Actually execute the pilot on a v5p-8 (requires TPU; out of scope for CI)

Closes #5062 once the pilot runs and the gap-report bucket breakdown is posted back to the issue.

Generated with Claude Code

claude Bot added the agent-generated label (Created by automation/agent) Apr 22, 2026
Base automatically changed from codex/long-tail-ppl-gap-reruns to main April 22, 2026 21:30

dlwh commented Apr 22, 2026

@claude retarget to main since 5075 landed

claude Bot commented Apr 22, 2026

Claude finished @dlwh's task in 4m 23s


Retarget PR to main — done

  • Verified PR already targeted main (base = main, unchanged)
  • Confirmed #5075 ([evals] Add long-tail PPL gap reruns) merged into main at 9c8fcad2a
  • Rebased branch onto origin/main: dropped the 3 now-squashed commits from #5075, kept only the game/music work as a single commit c8887516f
  • Re-ran ./infra/pre-commit.py --all-files --fix — clean
  • Re-ran uv run pytest tests/evals/test_long_tail_ppl.py -v: 6 passed (4 existing + 2 new)
  • Re-verified import experiments.exp_model_perplexity_gap_symbolic_notation_pilot — 3 GAME_MUSIC datasets registered, 3 MARIN_VS_* ExecutorSteps (Llama/Qwen3/Gemma-2) built
  • Updated PR body to drop the stale "retarget to main if #5075 merges first" note
  • Force-pushed with --force-with-lease

Result

The PR base was already main: GitHub retargeted it automatically when the base PR, #5075, landed. The real work here was rebasing the branch onto the updated main so the diff is clean. Post-rebase:

  • origin/main..HEAD: 1 commit, 3 files, +233 lines (was showing +992/-3 pre-rebase because #5075's content was double-counted before it landed on main).
  • Touched: experiments/evals/long_tail_ppl_runnable.py (+43, game/music slices), experiments/exp_model_perplexity_gap_symbolic_notation_pilot.py (+144, new pilot), tests/evals/test_long_tail_ppl.py (+46, 2 new tests).

How I tested

  • ./infra/pre-commit.py --all-files --fix — all hooks green (Ruff, Black, license headers, Pyrefly, etc.)
  • uv run pytest tests/evals/test_long_tail_ppl.py -v -o "addopts=-m 'not tpu_ci and not slow and not integration'" — 6 passed in 7.55s. (Had to override addopts because the runner image doesn't ship --session-timeout support; the test set itself is unchanged.)
  • uv run python -c 'import experiments.exp_model_perplexity_gap_symbolic_notation_pilot as m; print(len(m.DATASETS), m.MARIN_VS_LLAMA.name, m.MARIN_VS_QWEN3.name, m.MARIN_VS_GEMMA2.name)' — loads cleanly, 3 datasets, 3 steps.
    • Branch: claude/issue-5062-20260422-2112

Adds HF-backed runnable slices for the GAME_MUSIC family on top of the
long-tail PPL runnable registry introduced in #5075:

- lichess_pgn_2013_06: Icannos/lichess_games @ config 2013-06, text column
  carries full PGN (headers, movetext, NAGs, comments, result); CC0.
- irishman_abc: sander-wood/irishman validation split, column is the
  literal "abc notation" string; MIT / public domain.
- melodyhub_abc_input: sander-wood/melodyhub validation split, ABC body
  with task-tag prefixes so the gap report can separate
  plain-ABC PPL from ABC-with-structural-prefix PPL.

Also adds a focused pilot gap-report experiment
(exp_model_perplexity_gap_symbolic_notation_pilot.py) running Marin-8B
against Llama-3.1-8B, Qwen3-8B-Base, and Gemma-2-9b on just the
game/music slices, with max_docs_per_dataset=2048 so each slice's
compressed-byte volume stays roughly Paloma-sized (PGN/ABC docs are
much shorter than an average Paloma doc).

Humdrum/Kern has no clean HF mirror and remains as a stub in the
long_tail_ppl.py registry for a later download-step mirror.

Follow-up tokenizer-axis comparison filed as #5079.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
claude Bot force-pushed the claude/issue-5062-20260422-2112 branch from be597f7 to c888751 on April 22, 2026 21:47

Development

Successfully merging this pull request may close these issues.

[evals] Add game and music notation PPL slices
