[evals] Add game / music runnable PPL slices (issue #5062) #5080
Open
claude[bot] wants to merge 1 commit into main
dlwh (Member):
@claude retarget to main since 5075 landed
claude[bot] (Contributor, Author):
Claude finished @dlwh's task in 4m 23s. Retarget PR to main: done
Result The PR base was already
How I tested
Adds HF-backed runnable slices for the GAME_MUSIC family on top of the long-tail PPL runnable registry introduced in #5075:

- lichess_pgn_2013_06: Icannos/lichess_games @ config 2013-06, text column carries full PGN (headers, movetext, NAGs, comments, result); CC0.
- irishman_abc: sander-wood/irishman validation split, column is the literal "abc notation" string; MIT / public domain.
- melodyhub_abc_input: sander-wood/melodyhub validation split, ABC body with task-tag prefixes so the gap report can separate plain-ABC PPL from ABC-with-structural-prefix PPL.

Also adds a focused pilot gap-report experiment (exp_model_perplexity_gap_symbolic_notation_pilot.py) running Marin-8B against Llama-3.1-8B, Qwen3-8B-Base, and Gemma-2-9b on just the game/music slices, with max_docs_per_dataset=2048 so each slice's compressed-byte volume stays roughly Paloma-sized (PGN/ABC docs are much shorter than an average Paloma doc).

Humdrum/Kern has no clean HF mirror and remains as a stub in the long_tail_ppl.py registry for a later download-step mirror. Follow-up tokenizer-axis comparison filed as #5079.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
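As a rough illustration of the three slices listed in the commit message, the sketch below records each slice's Hugging Face dataset, config or split, text column, and license note in a small lookup table. `SliceSpec`, `GAME_MUSIC_SLICES`, and `slice_key` are hypothetical names chosen for this example and are not the registry API in long_tail_ppl_runnable.py. Fields the commit message does not state (for example the melodyhub text column) are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SliceSpec:
    """Hypothetical record for one HF-backed PPL slice (not the repo's actual type)."""

    hf_dataset: str
    config: Optional[str]        # HF config name, if the commit message names one
    split: Optional[str]         # HF split, if the commit message names one
    text_column: Optional[str]   # None where the commit message does not name it
    license_note: Optional[str]  # None where the commit message does not state it


# Values copied from the commit message; nothing here is read from the real registry.
GAME_MUSIC_SLICES = {
    "lichess_pgn_2013_06": SliceSpec(
        "Icannos/lichess_games", "2013-06", None, "text", "CC0"
    ),
    "irishman_abc": SliceSpec(
        "sander-wood/irishman", None, "validation", "abc notation", "MIT / public domain"
    ),
    "melodyhub_abc_input": SliceSpec(
        "sander-wood/melodyhub", None, "validation", None, None
    ),
}


def slice_key(name: str, family: str = "game_music") -> str:
    """Build a key in the long_tail_ppl_runnable/<family>/<name> shape."""
    if name not in GAME_MUSIC_SLICES:
        raise KeyError(f"unknown slice: {name}")
    return f"long_tail_ppl_runnable/{family}/{name}"
```

Calling `slice_key("irishman_abc")` yields a key in the same shape as the slice names used in the PR description. Note that the irishman column name contains a space, so it has to be passed as a quoted string wherever it is used.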
be597f7 to c888751
Summary
Adds HF-backed runnable PPL slices for the `GAME_MUSIC` family from #5062, plus a focused pilot gap-report experiment. Builds on the long-tail PPL runnable registry from #5075 (merged into `main`).

New runnable slices (`experiments/evals/long_tail_ppl_runnable.py`)

| Slice | Config / split | Notes |
| --- | --- | --- |
| `long_tail_ppl_runnable/game_music/lichess_pgn_2013_06` | `2013-06` | `text` column = full PGN (headers, movetext, NAGs, comments, result) |
| `long_tail_ppl_runnable/game_music/irishman_abc` | `validation` | `abc notation` (with space); full ABC headers, bar lines, repeats, chord symbols, decorations |
| `long_tail_ppl_runnable/game_music/melodyhub_abc_input` | `validation` | |

All three are HF-backed, so they follow the #5075 no-bulk-mirror pattern. Humdrum/Kern has no clean HF mirror and stays as a stub in `long_tail_ppl.py` for a future download-step mirror.

Pilot gap report (`experiments/exp_model_perplexity_gap_symbolic_notation_pilot.py`)
Marin-8B vs three references, scoped to the `GAME_MUSIC` family only (so SVG/Verilog do not drown out the symbolic-notation signal):

- `meta-llama/Llama-3.1-8B`
- `Qwen/Qwen3-8B-Base`
- `google/gemma-2-9b` (distinct SentencePiece tokenizer, explicit whitespace handling)

`max_docs_per_dataset=2048` (up from the 256 used in `exp_model_perplexity_gap_long_tail_runnable.py`) to keep per-slice compressed-byte volume roughly Paloma-sized, per dlwh's note that PGN/ABC docs are shorter than average Paloma docs.

The DoD question ('do gaps concentrate in metadata headers / symbolic sequences / comments / numeric annotations?') is answered by `GapReportBuilder`'s existing per-slice byte-bucket rollup; there is no post-hoc analysis here, just wiring.

Follow-ups
Test plan
- `./infra/pre-commit.py --all-files --fix`: clean
- `uv run python -m pytest tests/evals/test_long_tail_ppl.py -v`: 6 passed (4 existing + 2 new)
- `uv run python -c 'import experiments.exp_model_perplexity_gap_symbolic_notation_pilot'`: pilot module loads, 3 game/music datasets registered, 3 `MARIN_VS_*` ExecutorSteps built

Closes #5062 once the pilot runs and the gap-report bucket breakdown is posted back to the issue.
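To make the four DoD buckets concrete, here is a minimal sketch that splits a PGN string into metadata headers, comments, numeric annotations (NAGs), and everything else, then counts UTF-8 bytes per bucket. It is only an assumption about what such a bucketing could look like. It is not GapReportBuilder's actual rollup, and `bucket_pgn_bytes` and the regexes are hypothetical.

```python
import re
from collections import Counter

# Hypothetical bucket patterns, tried in order at each position of the input.
_PATTERNS = [
    ("metadata_header", re.compile(r'\[[A-Za-z0-9_]+ "[^"]*"\]')),  # e.g. [Event "..."]
    ("comment", re.compile(r"\{[^}]*\}")),                          # e.g. { ... }
    ("numeric_annotation", re.compile(r"\$\d+")),                   # NAGs such as $1
]


def bucket_pgn_bytes(pgn: str) -> Counter:
    """Count UTF-8 bytes of `pgn` per bucket; unmatched text is 'symbolic_sequence'."""
    counts = Counter()
    pos = 0
    while pos < len(pgn):
        for name, pattern in _PATTERNS:
            match = pattern.match(pgn, pos)
            if match:
                counts[name] += len(match.group().encode("utf-8"))
                pos = match.end()
                break
        else:
            # Movetext, move numbers, results, and whitespace all land here.
            counts["symbolic_sequence"] += len(pgn[pos].encode("utf-8"))
            pos += 1
    return counts
```

Because every character is assigned to exactly one bucket, the bucket totals sum to the document's byte length, which is the property a per-bucket gap breakdown would need. A real rollup would also have to handle nested variations and line comments, which this sketch ignores.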
Generated with Claude Code