Conversation
|
🤖 Specification for the long-tail PPL rerun PR. Problem: the perplexity-gap runner could only read the hardcoded Approach: Key code: source = component.source.get_shard_source(component.split)return DatasetComponent(source=source, format=dataset_format, tags=list(config.tags), split=config.split)Tests: |
|
Claude finished @dlwh's task in 7m 4s —— View job Code reviewReviewed the PR for bugs and AGENTS.md compliance. The core change — threading a Findings
No bugs, import errors, unresolved references, or dependency-direction violations found. False positives considered and dismissed
|
|
|
||
|
|
||
| def test_smoke_wrapper_uses_tiny_doc_cap_and_runnable_bundle(): | ||
| assert MAX_DOCS_PER_DATASET == 32 |
There was a problem hiding this comment.
🤖 This line is a tautological test: MAX_DOCS_PER_DATASET is imported on L22 straight from the smoke module where it is literally MAX_DOCS_PER_DATASET = 32 (exp_model_perplexity_gap_long_tail_smoke.py#L10-L12), so the assertion only fires if someone edits that literal — it doesn't verify any behavior. The wiring check on L105 (STEP.config.max_docs_per_dataset == 32) already covers that the constant is threaded into the step.
This is the pattern explicitly called out in AGENTS.md under Testing ("Do not write tautological tests: tests must fail if behavior is wrong, not just if implementation changes") and LLM-Generated Code Pitfalls ("Tautological tests (type exists, constant has value)"). Suggest deleting this line.
There was a problem hiding this comment.
🤖 Removed the imported constant assertion in 91f43b7. The smoke test now checks the runnable bundle and the step config doc cap directly.
91f43b7 to
f50cf8d
Compare
|
🤖 Updated after review: removed |
f50cf8d to
a494858
Compare
|
🤖 Removed |
|
🤖 Removed the long-tail smoke experiment entrypoint and added an issue-linked module docstring to the remaining runnable long-tail gap experiment. |
Adds HF-backed runnable slices for the GAME_MUSIC family on top of the long-tail PPL runnable registry introduced in #5075: - lichess_pgn_2013_06: Icannos/lichess_games @ config 2013-06, text column carries full PGN (headers, movetext, NAGs, comments, result); CC0. - irishman_abc: sander-wood/irishman validation split, column is the literal "abc notation" string; MIT / public domain. - melodyhub_abc_input: sander-wood/melodyhub validation split, ABC body with task-tag prefixes so the gap report can separate plain-ABC PPL from ABC-with-structural-prefix PPL. Also adds a focused pilot gap-report experiment (exp_model_perplexity_gap_symbolic_notation_pilot.py) running Marin-8B against Llama-3.1-8B, Qwen3-8B-Base, and Gemma-2-9b on just the game/music slices, with max_docs_per_dataset=2048 so each slice's compressed-byte volume stays roughly Paloma-sized (PGN/ABC docs are much shorter than an average Paloma doc). Humdrum/Kern has no clean HF mirror and remains as a stub in the long_tail_ppl.py registry for a later download-step mirror. Follow-up tokenizer-axis comparison filed as #5079. Co-authored-by: David Hall <dlwh@users.noreply.github.com>
Add a long-tail PPL registry, runnable public HF-backed slices, and durable Marin-vs-baseline perplexity-gap runners. Raw HF eval datasets now preserve non-validation splits so SVG Stack and VerilogEval can be evaluated without cache materialization.
Part of #5005