[evals] Add raw web/markup/image-text PPL registration helper#5084
[evals] Add raw web/markup/image-text PPL registration helper#5084
Conversation
…5056) Skeleton for #5056 (parent #5005). Adds an empty aggregator module for surface-preserving PPL eval slices (HTML, WARC/WAT, web tables, SVG XML, OCR strings, captions, alt-text) and wires it into default_raw_validation_sets() so follow-up PRs can register downloaders without touching the eval runner. Slices oversplit by surface form; see module docstring for the design rationale. Co-authored-by: David Hall <dlwh@users.noreply.github.com>
|
Claude finished @dlwh's task in 2m 59s —— View job Review
Behavioral correctness is fine — the wiring into Main concern: empty skeleton vs. land-with-first-downloaderThe PR is a 70-line registration point that currently registers nothing. CLAUDE.md is explicit:
The stated justification — "keep perplexity-gap runners unchanged while follow-up downloader PRs register surface-form slices incrementally" — isn't very load-bearing: the wiring in I'd suggest closing this PR and folding its content into the first downloader PR, so reviewers can see the registration point exercised by at least one real entry. If you do keep it as a skeleton, the module should shrink substantially (see inline comments). Secondary: comments / docstrings violate CLAUDE.mdMultiple places reference the current task / parent issue numbers and describe hypothetical future PRs. CLAUDE.md:
Specifically:
Typing
Minor:
|
| # Copyright The Marin Authors | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """ | ||
| #5056: Raw web, markup, and image-text PPL slices. | ||
|
|
||
| Parent: #5005. | ||
|
|
||
| Byte-level perplexity-gap eval slices that preserve surface syntax normally | ||
| stripped by cleaned web corpora (HTML, WARC/WAT metadata, web tables, SVG XML, | ||
| OCR strings, captions, alt-text, EXIF-like metadata, URL-heavy records). The | ||
| goal is to surface perplexity-gap buckets that cleaned-corpus slices | ||
| (Paloma / uncheatable-eval) hide. | ||
|
|
||
| Targets (tracked in #5056): | ||
| - Common Crawl WARC/WAT: raw HTTP headers, raw HTML, WAT JSON. | ||
| - Web Data Commons Web Tables: raw <table> HTML plus extracted JSON metadata. | ||
| - SVG-Stack: SVG XML programs and captions. | ||
| - TextOCR / OCR-VQA: OCR strings and scene-text annotations. | ||
| - LAION metadata: URL / alt-text / EXIF-like fields — deferred pending | ||
| explicit subset selection and safety-filter review. | ||
|
|
||
| Per #5056 design review, slices are oversplit by surface form (one entry per | ||
| surface form, e.g. `raw_web_markup/cc_warc_html`, `raw_web_markup/cc_wat_json`) | ||
| so the gap-finder bucket analysis in `marin/evaluation/perplexity_gap.py` | ||
| stays clean. Grouping happens post-hoc via tags. | ||
|
|
||
| This module is intentionally a registration point: downloaders land in | ||
| follow-up PRs and populate ``ACTIVE_RAW_WEB_MARKUP_DATASETS``. The aggregator | ||
| ``raw_web_markup_raw_validation_sets()`` is wired into | ||
| ``experiments/defaults.py::default_raw_validation_sets()`` so new slices flow | ||
| into ``exp_model_perplexity_gap_marin_vs_llama.py`` and its siblings without | ||
| touching any other file. | ||
| """ | ||
|
|
||
| import posixpath | ||
| from typing import Any | ||
|
|
||
| # Prefix applied to every slice name so the gap-finder report groups the new | ||
| # surface-preserving slices together. Top-level constant per CLAUDE.md. | ||
| RAW_WEB_MARKUP_PREFIX = "raw_web_markup" | ||
|
|
||
| # Populated by follow-up PRs. Keys are slice names relative to | ||
| # ``RAW_WEB_MARKUP_PREFIX`` (e.g. ``cc_warc_html``); values are | ||
| # ``RawTextEvaluationDataset`` instances produced via | ||
| # ``marin.evaluation.perplexity_gap.raw_text_dataset``. | ||
| # | ||
| # Convention: each source contributes one entry per surface form. Do not | ||
| # concatenate surfaces into a single ``text`` stream — the gap-finder truncates | ||
| # each doc to ``max_doc_bytes=32_768`` and reports per-slice bpb, so mixing | ||
| # surfaces inside a slice loses the signal we want. | ||
| ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, Any] = {} | ||
|
|
||
|
|
||
| def raw_web_markup_raw_validation_sets() -> dict[str, Any]: | ||
| """Return raw-text eval slices covering web markup and image-adjacent text. | ||
|
|
||
| Slice names are prefixed with :data:`RAW_WEB_MARKUP_PREFIX`. Returns an | ||
| empty mapping until downloaders land in the follow-up PRs tracked by | ||
| #5056; callers should treat an empty result as "no raw-web-markup slices | ||
| are registered yet", not as an error. | ||
| """ | ||
| return { | ||
| posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset | ||
| for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items() | ||
| } |
There was a problem hiding this comment.
🤖 Empty registration point — consider landing with the first downloader instead.
The module doesn't do anything yet: ACTIVE_RAW_WEB_MARKUP_DATASETS is {}, so raw_web_markup_raw_validation_sets() always returns {}, and the .update({}) call in defaults.py is a no-op.
CLAUDE.md is explicit on this:
Don't add features, refactor, or introduce abstractions beyond what the task requires. […] Don't design for hypothetical future requirements. […] No half-finished implementations either.
and:
Delete dead code: unused parameters, stale options, old experiments.
The argument for this PR is "wire up the registration point now so follow-ups can register without touching defaults.py". But the wiring is two lines — one import, one .update(...). That's not reviewer-friction worth saving by pre-landing a 70-line skeleton. The follow-up PRs would touch defaults.py anyway only if this PR weren't there, which is fine. I'd rather see this module appear in the same PR as the first downloader, so it's reviewable against concrete usage (you can see at least one populated entry, confirm the prefix / naming convention matches the gap-finder's expectations, etc.).
If you do want to keep a scaffold, it should be as minimal as possible (no module-level mutable ACTIVE_… dict, no posixpath.join over an empty mapping, no docstring paragraphs that will rot before they're used).
| """ | ||
| #5056: Raw web, markup, and image-text PPL slices. | ||
|
|
||
| Parent: #5005. | ||
|
|
||
| Byte-level perplexity-gap eval slices that preserve surface syntax normally | ||
| stripped by cleaned web corpora (HTML, WARC/WAT metadata, web tables, SVG XML, | ||
| OCR strings, captions, alt-text, EXIF-like metadata, URL-heavy records). The | ||
| goal is to surface perplexity-gap buckets that cleaned-corpus slices | ||
| (Paloma / uncheatable-eval) hide. | ||
|
|
||
| Targets (tracked in #5056): | ||
| - Common Crawl WARC/WAT: raw HTTP headers, raw HTML, WAT JSON. | ||
| - Web Data Commons Web Tables: raw <table> HTML plus extracted JSON metadata. | ||
| - SVG-Stack: SVG XML programs and captions. | ||
| - TextOCR / OCR-VQA: OCR strings and scene-text annotations. | ||
| - LAION metadata: URL / alt-text / EXIF-like fields — deferred pending | ||
| explicit subset selection and safety-filter review. | ||
|
|
||
| Per #5056 design review, slices are oversplit by surface form (one entry per | ||
| surface form, e.g. `raw_web_markup/cc_warc_html`, `raw_web_markup/cc_wat_json`) | ||
| so the gap-finder bucket analysis in `marin/evaluation/perplexity_gap.py` | ||
| stays clean. Grouping happens post-hoc via tags. | ||
|
|
||
| This module is intentionally a registration point: downloaders land in | ||
| follow-up PRs and populate ``ACTIVE_RAW_WEB_MARKUP_DATASETS``. The aggregator | ||
| ``raw_web_markup_raw_validation_sets()`` is wired into | ||
| ``experiments/defaults.py::default_raw_validation_sets()`` so new slices flow | ||
| into ``exp_model_perplexity_gap_marin_vs_llama.py`` and its siblings without | ||
| touching any other file. | ||
| """ |
There was a problem hiding this comment.
🤖 Docstring references tracking issues that will rot.
CLAUDE.md:
Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123"), since those belong in the PR description and rot as the codebase evolves.
This docstring opens with #5056: … / Parent: #5005, then refers back to "#5056 design review", "tracked in #5056", and "follow-up PRs". Once those issues are closed or renumbered, the docstring becomes misleading.
A module docstring should describe what the module is in the current codebase (one or two sentences about surface-preserving PPL slices, the prefix convention, and where the dataset values come from). Targets and rationale belong in the issue, not here.
| # Prefix applied to every slice name so the gap-finder report groups the new | ||
| # surface-preserving slices together. Top-level constant per CLAUDE.md. | ||
| RAW_WEB_MARKUP_PREFIX = "raw_web_markup" |
There was a problem hiding this comment.
🤖 Drop "per CLAUDE.md" meta-comment.
# Top-level constant per CLAUDE.md. explains why the code is written this way in this repo, not anything about the constant itself. Style-guide adherence is not a code comment; a future reader doesn't need (or want) a citation of the style guide inline with the declaration. Just:
RAW_WEB_MARKUP_PREFIX = "raw_web_markup"The name and value are self-explanatory; no comment needed.
| # Populated by follow-up PRs. Keys are slice names relative to | ||
| # ``RAW_WEB_MARKUP_PREFIX`` (e.g. ``cc_warc_html``); values are | ||
| # ``RawTextEvaluationDataset`` instances produced via | ||
| # ``marin.evaluation.perplexity_gap.raw_text_dataset``. | ||
| # | ||
| # Convention: each source contributes one entry per surface form. Do not | ||
| # concatenate surfaces into a single ``text`` stream — the gap-finder truncates | ||
| # each doc to ``max_doc_bytes=32_768`` and reports per-slice bpb, so mixing | ||
| # surfaces inside a slice loses the signal we want. | ||
| ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, Any] = {} |
There was a problem hiding this comment.
🤖 Type the mapping concretely and trim the comment.
Two issues:
-
The docstring above says "values are
RawTextEvaluationDatasetinstances produced viamarin.evaluation.perplexity_gap.raw_text_dataset." The type annotationdict[str, Any]throws that signal away. Just importRawTextEvaluationDatasetand annotatedict[str, RawTextEvaluationDataset]. That gives the type checker and readers the real contract — and, importantly, if the first downloader PR sticks a rawstrpath in here by mistake, pyrefly catches it. -
The "Populated by follow-up PRs" / "Convention: each source contributes one entry per surface form…" commentary is explaining what a not-yet-written caller should do. That rationale belongs in the issue. The enforceable invariant ("one
RawTextEvaluationDatasetper surface form; do not concatenate surfaces within a slice") can live in the function docstring if it must, not as a comment on an empty dict.
Typed form (assuming you keep the skeleton — see top-level comment for the "maybe just delete" argument):
from marin.evaluation.perplexity_gap import RawTextEvaluationDataset
ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, RawTextEvaluationDataset] = {}| import posixpath | ||
| from typing import Any |
There was a problem hiding this comment.
🤖 Inconsistent with neighbors: os.path.join elsewhere.
experiments/paloma.py:86 and experiments/evals/exp1600_uncheatable_evals.py:84 build the same kind of slash-joined slice key with os.path.join. Switching to posixpath.join here is arguably more principled (slice names aren't filesystem paths, and os.path would produce \-separated keys on Windows — though nothing else in this codebase runs on Windows anyway). But it's inconsistent, and the inconsistency buys nothing concrete. Either:
- match the neighbors (use
os.path.join), or - file a follow-up to migrate
paloma.py/exp1600_uncheatable_evals.pytoposixpath.jointoo, so the convention is uniform.
Either is fine; just don't leave one island of posixpath sitting next to a sea of os.path.
| def raw_web_markup_raw_validation_sets() -> dict[str, Any]: | ||
| """Return raw-text eval slices covering web markup and image-adjacent text. | ||
|
|
||
| Slice names are prefixed with :data:`RAW_WEB_MARKUP_PREFIX`. Returns an | ||
| empty mapping until downloaders land in the follow-up PRs tracked by | ||
| #5056; callers should treat an empty result as "no raw-web-markup slices | ||
| are registered yet", not as an error. | ||
| """ | ||
| return { | ||
| posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset | ||
| for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items() | ||
| } |
There was a problem hiding this comment.
🤖 Function docstring says too much.
callers should treat an empty result as "no raw-web-markup slices are registered yet", not as an error.
An empty dict is already "no entries" — no caller interpretation needed. The .update(empty_dict) call site in defaults.py couldn't reasonably error on an empty mapping, so this disclaimer is defending against a scenario that can't occur.
Trim to:
def raw_web_markup_raw_validation_sets() -> dict[str, RawTextEvaluationDataset]:
"""Raw-text eval slices covering web markup and image-adjacent text, keyed by `raw_web_markup/<slice>`."""
return {
posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset
for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items()
}(Or os.path.join, per the other inline comment.)
| # Surface-preserving web/markup/image-text slices (#5056, parent #5005). | ||
| # Empty until downloaders land; see experiments/exp5056_raw_web_markup_ppl.py. | ||
| validation_sets.update(raw_web_markup_raw_validation_sets()) |
There was a problem hiding this comment.
🤖 Drop the two-line comment.
# Surface-preserving web/markup/image-text slices (#5056, parent #5005).
# Empty until downloaders land; see experiments/exp5056_raw_web_markup_ppl.py.
validation_sets.update(raw_web_markup_raw_validation_sets())Per CLAUDE.md:
Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123")
and:
Default to writing no comments. Only add one when the WHY is non-obvious
The function name raw_web_markup_raw_validation_sets already tells the reader what this adds. "Empty until downloaders land" is a statement about today's state that will be false tomorrow and, once it's false, nobody will come back to delete this comment — exactly the rot pattern the style guide warns about. Just call the function:
validation_sets.update(raw_web_markup_raw_validation_sets())…and let the reader click through for details.
|
🤖 Addressed review feedback and the scope correction. Changes in fcc6094:
Validation run:
|
Add a standalone typed helper module for raw web/markup/image-text perplexity-gap slice registration and prefixing. Keep default_raw_validation_sets() unchanged so these slices are not enabled by default, and add focused tests for helper behavior.
Part of #5056