[evals] Add raw web/markup/image-text PPL registration helper by dlwh · Pull Request #5084 · marin-community/marin

dlwh · 2026-04-22T21:56:07Z

Add a standalone typed helper module for raw web/markup/image-text perplexity-gap slice registration and prefixing. Keep default_raw_validation_sets() unchanged so these slices are not enabled by default, and add focused tests for helper behavior.

Part of #5056

…5056) Skeleton for #5056 (parent #5005). Adds an empty aggregator module for surface-preserving PPL eval slices (HTML, WARC/WAT, web tables, SVG XML, OCR strings, captions, alt-text) and wires it into default_raw_validation_sets() so follow-up PRs can register downloaders without touching the eval runner. Slices oversplit by surface form; see module docstring for the design rationale. Co-authored-by: David Hall <dlwh@users.noreply.github.com>

claude · 2026-04-22T21:56:48Z

Claude finished @dlwh's task in 2m 59s —— View job

Review

Read PR diff and changed files
Review context (related modules, issue [evals] Add raw web, markup, and image-text PPL slices #5056)
Analyze for correctness, style, and CLAUDE.md compliance
Post review feedback

Behavioral correctness is fine — the wiring into default_raw_validation_sets() is a no-op today (.update({})), so no runtime risk. My concerns are all about whether this PR should exist in its current form, and about comment/doc style.

Main concern: empty skeleton vs. land-with-first-downloader

The PR is a 70-line registration point that currently registers nothing. ACTIVE_RAW_WEB_MARKUP_DATASETS is {}, so the aggregator returns {} and default_raw_validation_sets().update({}) is a no-op.

CLAUDE.md is explicit:

Don't add features, refactor, or introduce abstractions beyond what the task requires. […] Don't design for hypothetical future requirements. […] No half-finished implementations either.

The stated justification — "keep perplexity-gap runners unchanged while follow-up downloader PRs register surface-form slices incrementally" — isn't very load-bearing: the wiring in defaults.py is exactly two lines (one import, one .update(...)), so having the first downloader PR add both the module and those two lines would not meaningfully increase its diff or review friction.

I'd suggest closing this PR and folding its content into the first downloader PR, so reviewers can see the registration point exercised by at least one real entry. If you do keep it as a skeleton, the module should shrink substantially (see inline comments).

Secondary: comments / docstrings violate CLAUDE.md

Multiple places reference the current task / parent issue numbers and describe hypothetical future PRs. CLAUDE.md:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123"), since those belong in the PR description and rot as the codebase evolves.

Specifically:

experiments/exp5056_raw_web_markup_ppl.py module docstring opens with #5056 / Parent: #5005 and refers to "tracked in [evals] Add raw web, markup, and image-text PPL slices #5056", "follow-up PRs", "design review" — all issue-ephemeral.
# Top-level constant per CLAUDE.md. is a meta-comment citing the style guide; drop it.
The comment block above ACTIVE_RAW_WEB_MARKUP_DATASETS explains what "follow-up PRs" will do with it.
The comment in defaults.py:319-320 is the same rot pattern ("Empty until downloaders land…").

Typing

ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, Any] throws away the contract the docstring states (values are RawTextEvaluationDataset). Just import the class and annotate it concretely — pyrefly will then catch a mis-typed entry in the first downloader PR.

Minor: `posixpath` vs `os.path`

posixpath.join here is inconsistent with os.path.join used in sibling files (experiments/paloma.py:86, experiments/evals/exp1600_uncheatable_evals.py:84). Match the neighbors, or migrate them too — don't leave one island.

Naming

exp5056_raw_web_markup_ppl.py encodes an issue number into the filename. The codebase already has precedent for this (exp1600_uncheatable_evals.py), so it's defensible; I'd personally drop the exp5056_ prefix since, again, the issue number will outlive its usefulness, but this is the existing convention so leaving as-is is fine.

Inline comments

Six inline comments posted on specific lines with concrete suggestions.

· Branch: claude/issue-5056-20260422-1858

claude · 2026-04-22T21:59:59Z

+# Copyright The Marin Authors
+# SPDX-License-Identifier: Apache-2.0
+
+"""
+#5056: Raw web, markup, and image-text PPL slices.
+
+Parent: #5005.
+
+Byte-level perplexity-gap eval slices that preserve surface syntax normally
+stripped by cleaned web corpora (HTML, WARC/WAT metadata, web tables, SVG XML,
+OCR strings, captions, alt-text, EXIF-like metadata, URL-heavy records). The
+goal is to surface perplexity-gap buckets that cleaned-corpus slices
+(Paloma / uncheatable-eval) hide.
+
+Targets (tracked in #5056):
+  - Common Crawl WARC/WAT: raw HTTP headers, raw HTML, WAT JSON.
+  - Web Data Commons Web Tables: raw <table> HTML plus extracted JSON metadata.
+  - SVG-Stack: SVG XML programs and captions.
+  - TextOCR / OCR-VQA: OCR strings and scene-text annotations.
+  - LAION metadata: URL / alt-text / EXIF-like fields — deferred pending
+    explicit subset selection and safety-filter review.
+
+Per #5056 design review, slices are oversplit by surface form (one entry per
+surface form, e.g. `raw_web_markup/cc_warc_html`, `raw_web_markup/cc_wat_json`)
+so the gap-finder bucket analysis in `marin/evaluation/perplexity_gap.py`
+stays clean. Grouping happens post-hoc via tags.
+
+This module is intentionally a registration point: downloaders land in
+follow-up PRs and populate ``ACTIVE_RAW_WEB_MARKUP_DATASETS``. The aggregator
+``raw_web_markup_raw_validation_sets()`` is wired into
+``experiments/defaults.py::default_raw_validation_sets()`` so new slices flow
+into ``exp_model_perplexity_gap_marin_vs_llama.py`` and its siblings without
+touching any other file.
+"""
+
+import posixpath
+from typing import Any
+
+# Prefix applied to every slice name so the gap-finder report groups the new
+# surface-preserving slices together. Top-level constant per CLAUDE.md.
+RAW_WEB_MARKUP_PREFIX = "raw_web_markup"
+
+# Populated by follow-up PRs. Keys are slice names relative to
+# ``RAW_WEB_MARKUP_PREFIX`` (e.g. ``cc_warc_html``); values are
+# ``RawTextEvaluationDataset`` instances produced via
+# ``marin.evaluation.perplexity_gap.raw_text_dataset``.
+#
+# Convention: each source contributes one entry per surface form. Do not
+# concatenate surfaces into a single ``text`` stream — the gap-finder truncates
+# each doc to ``max_doc_bytes=32_768`` and reports per-slice bpb, so mixing
+# surfaces inside a slice loses the signal we want.
+ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, Any] = {}
+
+
+def raw_web_markup_raw_validation_sets() -> dict[str, Any]:
+    """Return raw-text eval slices covering web markup and image-adjacent text.
+
+    Slice names are prefixed with :data:`RAW_WEB_MARKUP_PREFIX`. Returns an
+    empty mapping until downloaders land in the follow-up PRs tracked by
+    #5056; callers should treat an empty result as "no raw-web-markup slices
+    are registered yet", not as an error.
+    """
+    return {
+        posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset
+        for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items()
+    }


🤖 Empty registration point — consider landing with the first downloader instead.

The module doesn't do anything yet: ACTIVE_RAW_WEB_MARKUP_DATASETS is {}, so raw_web_markup_raw_validation_sets() always returns {}, and the .update({}) call in defaults.py is a no-op.

CLAUDE.md is explicit on this:

Don't add features, refactor, or introduce abstractions beyond what the task requires. […] Don't design for hypothetical future requirements. […] No half-finished implementations either.

and:

Delete dead code: unused parameters, stale options, old experiments.

The argument for this PR is "wire up the registration point now so follow-ups can register without touching defaults.py". But the wiring is two lines — one import, one .update(...). That's not reviewer-friction worth saving by pre-landing a 70-line skeleton. The follow-up PRs would touch defaults.py anyway only if this PR weren't there, which is fine. I'd rather see this module appear in the same PR as the first downloader, so it's reviewable against concrete usage (you can see at least one populated entry, confirm the prefix / naming convention matches the gap-finder's expectations, etc.).

If you do want to keep a scaffold, it should be as minimal as possible (no module-level mutable ACTIVE_… dict, no posixpath.join over an empty mapping, no docstring paragraphs that will rot before they're used).

claude · 2026-04-22T22:00:00Z

+"""
+#5056: Raw web, markup, and image-text PPL slices.
+
+Parent: #5005.
+
+Byte-level perplexity-gap eval slices that preserve surface syntax normally
+stripped by cleaned web corpora (HTML, WARC/WAT metadata, web tables, SVG XML,
+OCR strings, captions, alt-text, EXIF-like metadata, URL-heavy records). The
+goal is to surface perplexity-gap buckets that cleaned-corpus slices
+(Paloma / uncheatable-eval) hide.
+
+Targets (tracked in #5056):
+  - Common Crawl WARC/WAT: raw HTTP headers, raw HTML, WAT JSON.
+  - Web Data Commons Web Tables: raw <table> HTML plus extracted JSON metadata.
+  - SVG-Stack: SVG XML programs and captions.
+  - TextOCR / OCR-VQA: OCR strings and scene-text annotations.
+  - LAION metadata: URL / alt-text / EXIF-like fields — deferred pending
+    explicit subset selection and safety-filter review.
+
+Per #5056 design review, slices are oversplit by surface form (one entry per
+surface form, e.g. `raw_web_markup/cc_warc_html`, `raw_web_markup/cc_wat_json`)
+so the gap-finder bucket analysis in `marin/evaluation/perplexity_gap.py`
+stays clean. Grouping happens post-hoc via tags.
+
+This module is intentionally a registration point: downloaders land in
+follow-up PRs and populate ``ACTIVE_RAW_WEB_MARKUP_DATASETS``. The aggregator
+``raw_web_markup_raw_validation_sets()`` is wired into
+``experiments/defaults.py::default_raw_validation_sets()`` so new slices flow
+into ``exp_model_perplexity_gap_marin_vs_llama.py`` and its siblings without
+touching any other file.
+"""


🤖 Docstring references tracking issues that will rot.

CLAUDE.md:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123"), since those belong in the PR description and rot as the codebase evolves.

This docstring opens with #5056: … / Parent: #5005, then refers back to "#5056 design review", "tracked in #5056", and "follow-up PRs". Once those issues are closed or renumbered, the docstring becomes misleading.

A module docstring should describe what the module is in the current codebase (one or two sentences about surface-preserving PPL slices, the prefix convention, and where the dataset values come from). Targets and rationale belong in the issue, not here.

claude · 2026-04-22T22:00:00Z

+# Prefix applied to every slice name so the gap-finder report groups the new
+# surface-preserving slices together. Top-level constant per CLAUDE.md.
+RAW_WEB_MARKUP_PREFIX = "raw_web_markup"


🤖 Drop "per CLAUDE.md" meta-comment.

# Top-level constant per CLAUDE.md. explains why the code is written this way in this repo, not anything about the constant itself. Style-guide adherence is not a code comment; a future reader doesn't need (or want) a citation of the style guide inline with the declaration. Just:

RAW_WEB_MARKUP_PREFIX = "raw_web_markup"

The name and value are self-explanatory; no comment needed.

claude · 2026-04-22T22:00:01Z

+# Populated by follow-up PRs. Keys are slice names relative to
+# ``RAW_WEB_MARKUP_PREFIX`` (e.g. ``cc_warc_html``); values are
+# ``RawTextEvaluationDataset`` instances produced via
+# ``marin.evaluation.perplexity_gap.raw_text_dataset``.
+#
+# Convention: each source contributes one entry per surface form. Do not
+# concatenate surfaces into a single ``text`` stream — the gap-finder truncates
+# each doc to ``max_doc_bytes=32_768`` and reports per-slice bpb, so mixing
+# surfaces inside a slice loses the signal we want.
+ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, Any] = {}


🤖 Type the mapping concretely and trim the comment.

Two issues:

The docstring above says "values are RawTextEvaluationDataset instances produced via marin.evaluation.perplexity_gap.raw_text_dataset." The type annotation dict[str, Any] throws that signal away. Just import RawTextEvaluationDataset and annotate dict[str, RawTextEvaluationDataset]. That gives the type checker and readers the real contract — and, importantly, if the first downloader PR sticks a raw str path in here by mistake, pyrefly catches it.

The "Populated by follow-up PRs" / "Convention: each source contributes one entry per surface form…" commentary is explaining what a not-yet-written caller should do. That rationale belongs in the issue. The enforceable invariant ("one RawTextEvaluationDataset per surface form; do not concatenate surfaces within a slice") can live in the function docstring if it must, not as a comment on an empty dict.

Typed form (assuming you keep the skeleton — see top-level comment for the "maybe just delete" argument):

from marin.evaluation.perplexity_gap import RawTextEvaluationDataset ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, RawTextEvaluationDataset] = {}

claude · 2026-04-22T22:00:02Z

+import posixpath
+from typing import Any


🤖 Inconsistent with neighbors: os.path.join elsewhere.

experiments/paloma.py:86 and experiments/evals/exp1600_uncheatable_evals.py:84 build the same kind of slash-joined slice key with os.path.join. Switching to posixpath.join here is arguably more principled (slice names aren't filesystem paths, and os.path would produce \-separated keys on Windows — though nothing else in this codebase runs on Windows anyway). But it's inconsistent, and the inconsistency buys nothing concrete. Either:

match the neighbors (use os.path.join), or

file a follow-up to migrate paloma.py / exp1600_uncheatable_evals.py to posixpath.join too, so the convention is uniform.

Either is fine; just don't leave one island of posixpath sitting next to a sea of os.path.

claude · 2026-04-22T22:00:02Z

+def raw_web_markup_raw_validation_sets() -> dict[str, Any]:
+    """Return raw-text eval slices covering web markup and image-adjacent text.
+
+    Slice names are prefixed with :data:`RAW_WEB_MARKUP_PREFIX`. Returns an
+    empty mapping until downloaders land in the follow-up PRs tracked by
+    #5056; callers should treat an empty result as "no raw-web-markup slices
+    are registered yet", not as an error.
+    """
+    return {
+        posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset
+        for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items()
+    }


🤖 Function docstring says too much.

callers should treat an empty result as "no raw-web-markup slices are registered yet", not as an error.

An empty dict is already "no entries" — no caller interpretation needed. The .update(empty_dict) call site in defaults.py couldn't reasonably error on an empty mapping, so this disclaimer is defending against a scenario that can't occur.

Trim to:

def raw_web_markup_raw_validation_sets() -> dict[str, RawTextEvaluationDataset]: """Raw-text eval slices covering web markup and image-adjacent text, keyed by `raw_web_markup/<slice>`.""" return { posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items() }

(Or os.path.join, per the other inline comment.)

claude · 2026-04-22T22:00:03Z

+    # Surface-preserving web/markup/image-text slices (#5056, parent #5005).
+    # Empty until downloaders land; see experiments/exp5056_raw_web_markup_ppl.py.
+    validation_sets.update(raw_web_markup_raw_validation_sets())


🤖 Drop the two-line comment.

# Surface-preserving web/markup/image-text slices (#5056, parent #5005). # Empty until downloaders land; see experiments/exp5056_raw_web_markup_ppl.py. validation_sets.update(raw_web_markup_raw_validation_sets())

Per CLAUDE.md:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123")

and:

Default to writing no comments. Only add one when the WHY is non-obvious

The function name raw_web_markup_raw_validation_sets already tells the reader what this adds. "Empty until downloaders land" is a statement about today's state that will be false tomorrow and, once it's false, nobody will come back to delete this comment — exactly the rot pattern the style guide warns about. Just call the function:

validation_sets.update(raw_web_markup_raw_validation_sets())

…and let the reader click through for details.

… sets

dlwh · 2026-04-22T22:09:17Z

🤖 Addressed review feedback and the scope correction.

Changes in fcc6094:

Removed the default_raw_validation_sets() wiring, so raw-web-markup slices are not enabled by default.
Kept experiments/exp5056_raw_web_markup_ppl.py as a standalone typed helper/registry module and trimmed issue-ephemeral commentary.
Added focused tests in tests/evals/test_exp5056_raw_web_markup_ppl.py for prefixing and registry behavior.

Validation run:

uv run pytest tests/evals/test_exp5056_raw_web_markup_ppl.py -q
uv run pytest tests/test_dry_run.py -k "exp_model_perplexity_gap_marin_vs_llama" -q
PYTHONPATH=. uv run python smoke check that default_raw_validation_sets() has no raw_web_markup/*
./infra/pre-commit.py --all-files --fix

dlwh added the agent-generated Created by automation/agent label Apr 22, 2026

dlwh mentioned this pull request Apr 22, 2026

[evals] Add raw web, markup, and image-text PPL slices #5056

Open

claude Bot reviewed Apr 22, 2026

View reviewed changes

[evals] Keep raw-web-markup registry helper out of default validation…

fcc6094

… sets

dlwh changed the title ~~[evals] Add raw web/markup/image-text PPL slice registration point~~ [evals] Add raw web/markup/image-text PPL registration helper Apr 22, 2026

This was referenced Apr 22, 2026

[evals] Add diff and patch PPL eval sets #5095

Open

[evals] Add diff/patch PPL registration helper #5103

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[evals] Add raw web/markup/image-text PPL registration helper#5084

[evals] Add raw web/markup/image-text PPL registration helper#5084
dlwh wants to merge 2 commits intomainfrom
claude/issue-5056-20260422-1858

dlwh commented Apr 22, 2026 •

edited

Loading

Uh oh!

claude Bot commented Apr 22, 2026 •

edited

Loading

Uh oh!

claude Bot Apr 22, 2026

Uh oh!

claude Bot Apr 22, 2026

Uh oh!

claude Bot Apr 22, 2026

Uh oh!

claude Bot Apr 22, 2026

Uh oh!

claude Bot Apr 22, 2026

Uh oh!

claude Bot Apr 22, 2026

Uh oh!

claude Bot Apr 22, 2026

Uh oh!

dlwh commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dlwh commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

Main concern: empty skeleton vs. land-with-first-downloader

Secondary: comments / docstrings violate CLAUDE.md

Typing

Minor: posixpath vs os.path

Naming

Inline comments

Uh oh!

claude Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

dlwh commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dlwh commented Apr 22, 2026 •

edited

Loading

claude Bot commented Apr 22, 2026 •

edited

Loading

Minor: `posixpath` vs `os.path`