Skip to content

[evals] Add raw web/markup/image-text PPL registration helper#5084

Open
dlwh wants to merge 2 commits intomainfrom
claude/issue-5056-20260422-1858
Open

[evals] Add raw web/markup/image-text PPL registration helper#5084
dlwh wants to merge 2 commits intomainfrom
claude/issue-5056-20260422-1858

Conversation

@dlwh
Copy link
Copy Markdown
Member

@dlwh dlwh commented Apr 22, 2026

Add a standalone typed helper module for raw web/markup/image-text perplexity-gap slice registration and prefixing. Keep default_raw_validation_sets() unchanged so these slices are not enabled by default, and add focused tests for helper behavior.

Part of #5056

…5056)

Skeleton for #5056 (parent #5005). Adds an empty aggregator module for
surface-preserving PPL eval slices (HTML, WARC/WAT, web tables, SVG XML,
OCR strings, captions, alt-text) and wires it into
default_raw_validation_sets() so follow-up PRs can register downloaders
without touching the eval runner. Slices oversplit by surface form; see
module docstring for the design rationale.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 22, 2026

Claude finished @dlwh's task in 2m 59s —— View job


Review


Behavioral correctness is fine — the wiring into default_raw_validation_sets() is a no-op today (.update({})), so no runtime risk. My concerns are all about whether this PR should exist in its current form, and about comment/doc style.

Main concern: empty skeleton vs. land-with-first-downloader

The PR is a 70-line registration point that currently registers nothing. ACTIVE_RAW_WEB_MARKUP_DATASETS is {}, so the aggregator returns {} and default_raw_validation_sets().update({}) is a no-op.

CLAUDE.md is explicit:

Don't add features, refactor, or introduce abstractions beyond what the task requires. […] Don't design for hypothetical future requirements. […] No half-finished implementations either.

The stated justification — "keep perplexity-gap runners unchanged while follow-up downloader PRs register surface-form slices incrementally" — isn't very load-bearing: the wiring in defaults.py is exactly two lines (one import, one .update(...)), so having the first downloader PR add both the module and those two lines would not meaningfully increase its diff or review friction.

I'd suggest closing this PR and folding its content into the first downloader PR, so reviewers can see the registration point exercised by at least one real entry. If you do keep it as a skeleton, the module should shrink substantially (see inline comments).

Secondary: comments / docstrings violate CLAUDE.md

Multiple places reference the current task / parent issue numbers and describe hypothetical future PRs. CLAUDE.md:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123"), since those belong in the PR description and rot as the codebase evolves.

Specifically:

  • experiments/exp5056_raw_web_markup_ppl.py module docstring opens with #5056 / Parent: #5005 and refers to "tracked in [evals] Add raw web, markup, and image-text PPL slices #5056", "follow-up PRs", "design review" — all issue-ephemeral.
  • # Top-level constant per CLAUDE.md. is a meta-comment citing the style guide; drop it.
  • The comment block above ACTIVE_RAW_WEB_MARKUP_DATASETS explains what "follow-up PRs" will do with it.
  • The comment in defaults.py:319-320 is the same rot pattern ("Empty until downloaders land…").

Typing

ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, Any] throws away the contract the docstring states (values are RawTextEvaluationDataset). Just import the class and annotate it concretely — pyrefly will then catch a mis-typed entry in the first downloader PR.

Minor: posixpath vs os.path

posixpath.join here is inconsistent with os.path.join used in sibling files (experiments/paloma.py:86, experiments/evals/exp1600_uncheatable_evals.py:84). Match the neighbors, or migrate them too — don't leave one island.

Naming

exp5056_raw_web_markup_ppl.py encodes an issue number into the filename. The codebase already has precedent for this (exp1600_uncheatable_evals.py), so it's defensible; I'd personally drop the exp5056_ prefix since, again, the issue number will outlive its usefulness, but this is the existing convention so leaving as-is is fine.

Inline comments

Six inline comments posted on specific lines with concrete suggestions.


· Branch: claude/issue-5056-20260422-1858

Comment on lines +1 to +66
# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

"""
#5056: Raw web, markup, and image-text PPL slices.

Parent: #5005.

Byte-level perplexity-gap eval slices that preserve surface syntax normally
stripped by cleaned web corpora (HTML, WARC/WAT metadata, web tables, SVG XML,
OCR strings, captions, alt-text, EXIF-like metadata, URL-heavy records). The
goal is to surface perplexity-gap buckets that cleaned-corpus slices
(Paloma / uncheatable-eval) hide.

Targets (tracked in #5056):
- Common Crawl WARC/WAT: raw HTTP headers, raw HTML, WAT JSON.
- Web Data Commons Web Tables: raw <table> HTML plus extracted JSON metadata.
- SVG-Stack: SVG XML programs and captions.
- TextOCR / OCR-VQA: OCR strings and scene-text annotations.
- LAION metadata: URL / alt-text / EXIF-like fields — deferred pending
explicit subset selection and safety-filter review.

Per #5056 design review, slices are oversplit by surface form (one entry per
surface form, e.g. `raw_web_markup/cc_warc_html`, `raw_web_markup/cc_wat_json`)
so the gap-finder bucket analysis in `marin/evaluation/perplexity_gap.py`
stays clean. Grouping happens post-hoc via tags.

This module is intentionally a registration point: downloaders land in
follow-up PRs and populate ``ACTIVE_RAW_WEB_MARKUP_DATASETS``. The aggregator
``raw_web_markup_raw_validation_sets()`` is wired into
``experiments/defaults.py::default_raw_validation_sets()`` so new slices flow
into ``exp_model_perplexity_gap_marin_vs_llama.py`` and its siblings without
touching any other file.
"""

import posixpath
from typing import Any

# Prefix applied to every slice name so the gap-finder report groups the new
# surface-preserving slices together. Top-level constant per CLAUDE.md.
RAW_WEB_MARKUP_PREFIX = "raw_web_markup"

# Populated by follow-up PRs. Keys are slice names relative to
# ``RAW_WEB_MARKUP_PREFIX`` (e.g. ``cc_warc_html``); values are
# ``RawTextEvaluationDataset`` instances produced via
# ``marin.evaluation.perplexity_gap.raw_text_dataset``.
#
# Convention: each source contributes one entry per surface form. Do not
# concatenate surfaces into a single ``text`` stream — the gap-finder truncates
# each doc to ``max_doc_bytes=32_768`` and reports per-slice bpb, so mixing
# surfaces inside a slice loses the signal we want.
ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, Any] = {}


def raw_web_markup_raw_validation_sets() -> dict[str, Any]:
"""Return raw-text eval slices covering web markup and image-adjacent text.

Slice names are prefixed with :data:`RAW_WEB_MARKUP_PREFIX`. Returns an
empty mapping until downloaders land in the follow-up PRs tracked by
#5056; callers should treat an empty result as "no raw-web-markup slices
are registered yet", not as an error.
"""
return {
posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset
for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Empty registration point — consider landing with the first downloader instead.

The module doesn't do anything yet: ACTIVE_RAW_WEB_MARKUP_DATASETS is {}, so raw_web_markup_raw_validation_sets() always returns {}, and the .update({}) call in defaults.py is a no-op.

CLAUDE.md is explicit on this:

Don't add features, refactor, or introduce abstractions beyond what the task requires. […] Don't design for hypothetical future requirements. […] No half-finished implementations either.

and:

Delete dead code: unused parameters, stale options, old experiments.

The argument for this PR is "wire up the registration point now so follow-ups can register without touching defaults.py". But the wiring is two lines — one import, one .update(...). That's not reviewer-friction worth saving by pre-landing a 70-line skeleton. The follow-up PRs would touch defaults.py anyway only if this PR weren't there, which is fine. I'd rather see this module appear in the same PR as the first downloader, so it's reviewable against concrete usage (you can see at least one populated entry, confirm the prefix / naming convention matches the gap-finder's expectations, etc.).

If you do want to keep a scaffold, it should be as minimal as possible (no module-level mutable ACTIVE_… dict, no posixpath.join over an empty mapping, no docstring paragraphs that will rot before they're used).

Comment on lines +4 to +34
"""
#5056: Raw web, markup, and image-text PPL slices.

Parent: #5005.

Byte-level perplexity-gap eval slices that preserve surface syntax normally
stripped by cleaned web corpora (HTML, WARC/WAT metadata, web tables, SVG XML,
OCR strings, captions, alt-text, EXIF-like metadata, URL-heavy records). The
goal is to surface perplexity-gap buckets that cleaned-corpus slices
(Paloma / uncheatable-eval) hide.

Targets (tracked in #5056):
- Common Crawl WARC/WAT: raw HTTP headers, raw HTML, WAT JSON.
- Web Data Commons Web Tables: raw <table> HTML plus extracted JSON metadata.
- SVG-Stack: SVG XML programs and captions.
- TextOCR / OCR-VQA: OCR strings and scene-text annotations.
- LAION metadata: URL / alt-text / EXIF-like fields — deferred pending
explicit subset selection and safety-filter review.

Per #5056 design review, slices are oversplit by surface form (one entry per
surface form, e.g. `raw_web_markup/cc_warc_html`, `raw_web_markup/cc_wat_json`)
so the gap-finder bucket analysis in `marin/evaluation/perplexity_gap.py`
stays clean. Grouping happens post-hoc via tags.

This module is intentionally a registration point: downloaders land in
follow-up PRs and populate ``ACTIVE_RAW_WEB_MARKUP_DATASETS``. The aggregator
``raw_web_markup_raw_validation_sets()`` is wired into
``experiments/defaults.py::default_raw_validation_sets()`` so new slices flow
into ``exp_model_perplexity_gap_marin_vs_llama.py`` and its siblings without
touching any other file.
"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Docstring references tracking issues that will rot.

CLAUDE.md:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123"), since those belong in the PR description and rot as the codebase evolves.

This docstring opens with #5056: … / Parent: #5005, then refers back to "#5056 design review", "tracked in #5056", and "follow-up PRs". Once those issues are closed or renumbered, the docstring becomes misleading.

A module docstring should describe what the module is in the current codebase (one or two sentences about surface-preserving PPL slices, the prefix convention, and where the dataset values come from). Targets and rationale belong in the issue, not here.

Comment on lines +39 to +41
# Prefix applied to every slice name so the gap-finder report groups the new
# surface-preserving slices together. Top-level constant per CLAUDE.md.
RAW_WEB_MARKUP_PREFIX = "raw_web_markup"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Drop "per CLAUDE.md" meta-comment.

# Top-level constant per CLAUDE.md. explains why the code is written this way in this repo, not anything about the constant itself. Style-guide adherence is not a code comment; a future reader doesn't need (or want) a citation of the style guide inline with the declaration. Just:

RAW_WEB_MARKUP_PREFIX = "raw_web_markup"

The name and value are self-explanatory; no comment needed.

Comment on lines +43 to +52
# Populated by follow-up PRs. Keys are slice names relative to
# ``RAW_WEB_MARKUP_PREFIX`` (e.g. ``cc_warc_html``); values are
# ``RawTextEvaluationDataset`` instances produced via
# ``marin.evaluation.perplexity_gap.raw_text_dataset``.
#
# Convention: each source contributes one entry per surface form. Do not
# concatenate surfaces into a single ``text`` stream — the gap-finder truncates
# each doc to ``max_doc_bytes=32_768`` and reports per-slice bpb, so mixing
# surfaces inside a slice loses the signal we want.
ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, Any] = {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Type the mapping concretely and trim the comment.

Two issues:

  1. The docstring above says "values are RawTextEvaluationDataset instances produced via marin.evaluation.perplexity_gap.raw_text_dataset." The type annotation dict[str, Any] throws that signal away. Just import RawTextEvaluationDataset and annotate dict[str, RawTextEvaluationDataset]. That gives the type checker and readers the real contract — and, importantly, if the first downloader PR sticks a raw str path in here by mistake, pyrefly catches it.

  2. The "Populated by follow-up PRs" / "Convention: each source contributes one entry per surface form…" commentary is explaining what a not-yet-written caller should do. That rationale belongs in the issue. The enforceable invariant ("one RawTextEvaluationDataset per surface form; do not concatenate surfaces within a slice") can live in the function docstring if it must, not as a comment on an empty dict.

Typed form (assuming you keep the skeleton — see top-level comment for the "maybe just delete" argument):

from marin.evaluation.perplexity_gap import RawTextEvaluationDataset

ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, RawTextEvaluationDataset] = {}

Comment on lines +36 to +37
import posixpath
from typing import Any
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Inconsistent with neighbors: os.path.join elsewhere.

experiments/paloma.py:86 and experiments/evals/exp1600_uncheatable_evals.py:84 build the same kind of slash-joined slice key with os.path.join. Switching to posixpath.join here is arguably more principled (slice names aren't filesystem paths, and os.path would produce \-separated keys on Windows — though nothing else in this codebase runs on Windows anyway). But it's inconsistent, and the inconsistency buys nothing concrete. Either:

  • match the neighbors (use os.path.join), or
  • file a follow-up to migrate paloma.py / exp1600_uncheatable_evals.py to posixpath.join too, so the convention is uniform.

Either is fine; just don't leave one island of posixpath sitting next to a sea of os.path.

Comment on lines +55 to +66
def raw_web_markup_raw_validation_sets() -> dict[str, Any]:
"""Return raw-text eval slices covering web markup and image-adjacent text.

Slice names are prefixed with :data:`RAW_WEB_MARKUP_PREFIX`. Returns an
empty mapping until downloaders land in the follow-up PRs tracked by
#5056; callers should treat an empty result as "no raw-web-markup slices
are registered yet", not as an error.
"""
return {
posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset
for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Function docstring says too much.

callers should treat an empty result as "no raw-web-markup slices are registered yet", not as an error.

An empty dict is already "no entries" — no caller interpretation needed. The .update(empty_dict) call site in defaults.py couldn't reasonably error on an empty mapping, so this disclaimer is defending against a scenario that can't occur.

Trim to:

def raw_web_markup_raw_validation_sets() -> dict[str, RawTextEvaluationDataset]:
    """Raw-text eval slices covering web markup and image-adjacent text, keyed by `raw_web_markup/<slice>`."""
    return {
        posixpath.join(RAW_WEB_MARKUP_PREFIX, slice_name): dataset
        for slice_name, dataset in ACTIVE_RAW_WEB_MARKUP_DATASETS.items()
    }

(Or os.path.join, per the other inline comment.)

Comment thread experiments/defaults.py Outdated
Comment on lines +319 to +321
# Surface-preserving web/markup/image-text slices (#5056, parent #5005).
# Empty until downloaders land; see experiments/exp5056_raw_web_markup_ppl.py.
validation_sets.update(raw_web_markup_raw_validation_sets())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Drop the two-line comment.

# Surface-preserving web/markup/image-text slices (#5056, parent #5005).
# Empty until downloaders land; see experiments/exp5056_raw_web_markup_ppl.py.
validation_sets.update(raw_web_markup_raw_validation_sets())

Per CLAUDE.md:

Don't reference the current task, fix, or callers ("used by X", "added for the Y flow", "handles the case from issue #123")

and:

Default to writing no comments. Only add one when the WHY is non-obvious

The function name raw_web_markup_raw_validation_sets already tells the reader what this adds. "Empty until downloaders land" is a statement about today's state that will be false tomorrow and, once it's false, nobody will come back to delete this comment — exactly the rot pattern the style guide warns about. Just call the function:

validation_sets.update(raw_web_markup_raw_validation_sets())

…and let the reader click through for details.

@dlwh dlwh changed the title [evals] Add raw web/markup/image-text PPL slice registration point [evals] Add raw web/markup/image-text PPL registration helper Apr 22, 2026
@dlwh
Copy link
Copy Markdown
Member Author

dlwh commented Apr 22, 2026

🤖 Addressed review feedback and the scope correction.

Changes in fcc6094:

  • Removed the default_raw_validation_sets() wiring, so raw-web-markup slices are not enabled by default.
  • Kept experiments/exp5056_raw_web_markup_ppl.py as a standalone typed helper/registry module and trimmed issue-ephemeral commentary.
  • Added focused tests in tests/evals/test_exp5056_raw_web_markup_ppl.py for prefixing and registry behavior.

Validation run:

  • uv run pytest tests/evals/test_exp5056_raw_web_markup_ppl.py -q
  • uv run pytest tests/test_dry_run.py -k "exp_model_perplexity_gap_marin_vs_llama" -q
  • PYTHONPATH=. uv run python smoke check that default_raw_validation_sets() has no raw_web_markup/*
  • ./infra/pre-commit.py --all-files --fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant