[evals] Add SVG-backed raw web markup PPL slices#5130

Draft
dlwh wants to merge 3 commits into main from agent/20260423-fix-5056

Conversation


@dlwh dlwh commented Apr 23, 2026

Add a first small, non-empty raw web/markup PPL cut by teaching the raw-text eval path to carry explicit Hugging Face splits and by registering opt-in SVG-Stack validation and test slices. This keeps the work HF-backed, preserves exact SVG XML, and leaves default validation sets unchanged.

Part of #5005

github-actions Bot and others added 3 commits April 22, 2026 19:04
…5056)

Skeleton for #5056 (parent #5005). Adds an empty aggregator module for
surface-preserving PPL eval slices (HTML, WARC/WAT, web tables, SVG XML,
OCR strings, captions, alt-text) and wires it into
default_raw_validation_sets() so follow-up PRs can register downloaders
without touching the eval runner. Slices oversplit by surface form; see
module docstring for the design rationale.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 23, 2026
@claude
Contributor

claude Bot commented Apr 23, 2026

Claude finished @dlwh's task in 3m 38s


PR Review

Summary. The PR's shape is right — HF-backed slices registered behind an opt-in helper, keeping defaults unchanged — but the split plumbing doesn't reach the dataset loader, so the new SVG slices will be silently skipped at runtime rather than producing PPL numbers. This needs a fix before merging; the existing tests don't catch it because they only verify object shape, not iteration.

🔴 Critical: split is set on HfDatasetSourceConfig but never consulted end-to-end

The chain is:

  1. raw_text_dataset(SVG_STACK_DATASET, split="val", ...) → RawTextEvaluationDataset(split="val")
  2. _to_dataset_component (lib/marin/src/marin/evaluation/perplexity_gap.py:187-193) builds HfDatasetSourceConfig(id=..., splits=["val"])
  3. At runtime, iter_raw_text_documents (lib/levanter/src/levanter/analysis/perplexity_gap.py:434) calls component.source.get_shard_source("validation") — hardcoded.
  4. HfDatasetSourceConfig.get_shard_source treats splits as a filter: "validation" not in ["val"] → returns None.
  5. iter_raw_text_documents sees None and silently continues past the dataset.

Even if you dropped splits, step 3 would then call WrappedHFDataSource(id="starvector/svg-stack", split="validation"), and starvector/svg-stack only has val/test splits, so HF would error. So the current wiring is strictly worse than a no-op — it looks configured but produces zero docs.
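The filter semantics in steps 4–5 can be sketched with a minimal stand-in (not the real class — a simplified model of the behavior described above, where splits acts as an allow-list and a miss returns None rather than raising):

```python
from dataclasses import dataclass, field


@dataclass
class StubHfDatasetSourceConfig:
    """Simplified stand-in for HfDatasetSourceConfig's filter behavior."""

    id: str
    splits: list[str] = field(default_factory=list)

    def get_shard_source(self, split: str):
        # Treat `splits` as a filter: an unknown split yields None, not an error,
        # and the caller then silently continues past the dataset.
        if self.splits and split not in self.splits:
            return None
        return (self.id, split)


cfg = StubHfDatasetSourceConfig(id="starvector/svg-stack", splits=["val"])
assert cfg.get_shard_source("validation") is None  # hardcoded lookup misses
assert cfg.get_shard_source("val") == ("starvector/svg-stack", "val")
```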

Fix direction: thread the configured split through DatasetComponent (e.g. an optional split attribute) and use it at the get_shard_source call site in levanter/analysis/perplexity_gap.py instead of the literal "validation". Add a test that invokes iter_raw_text_documents (or at least component.source.get_shard_source("validation")) so the next regression of this kind fails loudly.

🟡 Tests don't cover the real failure mode

test_svg_stack_slice_materializes_as_hf_dataset_component asserts component.source.splits == ["test"], which is the value you just set — it can't catch the mismatch with the hardcoded "validation" lookup at runtime. Extending the suite to exercise iter_raw_text_documents on an HF slice (or at minimum get_shard_source("validation") on a built component) would have caught this.
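A self-contained sketch of such a regression test, using a stub source (the real test would build the component via _to_dataset_component and call the real get_shard_source):

```python
from dataclasses import dataclass, field


@dataclass
class FakeHfSource:
    """Stub mirroring the splits-as-filter lookup behavior."""

    id: str
    splits: list[str] = field(default_factory=list)

    def get_shard_source(self, split: str):
        return (self.id, split) if split in self.splits else None


def test_runtime_lookup_reaches_configured_split() -> None:
    source = FakeHfSource(id="starvector/svg-stack", splits=["test"])
    # Exercise the runtime lookup, not just the value set on the config:
    # the configured split must be reachable, while today's hardcoded
    # "validation" comes back None.
    assert source.get_shard_source("test") is not None
    assert source.get_shard_source("validation") is None


test_runtime_lookup_reaches_configured_split()
```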

✅ Things I'd call out as correct

  • Making raw_web_markup_raw_validation_sets() opt-in instead of folding it into default_raw_validation_sets() is the right call — avoids perturbing unrelated gap runs while the slice is still a pilot.
  • The ValueError guardrail in raw_text_dataset rejecting non-validation splits for file-path sources is a nice bit of defensiveness.
  • Tags (raw_web_markup, issue:5056, source:svg_stack, surface:svg_xml, split:<x>) give you a clean way to slice reports afterwards; good.

🟢 Minor / stylistic

  • os.path.join for dictionary keys (svg_stack/svg_xml_val) matches existing paloma/uncheatable conventions, but these are logical keys, not paths — a literal "svg_stack/svg_xml_val" would read more clearly. Feel free to ignore; consistency with neighbors is also a valid choice.

Tasks

  • Read PR changes
  • Review perplexity_gap.py modifications
  • Review exp5056_raw_web_markup_ppl.py
  • Review test file
  • Verify the hardcoded-"validation" bug against actual levanter code
  • Post review

Comment on lines +65 to +75
def test_svg_stack_slice_materializes_as_hf_dataset_component() -> None:
    datasets = raw_web_markup.raw_web_markup_raw_validation_sets()
    component = _to_dataset_component(
        datasets[os.path.join(raw_web_markup.RAW_WEB_MARKUP_PREFIX, "svg_stack", "svg_xml_test")]
    )

    assert isinstance(component.source, HfDatasetSourceConfig)
    assert component.source.id == "starvector/svg-stack"
    assert component.source.splits == ["test"]
    assert component.format.text_key == "Svg"
    assert component.tags == ["raw_web_markup", "issue:5056", "source:svg_stack", "surface:svg_xml", "split:test"]
Contributor

This test only asserts component.source.splits == ["test"], which is the value you set, not the value the gap finder actually uses. At runtime levanter/analysis/perplexity_gap.py:434 calls component.source.get_shard_source("validation") — the literal string — so with splits=["test"] (or ["val"]) this returns None and the slice is silently skipped.

Would be worth adding a test that invokes iter_raw_text_documents (or at least component.source.get_shard_source("validation")) against an HfDatasetSourceConfig built from a non-validation split, so the next regression of this kind fails loudly.

Comment on lines +43 to +58
ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, RawTextEvaluationDataset] = {
    os.path.join("svg_stack", "svg_xml_val"): _hf_raw_web_markup_dataset(
        SVG_STACK_DATASET,
        text_key=SVG_TEXT_KEY,
        split="val",
        source_tag=SVG_STACK_SOURCE_TAG,
        surface_tag=SVG_XML_SURFACE_TAG,
    ),
    os.path.join("svg_stack", "svg_xml_test"): _hf_raw_web_markup_dataset(
        SVG_STACK_DATASET,
        text_key=SVG_TEXT_KEY,
        split="test",
        source_tag=SVG_STACK_SOURCE_TAG,
        surface_tag=SVG_XML_SURFACE_TAG,
    ),
}
Contributor

os.path.join here produces svg_stack/svg_xml_val on Linux, which matches the existing paloma/uncheatable conventions, so fine — but note these are logical dictionary keys, not filesystem paths. A literal "svg_stack/svg_xml_val" (or a small helper like _slice_key(source, name)) would read more clearly and wouldn't suggest filesystem semantics. Optional stylistic nit; feel free to ignore.
