[evals] Add SVG-backed raw web markup PPL slices#5130

Draft
dlwh wants to merge 3 commits into main from agent/20260423-fix-5056

Conversation


@dlwh dlwh commented Apr 23, 2026

Add a first small, non-empty raw web/markup PPL cut by teaching the raw-text eval path to carry explicit Hugging Face splits and by registering opt-in SVG-Stack validation and test slices. This keeps the work HF-backed, preserves exact SVG XML, and leaves default validation sets unchanged.

Part of #5005

github-actions Bot and others added 3 commits April 22, 2026 19:04
…5056)

Skeleton for #5056 (parent #5005). Adds an empty aggregator module for
surface-preserving PPL eval slices (HTML, WARC/WAT, web tables, SVG XML,
OCR strings, captions, alt-text) and wires it into
default_raw_validation_sets() so follow-up PRs can register downloaders
without touching the eval runner. Slices oversplit by surface form; see
module docstring for the design rationale.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 23, 2026
@claude
Contributor

claude Bot commented Apr 23, 2026

Claude finished @dlwh's task in 3m 38s


PR Review

Summary. The PR's shape is right — HF-backed slices registered behind an opt-in helper, keeping defaults unchanged — but the split plumbing doesn't reach the dataset loader, so the new SVG slices will be silently skipped at runtime rather than producing PPL numbers. This needs a fix before merging; the existing tests don't catch it because they only verify object shape, not iteration.

🔴 Critical: split is set on HfDatasetSourceConfig but never consulted end-to-end

The chain is:

  1. raw_text_dataset(SVG_STACK_DATASET, split="val", ...) → RawTextEvaluationDataset(split="val")
  2. _to_dataset_component (lib/marin/src/marin/evaluation/perplexity_gap.py:187-193) builds HfDatasetSourceConfig(id=..., splits=["val"])
  3. At runtime, iter_raw_text_documents (lib/levanter/src/levanter/analysis/perplexity_gap.py:434) calls component.source.get_shard_source("validation") — hardcoded.
  4. HfDatasetSourceConfig.get_shard_source treats splits as a filter: "validation" not in ["val"] → returns None.
  5. iter_raw_text_documents sees None and silently continues past the dataset.

Even if you dropped splits, step 3 would then call WrappedHFDataSource(id="starvector/svg-stack", split="validation"), and starvector/svg-stack only has val/test splits, so HF would error. So the current wiring is strictly worse than a no-op — it looks configured but produces zero docs.
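The filter semantics in steps 4–5 can be sketched with a minimal stand-in (not the real class — a simplified model of the behavior described above, where splits acts as an allow-list and a miss returns None rather than raising):

```python
from dataclasses import dataclass, field


@dataclass
class StubHfDatasetSourceConfig:
    """Simplified stand-in for HfDatasetSourceConfig's filter behavior."""

    id: str
    splits: list[str] = field(default_factory=list)

    def get_shard_source(self, split: str):
        # Treat `splits` as a filter: an unknown split yields None, not an error,
        # and the caller then silently continues past the dataset.
        if self.splits and split not in self.splits:
            return None
        return (self.id, split)


cfg = StubHfDatasetSourceConfig(id="starvector/svg-stack", splits=["val"])
assert cfg.get_shard_source("validation") is None  # hardcoded lookup misses
assert cfg.get_shard_source("val") == ("starvector/svg-stack", "val")
```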

Fix direction: thread the configured split through DatasetComponent (e.g. an optional split attribute) and use it at the get_shard_source call site in levanter/analysis/perplexity_gap.py instead of the literal "validation". Add a test that invokes iter_raw_text_documents (or at least component.source.get_shard_source("validation")) so the next regression of this kind fails loudly.

🟡 Tests don't cover the real failure mode

test_svg_stack_slice_materializes_as_hf_dataset_component asserts component.source.splits == ["test"], which is the value you just set — it can't catch the mismatch with the hardcoded "validation" lookup at runtime. Extending the suite to exercise iter_raw_text_documents on an HF slice (or at minimum get_shard_source("validation") on a built component) would have caught this.
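A self-contained sketch of such a regression test, using a stub source (the real test would build the component via _to_dataset_component and call the real get_shard_source):

```python
from dataclasses import dataclass, field


@dataclass
class FakeHfSource:
    """Stub mirroring the splits-as-filter lookup behavior."""

    id: str
    splits: list[str] = field(default_factory=list)

    def get_shard_source(self, split: str):
        return (self.id, split) if split in self.splits else None


def test_runtime_lookup_reaches_configured_split() -> None:
    source = FakeHfSource(id="starvector/svg-stack", splits=["test"])
    # Exercise the runtime lookup, not just the value set on the config:
    # the configured split must be reachable, while today's hardcoded
    # "validation" comes back None.
    assert source.get_shard_source("test") is not None
    assert source.get_shard_source("validation") is None


test_runtime_lookup_reaches_configured_split()
```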

✅ Things I'd call out as correct

  • Making raw_web_markup_raw_validation_sets() opt-in instead of folding it into default_raw_validation_sets() is the right call — avoids perturbing unrelated gap runs while the slice is still a pilot.
  • The ValueError guardrail in raw_text_dataset rejecting non-validation splits for file-path sources is a nice bit of defensiveness.
  • Tags (raw_web_markup, issue:5056, source:svg_stack, surface:svg_xml, split:<x>) give you a clean way to slice reports afterwards; good.

🟢 Minor / stylistic

  • os.path.join for dictionary keys (svg_stack/svg_xml_val) matches existing paloma/uncheatable conventions, but these are logical keys, not paths — a literal "svg_stack/svg_xml_val" would read more clearly. Feel free to ignore; consistency with neighbors is also a valid choice.

Tasks

  • Read PR changes
  • Review perplexity_gap.py modifications
  • Review exp5056_raw_web_markup_ppl.py
  • Review test file
  • Verify the hardcoded-"validation" bug against actual levanter code
  • Post review

Comment on lines +65 to +75
def test_svg_stack_slice_materializes_as_hf_dataset_component() -> None:
    datasets = raw_web_markup.raw_web_markup_raw_validation_sets()
    component = _to_dataset_component(
        datasets[os.path.join(raw_web_markup.RAW_WEB_MARKUP_PREFIX, "svg_stack", "svg_xml_test")]
    )

    assert isinstance(component.source, HfDatasetSourceConfig)
    assert component.source.id == "starvector/svg-stack"
    assert component.source.splits == ["test"]
    assert component.format.text_key == "Svg"
    assert component.tags == ["raw_web_markup", "issue:5056", "source:svg_stack", "surface:svg_xml", "split:test"]
Contributor

This test only asserts component.source.splits == ["test"], which is the value you set, not the value the gap finder actually uses. At runtime levanter/analysis/perplexity_gap.py:434 calls component.source.get_shard_source("validation") — the literal string — so with splits=["test"] (or ["val"]) this returns None and the slice is silently skipped.

Would be worth adding a test that invokes iter_raw_text_documents (or at least component.source.get_shard_source("validation")) against an HfDatasetSourceConfig built from a non-validation split, so the next regression of this kind fails loudly.

Comment on lines +43 to +58
ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, RawTextEvaluationDataset] = {
    os.path.join("svg_stack", "svg_xml_val"): _hf_raw_web_markup_dataset(
        SVG_STACK_DATASET,
        text_key=SVG_TEXT_KEY,
        split="val",
        source_tag=SVG_STACK_SOURCE_TAG,
        surface_tag=SVG_XML_SURFACE_TAG,
    ),
    os.path.join("svg_stack", "svg_xml_test"): _hf_raw_web_markup_dataset(
        SVG_STACK_DATASET,
        text_key=SVG_TEXT_KEY,
        split="test",
        source_tag=SVG_STACK_SOURCE_TAG,
        surface_tag=SVG_XML_SURFACE_TAG,
    ),
}
Contributor

os.path.join here produces svg_stack/svg_xml_val on Linux, which matches the existing paloma/uncheatable conventions, so fine — but note these are logical dictionary keys, not filesystem paths. A literal "svg_stack/svg_xml_val" (or a small helper like _slice_key(source, name)) would read more clearly and wouldn't suggest filesystem semantics. Optional stylistic nit; feel free to ignore.
