[evals] Add SVG-backed raw web markup PPL slices#5130
Conversation
Skeleton for #5056 (parent #5005). Adds an empty aggregator module for surface-preserving PPL eval slices (HTML, WARC/WAT, web tables, SVG XML, OCR strings, captions, alt-text) and wires it into `default_raw_validation_sets()` so follow-up PRs can register downloaders without touching the eval runner. Slices are oversplit by surface form; see the module docstring for the design rationale.

Co-authored-by: David Hall <dlwh@users.noreply.github.com>

Claude finished @dlwh's task in 3m 38s

**PR Review Summary.** The PR's shape is right (HF-backed slices registered behind an opt-in helper, keeping defaults unchanged), but there is one 🔴 Critical issue:
```python
def test_svg_stack_slice_materializes_as_hf_dataset_component() -> None:
    datasets = raw_web_markup.raw_web_markup_raw_validation_sets()
    component = _to_dataset_component(
        datasets[os.path.join(raw_web_markup.RAW_WEB_MARKUP_PREFIX, "svg_stack", "svg_xml_test")]
    )

    assert isinstance(component.source, HfDatasetSourceConfig)
    assert component.source.id == "starvector/svg-stack"
    assert component.source.splits == ["test"]
    assert component.format.text_key == "Svg"
    assert component.tags == ["raw_web_markup", "issue:5056", "source:svg_stack", "surface:svg_xml", "split:test"]
```
This test only asserts `component.source.splits == ["test"]`, which is the value you just set, not the value the gap finder actually uses. At runtime, `levanter/analysis/perplexity_gap.py:434` calls `component.source.get_shard_source("validation")` with that literal string, so with `splits=["test"]` (or `["val"]`) it returns `None` and the slice is silently skipped.

Would be worth adding a test that invokes `iter_raw_text_documents` (or at least `component.source.get_shard_source("validation")`) against an `HfDatasetSourceConfig` built from a non-validation split, so the next regression of this kind fails loudly.
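To make the failure mode concrete, here is an illustrative, self-contained sketch (not the real levanter API; `FakeHfSourceConfig` and `resolve_eval_split` are hypothetical stand-ins) of why asserting on the `splits` field alone misses the bug, and what a loud regression test would exercise instead:

```python
# Sketch only: a minimal stand-in for HfDatasetSourceConfig showing why
# splits=["test"] is silently skipped when the runner always asks for
# "validation". Names here are hypothetical, not levanter's real API.
from dataclasses import dataclass, field


@dataclass
class FakeHfSourceConfig:
    id: str
    splits: list[str] = field(default_factory=list)

    def get_shard_source(self, split: str):
        # Mirrors the reported behavior: returns None when the requested
        # split was not registered, so the caller silently skips the slice.
        if split not in self.splits:
            return None
        return f"{self.id}:{split}"


def resolve_eval_split(source: FakeHfSourceConfig) -> str:
    # The stronger test: asking for the slice's own split must actually
    # yield a shard source, not just match the config field we set.
    for split in source.splits:
        shard = source.get_shard_source(split)
        assert shard is not None, f"split {split!r} yielded no shard source"
        return shard
    raise ValueError("no splits registered")


src = FakeHfSourceConfig(id="starvector/svg-stack", splits=["test"])
print(resolve_eval_split(src))             # starvector/svg-stack:test
print(src.get_shard_source("validation"))  # None: the silent-skip bug
```

A real test in the PR would do the analogous check against the registered SVG-Stack component, so a runner hard-coded to `"validation"` fails loudly instead of skipping.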
```python
ACTIVE_RAW_WEB_MARKUP_DATASETS: dict[str, RawTextEvaluationDataset] = {
    os.path.join("svg_stack", "svg_xml_val"): _hf_raw_web_markup_dataset(
        SVG_STACK_DATASET,
        text_key=SVG_TEXT_KEY,
        split="val",
        source_tag=SVG_STACK_SOURCE_TAG,
        surface_tag=SVG_XML_SURFACE_TAG,
    ),
    os.path.join("svg_stack", "svg_xml_test"): _hf_raw_web_markup_dataset(
        SVG_STACK_DATASET,
        text_key=SVG_TEXT_KEY,
        split="test",
        source_tag=SVG_STACK_SOURCE_TAG,
        surface_tag=SVG_XML_SURFACE_TAG,
    ),
}
```
`os.path.join` here produces `svg_stack/svg_xml_val` on Linux, which matches the existing paloma/uncheatable conventions, so it is fine. But note these are logical dictionary keys, not filesystem paths: a literal `"svg_stack/svg_xml_val"` (or a small helper like `_slice_key(source, name)`) would read more clearly and would not suggest filesystem semantics. Optional stylistic nit; feel free to ignore.
Add a small first non-empty raw web/markup PPL cut by teaching the raw-text eval path to carry explicit Hugging Face splits and registering opt-in SVG-Stack validation and test slices. This keeps the work HF-backed, preserves exact SVG XML, and leaves default validation sets unchanged.
Part of #5005