[evals] Add GH Archive structured-output PPL evals by dlwh · Pull Request #5119 · marin-community/marin

dlwh · 2026-04-23T08:44:45Z

Normalize a small held-out GH Archive slice into stable masked JSON text records split by event type, and wire an opt-in raw validation registry plus perplexity-gap experiment entrypoint. Add synthetic tests for filtering, masking, serialization stability, optional WorkflowRun handling, and step hash coverage.

Fixes #5098

claude · 2026-04-23T08:45:06Z

Claude finished @dlwh's task in 7m 4s —— View job

Code review

🤖 No issues found. Checked for bugs and CLAUDE.md compliance.

Reviewed:

experiments/evals/gh_archive_structured_output.py
experiments/exp_model_perplexity_gap_gh_archive_structured_output.py
lib/marin/src/marin/datakit/download/gh_archive.py
tests/evals/test_gh_archive_structured_output.py

Against: root CLAUDE.md / AGENTS.md, experiments/AGENTS.md, lib/marin/AGENTS.md.

Candidate concerns that were investigated and dismissed:

make_gh_archive_step thin wrapper — matches the established pattern make_uncheatable_eval_step in the same directory; not an unjustified abstraction.
Dual-optional raw_root / gh_archive_raw params — the CLAUDE.md X | str rule targets a single union parameter, not two mutually-exclusive optionals.
Hash-attrs test — not tautological; it enforces that window/cap fields participate in the step's identity, so dropping any one would produce a real regression.
Masking determinism, idempotency, and cap enforcement — sort_keys=True + sorted parse_qsl/urlencode + ExitStack-ordered atomic renames are internally consistent.
Check if PR needs review
Identify relevant CLAUDE.md/AGENTS.md files
Summarize PR changes
Run parallel review agents
Validate flagged issues
Post review findings
• Branch: agent/20260423-fix-5098

ravwojdyla

Some comments, but nothing that blocks

ravwojdyla · 2026-05-01T16:34:35Z

+    )
+
+
+def make_gh_archive_step(


why are we redefining the function here, as opposed to using gh_archive_step and just calling as_executor_step on the result if/when we need it?

🤖 Fixed. I removed make_gh_archive_step and switched the eval module to call gh_archive_step(...).as_executor_step() directly at the dataset boundary.

ravwojdyla · 2026-05-01T16:40:12Z

+    resolved_event_types = tuple(dict.fromkeys(event_types))
+
+    def _run(output_path: str) -> dict[str, Any]:
+        cfg = GhArchiveDownloadConfig(


I need to update the agent skills to stop using this config dataclass pattern where it doesn't make sense

🤖 Fixed. I dropped the local config dataclass and made download_gh_archive_events(...) take explicit keyword arguments, so gh_archive_step now just threads its own params straight through.

ravwojdyla · 2026-05-01T16:56:43Z

+
+    event_types = tuple(cfg.event_types)
+    counts = {event_type: 0 for event_type in event_types}
+    output_files = {event_type: _event_output_path(cfg.output_path, event_type) for event_type in event_types}


it would be more consistent with other pipelines to use zephyr to orchestrate the download and skipped files etc.

🤖 I kept the step itself sequential for now. This branch only stages a two-hour held-out window, writes one file per event type, and enforces stateful per-event-type caps across hours, so a real Zephyr fanout here would add an extra hour-shard/merge layer without changing the actual eval surface. I did take the narrower cleanup in this PR (no wrapper step/config and fsspec-based hourly reads). If we promote GH Archive beyond this tiny held-out slice, I agree the next version should fan out by hour via Zephyr.

ravwojdyla · 2026-05-01T16:58:03Z

+
+def read_gh_archive_hour(url: str, timeout: int) -> Iterator[dict[str, Any]]:
+    """Yield GH Archive events from one hourly ``.json.gz`` file."""
+    with requests.get(url, timeout=timeout, stream=True) as response:


nit: you could probably use fsspec here

🤖 Took this one. read_gh_archive_hour now uses open_url(..., timeout=...), so the hourly fetch path is on the same fsspec/open_url stack as the rest of the file.

[evals] Add GH Archive structured-output PPL eval wiring

87ccaf8

dlwh added the agent-generated Created by automation/agent label Apr 23, 2026

This was referenced Apr 23, 2026

[evals] Add GH Archive structured-output PPL evals #5098

Closed

[evals] Build pretraining checkpoint confidence portfolio #5005

Open

Use shared cache builder for GH Archive slices

77a0b6b

claude Bot mentioned this pull request Apr 30, 2026

[docs] Document perplexity-gap workflow after it stabilizes #5318

Open

Trim weak GH Archive structured output tests

6472c8f

dlwh requested a review from ravwojdyla May 1, 2026 00:38

ravwojdyla approved these changes May 1, 2026

View reviewed changes

dlwh added 3 commits May 1, 2026 13:52

Fix GH Archive test import order

b0bd929

Simplify GH Archive step wiring

d46ecec

Make GH Archive eval wiring StepSpec-native

82fab9a

dlwh merged commit c360f46 into main May 1, 2026
36 checks passed

dlwh deleted the agent/20260423-fix-5098 branch May 1, 2026 23:10

dlwh mentioned this pull request May 1, 2026

[data] Add GH Archive JSON events to pretraining data #5099

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[evals] Add GH Archive structured-output PPL evals#5119

[evals] Add GH Archive structured-output PPL evals#5119
dlwh merged 6 commits intomainfrom
agent/20260423-fix-5098

dlwh commented Apr 23, 2026

Uh oh!

claude Bot commented Apr 23, 2026 •

edited

Loading

Uh oh!

ravwojdyla left a comment

Uh oh!

ravwojdyla May 1, 2026

Uh oh!

dlwh May 1, 2026

Uh oh!

ravwojdyla May 1, 2026

Uh oh!

dlwh May 1, 2026

Uh oh!

ravwojdyla May 1, 2026

Uh oh!

dlwh May 1, 2026

Uh oh!

ravwojdyla May 1, 2026

Uh oh!

dlwh May 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dlwh commented Apr 23, 2026

Uh oh!

claude Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review

Uh oh!

ravwojdyla left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Apr 23, 2026 •

edited

Loading