Skip to content

[evals] Add GH Archive structured-output PPL evals#5119

Merged
dlwh merged 6 commits intomainfrom
agent/20260423-fix-5098
May 1, 2026
Merged

[evals] Add GH Archive structured-output PPL evals#5119
dlwh merged 6 commits intomainfrom
agent/20260423-fix-5098

Conversation

@dlwh
Copy link
Copy Markdown
Member

@dlwh dlwh commented Apr 23, 2026

Normalize a small held-out GH Archive slice into stable masked JSON text records split by event type, and wire an opt-in raw validation registry plus perplexity-gap experiment entrypoint. Add synthetic tests for filtering, masking, serialization stability, optional WorkflowRun handling, and step hash coverage.

Fixes #5098

@dlwh dlwh added the agent-generated Created by automation/agent label Apr 23, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 23, 2026

Claude finished @dlwh's task in 7m 4s —— View job


Code review

🤖 No issues found. Checked for bugs and CLAUDE.md compliance.

Reviewed:

  • experiments/evals/gh_archive_structured_output.py
  • experiments/exp_model_perplexity_gap_gh_archive_structured_output.py
  • lib/marin/src/marin/datakit/download/gh_archive.py
  • tests/evals/test_gh_archive_structured_output.py

Against: root CLAUDE.md / AGENTS.md, experiments/AGENTS.md, lib/marin/AGENTS.md.

Candidate concerns that were investigated and dismissed:

  • make_gh_archive_step thin wrapper — matches the established pattern make_uncheatable_eval_step in the same directory; not an unjustified abstraction.

  • Dual-optional raw_root / gh_archive_raw params — the CLAUDE.md X | str rule targets a single union parameter, not two mutually-exclusive optionals.

  • Hash-attrs test — not tautological; it enforces that window/cap fields participate in the step's identity, so dropping any one would produce a real regression.

  • Masking determinism, idempotency, and cap enforcementsort_keys=True + sorted parse_qsl/urlencode + ExitStack-ordered atomic renames are internally consistent.

  • Check if PR needs review

  • Identify relevant CLAUDE.md/AGENTS.md files

  • Summarize PR changes

  • Run parallel review agents

  • Validate flagged issues

  • Post review findings
    Branch: agent/20260423-fix-5098

@dlwh dlwh requested a review from ravwojdyla May 1, 2026 00:38
Copy link
Copy Markdown
Contributor

@ravwojdyla ravwojdyla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, but nothing that blocks

)


def make_gh_archive_step(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we redefining the function here, as opposed to using gh_archive_step and just calling as_executor_step on the result if/when we need it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Fixed. I removed make_gh_archive_step and switched the eval module to call gh_archive_step(...).as_executor_step() directly at the dataset boundary.

resolved_event_types = tuple(dict.fromkeys(event_types))

def _run(output_path: str) -> dict[str, Any]:
cfg = GhArchiveDownloadConfig(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to update the agent skills to stop using this config dataclass pattern where it doesn't make sense

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Fixed. I dropped the local config dataclass and made download_gh_archive_events(...) take explicit keyword arguments, so gh_archive_step now just threads its own params straight through.


event_types = tuple(cfg.event_types)
counts = {event_type: 0 for event_type in event_types}
output_files = {event_type: _event_output_path(cfg.output_path, event_type) for event_type in event_types}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be more consistent with other pipelines to use zephyr to orchestrate the download and skipped files etc.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 I kept the step itself sequential for now. This branch only stages a two-hour held-out window, writes one file per event type, and enforces stateful per-event-type caps across hours, so a real Zephyr fanout here would add an extra hour-shard/merge layer without changing the actual eval surface. I did take the narrower cleanup in this PR (no wrapper step/config and fsspec-based hourly reads). If we promote GH Archive beyond this tiny held-out slice, I agree the next version should fan out by hour via Zephyr.


def read_gh_archive_hour(url: str, timeout: int) -> Iterator[dict[str, Any]]:
"""Yield GH Archive events from one hourly ``.json.gz`` file."""
with requests.get(url, timeout=timeout, stream=True) as response:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could probably use fsspec here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Took this one. read_gh_archive_hour now uses open_url(..., timeout=...), so the hourly fetch path is on the same fsspec/open_url stack as the rest of the file.

@dlwh dlwh merged commit c360f46 into main May 1, 2026
36 checks passed
@dlwh dlwh deleted the agent/20260423-fix-5098 branch May 1, 2026 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[evals] Add GH Archive structured-output PPL evals

2 participants