[evals] Add GH Archive structured-output PPL evals #5119
Conversation
Claude finished @dlwh's task in 7m 4s — View job

Code review
🤖 No issues found. Checked for bugs and CLAUDE.md compliance.
Reviewed against: root. Candidate concerns were investigated and dismissed.
ravwojdyla left a comment:

Some comments, but nothing that blocks.
```python
def make_gh_archive_step(
```
Why are we redefining the function here, as opposed to using gh_archive_step and just calling as_executor_step on the result if/when we need it?
🤖 Fixed. I removed make_gh_archive_step and switched the eval module to call gh_archive_step(...).as_executor_step() directly at the dataset boundary.
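A toy sketch of the resulting call pattern. The real gh_archive_step and as_executor_step live in this repo and their signatures may differ; everything below is a hypothetical stand-in that only illustrates "reuse the existing constructor, convert at the call site" instead of keeping a redundant wrapper:

```python
# Hypothetical stand-ins for the PR's step API: instead of a second
# make_gh_archive_step wrapper, reuse gh_archive_step and convert the
# result to an executor step only where one is actually needed.
from dataclasses import dataclass


@dataclass
class Step:
    name: str

    def as_executor_step(self) -> str:
        # Stand-in for the real conversion; the real method returns an
        # executor step object rather than a string.
        return f"executor:{self.name}"


def gh_archive_step(name: str) -> Step:
    return Step(name=name)


# At the dataset boundary, convert directly at the call site:
executor_step = gh_archive_step("gh_archive").as_executor_step()
```

The wrapper disappears entirely; callers that never need an executor step just keep the plain Step.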
```python
resolved_event_types = tuple(dict.fromkeys(event_types))


def _run(output_path: str) -> dict[str, Any]:
    cfg = GhArchiveDownloadConfig(
```
I need to update the agent skills to stop using this config dataclass pattern where it doesn't make sense
🤖 Fixed. I dropped the local config dataclass and made download_gh_archive_events(...) take explicit keyword arguments, so gh_archive_step now just threads its own params straight through.
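A minimal sketch of the dataclass-to-kwargs change, assuming a simplified signature; the real download_gh_archive_events takes more parameters and does actual I/O, so this only shows the shape of the interface:

```python
# Sketch: explicit keyword arguments instead of a one-off config dataclass.
# Parameter names and the returned summary dict are illustrative.
from typing import Any


def download_gh_archive_events(
    output_path: str,
    event_types: tuple[str, ...],
    timeout: int = 60,
) -> dict[str, Any]:
    # Callers (like gh_archive_step) thread their own params straight
    # through, rather than packing them into a GhArchiveDownloadConfig first.
    return {
        "output_path": output_path,
        "event_types": event_types,
        "timeout": timeout,
    }


result = download_gh_archive_events(
    output_path="/tmp/gh_archive",
    event_types=("PushEvent", "PullRequestEvent"),
)
```

The upside is that the function's full surface is visible in its signature, and there is no config object whose fields can drift from the step's own parameters.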
```python
event_types = tuple(cfg.event_types)
counts = {event_type: 0 for event_type in event_types}
output_files = {event_type: _event_output_path(cfg.output_path, event_type) for event_type in event_types}
```
it would be more consistent with other pipelines to use zephyr to orchestrate the download and skipped files etc.
🤖 I kept the step itself sequential for now. This branch only stages a two-hour held-out window, writes one file per event type, and enforces stateful per-event-type caps across hours, so a real Zephyr fanout here would add an extra hour-shard/merge layer without changing the actual eval surface. I did take the narrower cleanup in this PR (no wrapper step/config and fsspec-based hourly reads). If we promote GH Archive beyond this tiny held-out slice, I agree the next version should fan out by hour via Zephyr.
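The stateful per-event-type cap is the part that makes a naive per-hour fanout awkward: counts must carry across hourly files, so once a type hits its cap in an earlier hour it is skipped for the rest of the window. A self-contained sketch of that accumulation (event records and the cap value are illustrative, not from the PR):

```python
# Sketch of per-event-type caps shared across hourly shards. Sharding by
# hour would force this running count into a merge step, which is why the
# download stays sequential for a two-hour window.
from collections import defaultdict
from typing import Any, Iterable


def take_capped(hours: Iterable[list[dict[str, Any]]], cap: int) -> list[dict[str, Any]]:
    counts: dict[str, int] = defaultdict(int)
    kept: list[dict[str, Any]] = []
    for events in hours:  # one list per hourly file, in order
        for event in events:
            etype = event["type"]
            if counts[etype] >= cap:
                continue  # cap was already reached, possibly in an earlier hour
            counts[etype] += 1
            kept.append(event)
    return kept


hours = [
    [{"type": "PushEvent"}, {"type": "PushEvent"}],
    [{"type": "PushEvent"}, {"type": "IssuesEvent"}],
]
kept = take_capped(hours, cap=2)  # hour-2 PushEvent is dropped by the carried count
```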
```python
def read_gh_archive_hour(url: str, timeout: int) -> Iterator[dict[str, Any]]:
    """Yield GH Archive events from one hourly ``.json.gz`` file."""
    with requests.get(url, timeout=timeout, stream=True) as response:
```
nit: you could probably use fsspec here
🤖 Took this one. read_gh_archive_hour now uses open_url(..., timeout=...), so the hourly fetch path is on the same fsspec/open_url stack as the rest of the file.
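GH Archive hourly files are gzip-compressed JSON Lines, one event per line. A minimal stdlib version of just the parse step (the PR routes the fetch itself through fsspec's open_url; the in-memory payload here is a stand-in for a downloaded hourly file):

```python
# Parse a GH Archive-style hourly .json.gz payload: gzip-compressed
# JSON Lines, one event object per line.
import gzip
import io
import json
from typing import Any, Iterator


def iter_jsonl_gz(raw: bytes) -> Iterator[dict[str, Any]]:
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate trailing blank lines
                yield json.loads(line)


# Build a tiny in-memory "hourly file" and parse it back:
payload = gzip.compress(b'{"type": "PushEvent"}\n{"type": "ForkEvent"}\n')
events = list(iter_jsonl_gz(payload))
```

With fsspec, the same decompression can happen transparently at open time (e.g. compression="gzip" on the open call), which keeps the read path uniform whether the source is a URL or a local file.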
Normalize a small held-out GH Archive slice into stable masked JSON text records split by event type, and wire an opt-in raw validation registry plus perplexity-gap experiment entrypoint. Add synthetic tests for filtering, masking, serialization stability, optional WorkflowRun handling, and step hash coverage.
Fixes #5098