
[evals] Add GH Archive structured-output PPL evals #5119

Open
dlwh wants to merge 1 commit into main from agent/20260423-fix-5098


@dlwh
Member

@dlwh dlwh commented Apr 23, 2026

Normalize a small held-out GH Archive slice into stable masked JSON text records split by event type, and wire an opt-in raw validation registry plus perplexity-gap experiment entrypoint. Add synthetic tests for filtering, masking, serialization stability, optional WorkflowRun handling, and step hash coverage.
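
As a rough illustration of the normalization described above, here is a minimal sketch of filtering events, masking volatile fields, and serializing each record deterministically before splitting by event type. It is not the PR's implementation: `KEPT_EVENT_TYPES`, `MASKED_KEYS`, the `<MASKED>` placeholder, and the helper names `mask` and `normalize` are all hypothetical, chosen only to show the shape of the approach.

```python
import json
from collections import defaultdict

# Hypothetical choices for illustration; the PR's real allowlist and mask set may differ.
KEPT_EVENT_TYPES = {"PushEvent", "IssuesEvent"}
MASKED_KEYS = {"id", "created_at"}
MASK = "<MASKED>"


def mask(value):
    """Recursively replace values stored under volatile keys with a placeholder."""
    if isinstance(value, dict):
        return {k: MASK if k in MASKED_KEYS else mask(v) for k, v in value.items()}
    if isinstance(value, list):
        return [mask(v) for v in value]
    return value


def normalize(events):
    """Drop unwanted event types, then group masked JSON text records by event type."""
    by_type = defaultdict(list)
    for event in events:
        event_type = event.get("type")
        if event_type not in KEPT_EVENT_TYPES:
            continue
        # sort_keys=True makes the text independent of the input key order.
        by_type[event_type].append(json.dumps(mask(event), sort_keys=True, ensure_ascii=False))
    return dict(by_type)
```

Because keys are sorted at serialization time, two events that differ only in key order produce the same text record, which is the property a serialization-stability test would check.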

Fixes #5098

@dlwh dlwh added the agent-generated Created by automation/agent label Apr 23, 2026
@claude
Contributor

claude Bot commented Apr 23, 2026

Claude finished @dlwh's task in 7m 4s.


Code review

🤖 No issues found. Checked for bugs and CLAUDE.md compliance.

Reviewed:

  • experiments/evals/gh_archive_structured_output.py
  • experiments/exp_model_perplexity_gap_gh_archive_structured_output.py
  • lib/marin/src/marin/datakit/download/gh_archive.py
  • tests/evals/test_gh_archive_structured_output.py

Against: root CLAUDE.md / AGENTS.md, experiments/AGENTS.md, lib/marin/AGENTS.md.

Candidate concerns that were investigated and dismissed:

  • make_gh_archive_step thin wrapper — matches the established pattern make_uncheatable_eval_step in the same directory; not an unjustified abstraction.

  • Dual-optional raw_root / gh_archive_raw params — the CLAUDE.md X | str rule targets a single union parameter, not two mutually-exclusive optionals.

  • Hash-attrs test — not tautological; it enforces that window/cap fields participate in the step's identity, so dropping any one would produce a real regression.

  • Masking determinism, idempotency, and cap enforcement: sort_keys=True + sorted parse_qsl/urlencode + ExitStack-ordered atomic renames are internally consistent.
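
The determinism and idempotency argument in the last bullet can be sketched with the standard library alone. This is only an illustration of the sorted `parse_qsl`/`urlencode` idea, not the code under review: the helper name `canonicalize_url` and the example URLs in the usage note are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    """Return the URL with its query parameters re-encoded in sorted order."""
    parts = urlsplit(url)
    # keep_blank_values=True so parameters like "?flag=" are not silently dropped.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))
```

Applying the function twice gives the same result as applying it once, so a masking pass built on it can be re-run over already-normalized records without changing them. For example, `canonicalize_url("https://example.com/x?b=2&a=1")` yields a query of `a=1&b=2` regardless of the original parameter order.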

  • Check if PR needs review

  • Identify relevant CLAUDE.md/AGENTS.md files

  • Summarize PR changes

  • Run parallel review agents

  • Validate flagged issues

  • Post review findings
    Branch: agent/20260423-fix-5098

