[evals] Add capped issue 5005 gap-analysis launcher #5123

Draft
dlwh wants to merge 20 commits into main from codex/gap-marin32-qwen3-5093-5098

dlwh (Member) commented Apr 23, 2026

Add a capped Marin 32B vs Qwen3 32B gap-analysis launcher that composes the first-wave log, diff/patch, robustness, ASR/OCR, and GH Archive eval slices into one combined report. Also harden GH Archive URL normalization against the malformed payloads seen in the run. Kept as draft because it depends on the child eval-wiring PRs.

Part of #5005
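
For readers following the URL-normalization part of this change: the description does not show the hardened code, so the following is only a minimal sketch of what tolerant normalization could look like. The function name `normalize_url`, the choice to return `None` for unusable values, and the specific failure cases handled (non-string values, blank strings, unparseable or scheme-less URLs) are all assumptions, not taken from the diff.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: object) -> Optional[str]:
    """Return a cleaned http(s) URL, or None when the value is unusable.

    Hypothetical helper for illustration; not the function changed in this PR.
    """
    # Assumption: malformed payloads may carry None, numbers, or nested objects.
    if not isinstance(raw, str):
        return None
    candidate = raw.strip()
    if not candidate:
        return None
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit raises ValueError on inputs such as an unclosed IPv6 bracket.
        return None
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    # Lowercase the network location, drop the fragment, keep path and query.
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, parts.query, ""))
```

Returning `None` instead of raising lets a caller skip a bad record and keep processing the rest of the archive file, which is one plausible reading of "harden" here.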

github-actions Bot and others added 20 commits April 22, 2026 23:49
Adds the diff_patch/<slice> namespace and an empty ACTIVE_DIFF_PATCH_DATASETS
registry so slice builders for raw git diff, commit-msg-plus-diff,
PR-review-plus-diff, and issue-to-patch sources can be populated in follow-ups.
Mirrors the scaffolding pattern from #5084.

Part of #5095
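
As a rough picture of the scaffolding this commit describes, here is a minimal sketch of a namespaced, initially empty registry. Only `ACTIVE_DIFF_PATCH_DATASETS` and the `diff_patch/<slice>` key shape come from the commit message; `DiffPatchSlice`, `register_slice`, the `source` field, and the example slice name are hypothetical.

```python
from dataclasses import dataclass

DIFF_PATCH_NAMESPACE = "diff_patch"  # prefix taken from the diff_patch/<slice> naming


@dataclass(frozen=True)
class DiffPatchSlice:
    """Hypothetical descriptor for one diff/patch eval slice."""

    name: str
    source: str  # e.g. "raw git diff" or "issue-to-patch"

    @property
    def key(self) -> str:
        # Registry keys follow the diff_patch/<slice> pattern.
        return f"{DIFF_PATCH_NAMESPACE}/{self.name}"


# Empty on purpose: slice builders are expected to arrive in follow-ups.
ACTIVE_DIFF_PATCH_DATASETS: dict = {}


def register_slice(slice_: DiffPatchSlice) -> None:
    """Add a slice to the registry, refusing duplicate keys."""
    if slice_.key in ACTIVE_DIFF_PATCH_DATASETS:
        raise ValueError(f"slice already registered: {slice_.key}")
    ACTIVE_DIFF_PATCH_DATASETS[slice_.key] = slice_
```

Starting with an empty registry means downstream code can iterate over it safely before any slice exists.
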
Register a metadata-only DIAGNOSTIC_LOGS family covering the failure
registers agents read during debugging: GHALogs, LogChunks, LogHub
(Apache / Linux / HDFS / OpenSSH / Thunderbird), pytest failures,
compiler/linker output, exception stack traces, package-install
errors, and a held-out sanitized Marin-internal slice.
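
To make the shape of a metadata-only, opt-in family concrete, here is a minimal sketch. The names `DIAGNOSTIC_LOGS`, `raw_root`, `default_raw_validation_sets`, and `long_tail_raw_validation_sets(family=...)` appear in this commit message, but their real signatures and return types are not shown, so the versions below are toy stand-ins. `RawSlice`, the `raw/<family>/<slice>` path layout, the slice spellings, and the placeholder default slice are assumptions.

```python
from dataclasses import dataclass

DIAGNOSTIC_LOGS = "diagnostic_logs"  # family identifier; the value is an assumption


@dataclass(frozen=True)
class RawSlice:
    """Metadata only: describes where a slice lives, loads no data."""

    family: str
    name: str

    @property
    def raw_root(self) -> str:
        # Deterministic: derived purely from family and slice name.
        return f"raw/{self.family}/{self.name}"


# Hypothetical spellings for a few of the slices listed in the commit message.
_LONG_TAIL = [RawSlice(DIAGNOSTIC_LOGS, n) for n in ("ghalogs", "logchunks", "loghub_hdfs")]
_DEFAULT = [RawSlice("core", "placeholder_default")]  # stand-in for the existing defaults


def default_raw_validation_sets() -> dict:
    """Toy stand-in: never returns long-tail slices."""
    return {f"{s.family}/{s.name}": s for s in _DEFAULT}


def long_tail_raw_validation_sets(family: str) -> dict:
    """Toy stand-in: returns only the slices of the requested family."""
    return {f"{s.family}/{s.name}": s for s in _LONG_TAIL if s.family == family}


# An experiment opts in explicitly by merging the two mappings.
validation_sets = {
    **default_raw_validation_sets(),
    **long_tail_raw_validation_sets(family=DIAGNOSTIC_LOGS),
}
```
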

Slices use deterministic raw_root paths and stay outside
default_raw_validation_sets, so they only enter gap reports when an
experiment opts in via long_tail_raw_validation_sets(family=...).
The Marin-internal slice carries an inline leakage policy: scrub
PII, secrets, hostnames, bucket names, and tokens before mirroring;
never include in any training mixture.
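
The scrub step in the leakage policy could look roughly like the sketch below. The inline policy text itself is not shown here, so the rule list, the placeholder strings, the `.internal` hostname suffix, and the `scrub_line` helper are all assumptions. A regex pass like this is a first filter only and would not catch every secret or identifier.

```python
import re

# Each rule pairs a pattern with the placeholder that replaces a match.
# All patterns are illustrative assumptions, not the project's actual policy.
_SCRUB_RULES = [
    # Bucket names in gs:// or s3:// URIs (the object path is left in place).
    (re.compile(r"\b(?:gs|s3)://[A-Za-z0-9._-]+"), "<BUCKET>"),
    # Email addresses as one common kind of PII.
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "<EMAIL>"),
    # GitHub-style tokens (ghp_, gho_, ghu_, ghs_, ghr_ prefixes).
    (re.compile(r"\bgh[pousr]_[A-Za-z0-9]{20,}\b"), "<TOKEN>"),
    # IPv4 addresses.
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    # Hostnames under an assumed internal suffix.
    (re.compile(r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.internal\b"), "<HOST>"),
]


def scrub_line(line: str) -> str:
    """Apply every scrub rule to one log line, in order."""
    for pattern, placeholder in _SCRUB_RULES:
        line = pattern.sub(placeholder, line)
    return line
```

Scrubbing covers only the "before mirroring" half of the policy. Keeping the slice out of training mixtures has to be enforced separately, where mixtures are assembled.
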
dlwh added the agent-generated label ("Created by automation/agent") Apr 23, 2026
claude Bot (Contributor) commented Apr 23, 2026

Claude Code review in progress

  • Check PR status (skip conditions)
  • Find relevant CLAUDE.md/AGENTS.md files
  • Summarize PR changes
  • Run 4 parallel review agents (compliance + bugs)
  • Validate flagged issues
  • Post review summary or inline comments

View job run · branch codex/gap-marin32-qwen3-5093-5098
