[evals] Add capped issue 5005 gap-analysis launcher by dlwh · Pull Request #5123 · marin-community/marin

dlwh · 2026-04-23T16:17:55Z

Add a capped Marin 32B vs Qwen3 32B gap-analysis launcher that composes the first-wave log, diff/patch, robustness, ASR/OCR, and GH Archive eval slices into one combined report, and harden GH Archive URL normalization for malformed payloads seen in the run. Kept as draft because it depends on the child eval wiring PRs.

Part of #5005
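
The description mentions hardening GH Archive URL normalization against malformed payloads but does not show the change. As a minimal sketch only, assuming the failure mode is URL fields that arrive as non-strings, empty strings, or unparseable values, a defensive normalizer might look like the following. The function name `normalize_url_field` and its accept/reject rules are hypothetical and are not taken from the PR.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalize_url_field(value: object) -> Optional[str]:
    """Return a cleaned URL string, or None if the field is unusable.

    Hypothetical helper: the real normalization in the PR may differ.
    """
    # Malformed payloads may carry None, numbers, dicts, or lists
    # where a URL string is expected.
    if not isinstance(value, str):
        return None
    candidate = value.strip()
    if not candidate:
        return None
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit raises ValueError for inputs such as unbalanced IPv6 brackets.
        return None
    # Assumption: only absolute http(s) URLs are worth keeping.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return candidate
```

Returning `None` instead of raising lets a caller drop or blank the field and keep processing the rest of the event, which is one plausible way to keep a capped run from aborting on a single bad record.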

Adds the diff_patch/<slice> namespace and an empty ACTIVE_DIFF_PATCH_DATASETS registry so slice builders for raw git diff, commit-msg-plus-diff, PR-review-plus-diff, and issue-to-patch sources can be populated in follow-ups. Mirrors the scaffolding pattern from #5084. Part of #5095
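
To illustrate the scaffolding pattern described above, here is a rough sketch of an empty registry keyed by `diff_patch/<slice>` that follow-up slice builders could populate. Only the names `diff_patch` and `ACTIVE_DIFF_PATCH_DATASETS` come from the description; the `DiffPatchSlice` dataclass, the `register_slice` helper, the dict shape of the registry, and the example path are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict

DIFF_PATCH_NAMESPACE = "diff_patch"


@dataclass(frozen=True)
class DiffPatchSlice:
    """Hypothetical metadata record for one diff/patch slice."""

    name: str      # fully qualified name, e.g. "diff_patch/raw_git_diff"
    raw_root: str  # assumed location of the raw data for this slice


# Starts empty; follow-up changes would register concrete slices.
# Assumption: the registry is a dict keyed by the namespaced slice name.
ACTIVE_DIFF_PATCH_DATASETS: Dict[str, DiffPatchSlice] = {}


def register_slice(slice_name: str, raw_root: str) -> DiffPatchSlice:
    """Add a slice under the diff_patch/<slice> namespace (hypothetical helper)."""
    key = f"{DIFF_PATCH_NAMESPACE}/{slice_name}"
    if key in ACTIVE_DIFF_PATCH_DATASETS:
        raise ValueError(f"slice already registered: {key}")
    entry = DiffPatchSlice(name=key, raw_root=raw_root)
    ACTIVE_DIFF_PATCH_DATASETS[key] = entry
    return entry
```

A follow-up could then call something like `register_slice("raw_git_diff", "raw/diff_patch/raw_git_diff")`, where the path is a placeholder. Shipping the registry empty keeps the namespace reviewable on its own before any data source is wired in.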

Register a metadata-only DIAGNOSTIC_LOGS family covering the failure registers agents read during debugging: GHALogs, LogChunks, LogHub (Apache / Linux / HDFS / OpenSSH / Thunderbird), pytest failures, compiler/linker output, exception stack traces, package-install errors, and a held-out sanitized Marin-internal slice. Slices use deterministic raw_root paths and stay outside default_raw_validation_sets, so they only enter gap reports when an experiment opts in via long_tail_raw_validation_sets(family=...). The Marin-internal slice carries an inline leakage policy: scrub PII, secrets, hostnames, bucket names, and tokens before mirroring; never include in any training mixture.
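
The opt-in behavior described above can be sketched as follows: long-tail slices live in a separate table and only join the validation sets when an experiment names their family. This is a simplified stand-in, not the PR's code. `long_tail_sets` plays the role of `long_tail_raw_validation_sets(family=...)`, `default_sets` plays the role of `default_raw_validation_sets`, and every slice name and path shown is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class RawSlice:
    """Hypothetical metadata-only slice record."""

    family: str
    name: str
    raw_root: str  # deterministic path derived from family and name


# Placeholder entries; the real families and slices are defined in the PR.
DEFAULT_SLICES: Tuple[RawSlice, ...] = (
    RawSlice("default", "example_default", "raw/default/example_default"),
)
LONG_TAIL_SLICES: Tuple[RawSlice, ...] = (
    RawSlice("diagnostic_logs", "loghub_apache", "raw/diagnostic_logs/loghub_apache"),
    RawSlice("diagnostic_logs", "pytest_failures", "raw/diagnostic_logs/pytest_failures"),
)


def default_sets() -> Dict[str, RawSlice]:
    """Slices every gap report includes (stand-in for the default set)."""
    return {s.name: s for s in DEFAULT_SLICES}


def long_tail_sets(family: str) -> Dict[str, RawSlice]:
    """Slices for one opt-in family, keyed as <family>/<name>."""
    return {f"{s.family}/{s.name}": s for s in LONG_TAIL_SLICES if s.family == family}


def build_validation_sets(opt_in_families: Iterable[str] = ()) -> Dict[str, RawSlice]:
    """Start from the defaults and add only the families explicitly requested."""
    sets = dict(default_sets())
    for family in opt_in_families:
        sets.update(long_tail_sets(family))
    return sets
```

With no arguments, `build_validation_sets()` returns only the default slices; passing `["diagnostic_logs"]` adds that family. Keeping the long-tail table separate from the defaults is what makes the held-out slices impossible to pick up by accident.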


claude · 2026-04-23T16:18:39Z

Claude Code review in progress

Check PR status (skip conditions)
Find relevant CLAUDE.md/AGENTS.md files
Summarize PR changes
Run 4 parallel review agents (compliance + bugs)
Validate flagged issues
Post review summary or inline comments

View job run · branch codex/gap-marin32-qwen3-5093-5098

github-actions Bot and others added 20 commits April 22, 2026 23:49

[evals] Cover diagnostic-log opt-in and gap conversion

756f2a2

[evals] Add diff/patch raw slice builders and leakage checks

5279a99

[evals] Add held-out sample caps for diff/patch sources

9969ff9

[evals] Add capped ASR/OCR noisy-text PPL wiring

b5b9e68

[evals] Add GH Archive structured-output PPL eval wiring

87ccaf8

[evals] Add capped paired robustness PPL slices

b511150

[data] Add sample-capped diagnostic log sourcing for #5094

794b09b

[evals] Add capped diagnostic-log eval materializers

95b7dd6

Merge remote-tracking branch 'refs/remotes/pr/5103' into codex/gap-marin32-qwen3-5093-5098

746b550

Merge remote-tracking branch 'refs/remotes/pr/5104' into codex/gap-marin32-qwen3-5093-5098

8abbebf

Merge remote-tracking branch 'refs/remotes/pr/5118' into codex/gap-marin32-qwen3-5093-5098

e7fd928

Merge remote-tracking branch 'refs/remotes/pr/5119' into codex/gap-marin32-qwen3-5093-5098

1e5f26c

Merge remote-tracking branch 'refs/remotes/pr/5120' into codex/gap-marin32-qwen3-5093-5098

0a53247

Merge remote-tracking branch 'refs/remotes/pr/5121' into codex/gap-marin32-qwen3-5093-5098

8b0903e

Add capped 5005 gap run launcher

7140bcd

Fix capped gap run Flores materialization

c2d007c

Run capped gap eval as one combined report

24ea95f

Handle malformed GH Archive URL fields

a2f4b8c

dlwh added the agent-generated label (Created by automation/agent) Apr 23, 2026

dlwh mentioned this pull request Apr 23, 2026

[evals] Build pretraining checkpoint confidence portfolio #5005

Open

dlwh wants to merge 20 commits into main from codex/gap-marin32-qwen3-5093-5098
