[evals] Add capped issue 5005 gap-analysis launcher #5123
Draft
Register a metadata-only DIAGNOSTIC_LOGS family covering the failure registers agents read during debugging: GHALogs, LogChunks, LogHub (Apache / Linux / HDFS / OpenSSH / Thunderbird), pytest failures, compiler/linker output, exception stack traces, package-install errors, and a held-out sanitized Marin-internal slice. Slices use deterministic raw_root paths and stay outside default_raw_validation_sets, so they only enter gap reports when an experiment opts in via long_tail_raw_validation_sets(family=...). The Marin-internal slice carries an inline leakage policy: scrub PII, secrets, hostnames, bucket names, and tokens before mirroring, and never include the slice in any training mixture.
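The opt-in behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual registry: the `RawSlice` dataclass, the `raw/{family}/{name}` path layout, the slice names, and the function signatures are all assumptions made for the example. Only the names `DIAGNOSTIC_LOGS`, `raw_root`, `default_raw_validation_sets`, and `long_tail_raw_validation_sets(family=...)` come from the commit message.

```python
from dataclasses import dataclass

DIAGNOSTIC_LOGS = "diagnostic_logs"  # hypothetical family key


@dataclass(frozen=True)
class RawSlice:
    """Metadata-only description of one eval slice (hypothetical shape)."""
    name: str
    family: str
    raw_root: str


def _make_slice(family: str, name: str) -> RawSlice:
    # Deterministic path: derived only from family and slice name,
    # so repeated registrations always resolve to the same location.
    return RawSlice(name=name, family=family, raw_root=f"raw/{family}/{name}")


# Long-tail slices are registered by family and kept out of the defaults.
# The slice names here are illustrative placeholders.
_LONG_TAIL = {
    DIAGNOSTIC_LOGS: [
        _make_slice(DIAGNOSTIC_LOGS, n)
        for n in ("ghalogs", "logchunks", "loghub_apache", "pytest_failures")
    ],
}

# The real default sets are not shown in this PR; left empty for the sketch.
_DEFAULT: list = []


def default_raw_validation_sets() -> dict:
    """Slices that every gap report includes without opting in."""
    return {s.name: s for s in _DEFAULT}


def long_tail_raw_validation_sets(*, family: str) -> dict:
    """Slices returned only when an experiment asks for a family by name."""
    return {s.name: s for s in _LONG_TAIL.get(family, [])}


# An experiment opts in explicitly by merging the family into its report sets.
report_sets = {
    **default_raw_validation_sets(),
    **long_tail_raw_validation_sets(family=DIAGNOSTIC_LOGS),
}
```

Keeping the family lookup separate from the defaults means an experiment that never calls `long_tail_raw_validation_sets` cannot pick up these slices by accident.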
…rin32-qwen3-5093-5098

Add a capped Marin 32B vs Qwen3 32B gap-analysis launcher that composes the first-wave log, diff/patch, robustness, ASR/OCR, and GH Archive eval slices into one combined report, and harden GH Archive URL normalization for malformed payloads seen in the run. Kept as draft because it depends on the child eval wiring PRs.
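The kind of URL hardening mentioned above might look like the sketch below. It is an assumption about what "malformed payloads" could involve (non-string values, blank strings, scheme-relative URLs, unparseable hosts), not the normalization this PR actually ships. The function name `normalize_url` is hypothetical; only standard-library `urllib.parse` calls are used.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(value: object) -> Optional[str]:
    """Return a canonical http(s) URL, or None if the payload is unusable."""
    # Event payloads may carry null, numbers, or nested objects where a URL
    # is expected; reject anything that is not a string.
    if not isinstance(value, str):
        return None
    text = value.strip()
    if not text:
        return None
    # Treat scheme-relative URLs as https.
    if text.startswith("//"):
        text = "https:" + text
    try:
        parts = urlsplit(text)
    except ValueError:
        # urlsplit raises ValueError for inputs such as an unclosed
        # IPv6 bracket in the host.
        return None
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    # Lowercase the host, drop trailing slashes and the fragment.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path.rstrip("/"), parts.query, "")
    )
```

Returning `None` instead of raising lets the caller skip or count bad records without aborting the whole run.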
Part of #5005