🤖 Part of #5005.
Description
Add held-out PPL/gap eval sets for diagnostic streams that agents read during debugging: CI logs, compiler/linker output, pytest failures, stack traces, package install errors, and system/application logs.
Initial sources:
Keep this separate from source-code evals. These are not code files; they are failure/diagnostic registers.
Definition of Done
- Add raw eval dataset builders for at least two public log sources.
- Add a small sanitized Marin-internal eval slice, or document why it is deferred.
- Register slices outside the default validation set list.
- Include worst-doc artifacts in gap reports.
- Document leakage/contamination handling for any Marin-owned logs.
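
For the sanitized Marin-internal slice, one possible starting point is a regex-based redaction pass over each log before it enters the eval set. The sketch below is illustrative only: the function names (`sanitize_line`, `sanitize_log`), the placeholder tokens, and the choice of patterns are all assumptions, not an existing Marin API, and a real pass would need review against the actual logs.

```python
import re

# Hypothetical redaction rules: (pattern, placeholder). Order matters, since
# emails are replaced before the more generic patterns run.
_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (re.compile(r"\b[0-9a-fA-F]{32,}\b"), "<HEX>"),
    (re.compile(r"/(?:home|Users)/[^/\s]+"), "/home/<USER>"),
]


def sanitize_line(line: str) -> str:
    """Apply every redaction rule to a single log line."""
    for pattern, placeholder in _PATTERNS:
        line = pattern.sub(placeholder, line)
    return line


def sanitize_log(text: str) -> str:
    """Sanitize a whole log document line by line."""
    return "\n".join(sanitize_line(line) for line in text.splitlines())


if __name__ == "__main__":
    print(sanitize_line("ERROR alice@example.com from 10.0.0.12 at /home/alice/run.log"))
```

Note that the IPv4 pattern also matches four-part version strings such as `1.2.3.4`, so it over-redacts; for an eval slice that is probably the safer failure mode, but it does change the text the model is scored on and should be noted in the leakage/contamination documentation.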