[evals] Add diagnostic log-stream PPL eval sets #5093

🤖 Part of #5005.

## Description
Add held-out perplexity (PPL) and gap eval sets for the diagnostic streams that agents read while debugging: CI logs, compiler/linker output, pytest failures, stack traces, package install errors, and system/application logs.

Initial sources:
- GHALogs: https://zenodo.org/records/14796970
- LogChunks: https://papertalk.org/papertalks/24713
- LogHub: https://github.com/logpai/loghub
- Sanitized Marin GitHub Actions/Iris/Zephyr logs, used as eval-only held-out data.
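
For the Marin-owned logs, a minimal sanitization sketch might look like the following. The rule list, the placeholder strings, and the names `sanitize_line` and `sanitize_log` are all hypothetical and only illustrate the idea; a real pass would need a reviewed, broader set of patterns and a manual spot check.

```python
import re

# Hypothetical redaction rules, applied in order. These are illustrative only:
# the IPv4 pattern, for example, will also match four-part version strings.
_RULES = [
    # GitHub-style tokens (ghp_, gho_, ghu_, ghs_, ghr_ prefixes).
    (re.compile(r"gh[pousr]_[A-Za-z0-9]{36,}"), "<GITHUB_TOKEN>"),
    # Email addresses.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    # Dotted-quad IPv4 addresses.
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IPV4>"),
    # User names embedded in home-directory paths.
    (re.compile(r"/(home|Users)/[^/\s]+"), r"/\1/<USER>"),
]


def sanitize_line(line: str) -> str:
    """Apply every redaction rule to a single log line."""
    for pattern, placeholder in _RULES:
        line = pattern.sub(placeholder, line)
    return line


def sanitize_log(text: str) -> str:
    """Sanitize a whole log, preserving the line structure."""
    return "\n".join(sanitize_line(line) for line in text.splitlines())
```

Keeping the placeholders distinct per rule (rather than one generic marker) preserves some of the log's structure, which matters if the point of the eval is to measure how well a model predicts realistic diagnostic text.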

Keep this separate from source-code evals. These are not code files; they are failure/diagnostic registers.
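
One way to keep these sets apart from source-code evals is a separate registry that is never merged into the default validation list. The sketch below assumes nothing about the actual Marin eval API: `EvalSlice`, `DEFAULT_VALIDATION_SETS`, `DIAGNOSTIC_LOG_SETS`, and `register_diagnostic_slice` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSlice:
    """Hypothetical description of one eval slice."""
    name: str
    source: str
    register: str  # e.g. "source_code" or "diagnostic_log"


# Hypothetical registries: the default list used by every run, and an
# opt-in registry for diagnostic log streams.
DEFAULT_VALIDATION_SETS: dict[str, EvalSlice] = {}
DIAGNOSTIC_LOG_SETS: dict[str, EvalSlice] = {}


def register_diagnostic_slice(name: str, source: str) -> EvalSlice:
    """Register a log-stream slice without touching the default list."""
    if name in DEFAULT_VALIDATION_SETS or name in DIAGNOSTIC_LOG_SETS:
        raise ValueError(f"eval slice {name!r} is already registered")
    slice_ = EvalSlice(name=name, source=source, register="diagnostic_log")
    DIAGNOSTIC_LOG_SETS[name] = slice_
    return slice_


register_diagnostic_slice("ghalogs", "https://zenodo.org/records/14796970")
register_diagnostic_slice("loghub", "https://github.com/logpai/loghub")
```

With this layout, a run has to ask for the diagnostic slices explicitly, so adding them cannot change existing validation numbers.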

### Definition of Done
- Add raw eval dataset builders for at least two public log sources.
- Add a small sanitized Marin-internal eval slice, or document why it is deferred.
- Register slices outside the default validation set list.
- Include worst-doc artifacts in gap reports.
- Document leakage/contamination handling for any Marin-owned logs.
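
The worst-doc artifacts could be produced along these lines. This is a sketch under two assumptions that the issue does not confirm: that "gap" means the per-document difference in loss between two models, and that loss is normalized to bits per byte so documents of different lengths and tokenizations are comparable. `DocScore`, `bits_per_byte`, `doc_gap`, and `worst_docs` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DocScore:
    """Hypothetical per-document record from an eval run."""
    doc_id: str
    num_bytes: int
    nll_nats_a: float  # summed token negative log-likelihood under model A
    nll_nats_b: float  # the same quantity under model B


def bits_per_byte(nll_nats: float, num_bytes: int) -> float:
    """Convert a summed NLL in nats to bits per UTF-8 byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return nll_nats / (math.log(2) * num_bytes)


def doc_gap(score: DocScore) -> float:
    """Assumed gap definition: how much worse model A is than model B."""
    return (bits_per_byte(score.nll_nats_a, score.num_bytes)
            - bits_per_byte(score.nll_nats_b, score.num_bytes))


def worst_docs(scores: list[DocScore], k: int = 10) -> list[DocScore]:
    """Return the k documents with the largest gap, worst first."""
    return sorted(scores, key=doc_gap, reverse=True)[:k]
```

Writing the returned documents (IDs, gaps, and a text excerpt) next to the aggregate numbers makes it easier to see which kind of log output drives a regression.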

