[evals] Add diagnostic log-stream PPL eval sets #5093

@dlwh

Description

🤖 Part of #5005.

Add held-out PPL/gap eval sets for diagnostic streams that agents read during debugging: CI logs, compiler/linker output, pytest failures, stack traces, package install errors, and system/application logs.

Initial sources:

Keep this separate from source-code evals. These are not code files; they are failure/diagnostic registers.

Definition of Done

  • Add raw eval dataset builders for at least two public log sources.
  • Add a small sanitized Marin-internal eval slice, or document why it is deferred.
  • Register slices outside the default validation set list.
  • Include worst-doc artifacts in gap reports.
  • Document leakage/contamination handling for any Marin-owned logs.
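A minimal sketch of two of the items above, under stated assumptions: sanitizing log text before it enters an eval slice, and picking worst-doc artifacts for a gap report. Nothing here is Marin's actual API. The names `sanitize_log_text`, `DocScore`, and `worst_docs` are hypothetical, the redaction patterns are illustrative rather than a reviewed list, and "gap" is assumed to mean the per-token negative log-likelihood difference between the evaluated model and a reference model.

```python
import math
import re
from dataclasses import dataclass

# Illustrative redaction patterns only (assumption): a real sanitizer for
# internal logs would need a reviewed, project-specific list.
_REDACTIONS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "<EMAIL>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
    (
        re.compile(r"(?i)\b(token|secret|password|api[_-]?key)\s*[=:]\s*\S+"),
        r"\1=<REDACTED>",
    ),
]


def sanitize_log_text(text: str) -> str:
    """Replace emails, IPv4-looking addresses, and key=value secrets."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


@dataclass(frozen=True)
class DocScore:
    """Per-document losses (hypothetical record layout)."""

    doc_id: str
    nll_model: float      # summed NLL in nats under the evaluated model
    nll_reference: float  # summed NLL in nats under the reference model
    num_tokens: int

    @property
    def gap_per_token(self) -> float:
        # Assumed definition of "gap": model NLL minus reference NLL, per token.
        return (self.nll_model - self.nll_reference) / self.num_tokens


def perplexity(total_nll: float, total_tokens: int) -> float:
    """Perplexity from a summed NLL (nats) and a token count."""
    return math.exp(total_nll / total_tokens)


def worst_docs(scores: list[DocScore], k: int = 10) -> list[DocScore]:
    """Return the k documents with the largest per-token gap, worst first."""
    return sorted(scores, key=lambda s: s.gap_per_token, reverse=True)[:k]
```

Ranking by per-token gap rather than summed gap keeps very long logs from dominating the worst-doc list; whether that is the right normalization for these slices is a judgment call for the gap report.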
