[data] Source public diagnostic logs for training data #5094

@dlwh

Description

🤖 Standalone data-ingestion task.

Identify and prepare large public diagnostic-log corpora suitable for pretraining data, separate from held-out log-stream eval construction.

Candidate sources:

Do not train on held-out eval slices. Treat Marin-owned CI/Iris/Zephyr logs as eval-only until sanitization, governance, and leakage policy are agreed.
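One way to enforce the held-out rule mechanically is an exact-match filter keyed on normalized line hashes. The sketch below is illustrative only: the function names `line_key` and `drop_eval_overlap` are hypothetical, and whitespace/case normalization is an assumed policy, not something this issue specifies. It would not catch near-duplicates (for example, lines that differ only in timestamps).

```python
import hashlib


def line_key(line: str) -> str:
    """Hash a log line after collapsing whitespace and lowercasing (assumed policy)."""
    normalized = " ".join(line.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_eval_overlap(train_lines, eval_lines):
    """Return the training lines whose normalized hash does not appear in the eval set."""
    eval_keys = {line_key(line) for line in eval_lines}
    return [line for line in train_lines if line_key(line) not in eval_keys]
```

In practice the eval keys would likely be precomputed once and stored alongside the held-out slices so every ingest job checks against the same set.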

Definition of Done

  • Produce a source inventory with license, size, format, and contamination risks.
  • Define train/dev/test separation before ingest.
  • Add sanitization rules for secrets, internal paths, tokens, and user identifiers.
  • Propose a training-mixture tranche with byte/token estimates.
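As a starting point for the sanitization item, the rules could be expressed as an ordered list of regex substitutions. This is a minimal sketch under stated assumptions: the patterns, the placeholder strings (`<REDACTED>`, `<EMAIL>`, `<USER>`), and the name `sanitize_line` are all hypothetical and would need review against real log samples before use.

```python
import re

# Hypothetical rule set for illustration; order matters (Bearer tokens first,
# so a "token: Bearer abc" line does not leak the credential).
RULES = [
    # Bearer credentials in HTTP-style headers
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer <REDACTED>"),
    # key=value or key: value secrets (password, api_key, token, ...)
    (re.compile(r"(?i)(password|passwd|secret|api[_-]?key|token)(\s*[=:]\s*)\S+"),
     r"\1\2<REDACTED>"),
    # email addresses as a proxy for user identifiers
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "<EMAIL>"),
    # user names embedded in home-directory paths
    (re.compile(r"/(home|Users)/[^/\s]+"), r"/\1/<USER>"),
]


def sanitize_line(line: str) -> str:
    """Apply each redaction rule in order and return the sanitized line."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line
```

Regex redaction of this kind tends to miss high-entropy secrets that carry no keyword, so it would probably need to be paired with a dedicated secret scanner and a manual audit of a sample.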
