🤖 Standalone data-ingestion task.
Description
Identify and prepare large public diagnostic-log corpora suitable for use as pretraining data. This work is separate from constructing the held-out log-stream eval.
Candidate sources:
Do not train on held-out eval slices. Treat Marin-owned CI/Iris/Zephyr logs as eval-only until sanitization, governance, and leakage policy are agreed.
Definition of Done
- Produce a source inventory with license, size, format, and contamination risks.
- Define train/dev/test separation before ingest.
- Add sanitization rules for secrets, internal paths, tokens, and user identifiers.
- Propose a training-mixture tranche with byte/token estimates.
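
As a starting point for discussion, the sanitization and split items above could be sketched as follows. This is a minimal sketch under stated assumptions, not an agreed policy: the regex patterns, the placeholder strings, the `sanitize_line` and `assign_split` helpers, and the 1% dev / 1% test defaults are all hypothetical. Splitting on a hash of a source-level identifier (for example a repository or job ID) rather than per line is one way to keep every line from a given source in a single split.

```python
import hashlib
import re

# Hypothetical redaction rules: (pattern, replacement). Illustrative only;
# a real rule set would come out of the sanitization policy review.
RULES = [
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "<AWS_KEY>"),              # AWS-style access key IDs
    (re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"), "<GITHUB_TOKEN>"),  # GitHub-style tokens
    (re.compile(r"(?i)\b(bearer)\s+[A-Za-z0-9._\-]+"), r"\1 <TOKEN>"),  # bearer tokens in headers
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "<EMAIL>"),
    (re.compile(r"/(home|Users)/[^/\s]+"), r"/\1/<USER>"),           # user names in home paths
]


def sanitize_line(line: str) -> str:
    """Apply every redaction rule to one log line, in order."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


def assign_split(source_id: str, dev_pct: int = 1, test_pct: int = 1) -> str:
    """Deterministically map a source identifier to train/dev/test.

    The same source_id always lands in the same split, so the assignment
    can be fixed before any data is ingested.
    """
    digest = hashlib.sha256(source_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


if __name__ == "__main__":
    print(sanitize_line("user alice@example.com at /home/alice/run.log"))
    print(assign_split("example-org/example-repo"))
```

Regex redaction of this kind only catches known formats; it would not replace a review of each source for secrets that do not match a pattern.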