[data] Add sample-capped public diagnostic log sourcing#5121
[data] Add sample-capped public diagnostic log sourcing#5121
Conversation
|
Claude finished @dlwh's task in 6m 59s —— View job Code reviewNo issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance. What was reviewed (4 parallel agents: 2× bug scan, 2× CLAUDE.md compliance):
Spot-checks that came back clean
Style-level items surfaced but not flagged (below the high-signal bar for this review, judgment calls you may still want to revisit):
· Branch |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 794b09b4f2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| fn=tokenize, | ||
| config=TokenizeConfig( | ||
| train_paths=[extracted_step.as_input_name() / "train/*.jsonl"], | ||
| validation_paths=versioned([extracted_step.as_input_name() / "dev/*.jsonl"]), |
There was a problem hiding this comment.
Remove versioned wrapper around InputName validation path
validation_paths is wrapped in versioned(...) even though it contains an InputName. During executor config instantiation, VersionedValue payloads are not recursively resolved, so this leaves an InputName object in TokenizeConfig.validation_paths at runtime. In tokenize, _expand_tokenize_paths expects strings and calls .endswith(...), so tokenize/all will fail once validation processing begins for this pipeline.
Useful? React with 👍 / 👎.
| "/log/", | ||
| "/logs/", |
There was a problem hiding this comment.
Match root-level log directories in path signal regex
The path signals include "/log/" and "/logs/", which only match when those segments are preceded by another /. Paths like logs/build.txt (repo-root log folders) will not match, so looks_like_diagnostic_log_row will incorrectly drop valid diagnostic rows even when content contains tracebacks/errors, reducing extraction coverage and skewing the sample.
Useful? React with 👍 / 👎.
Add #5094 training-data wiring for public diagnostic logs with a verified source inventory, license/provenance gating, deterministic train/dev/test plus issue_5093 holdout splitting, and sanitization rules. Add an opt-in experiment entry point that only runs against capped samples from pre-staged corpora and never triggers full-corpus pulls. Includes focused tests and docs for sizing and governance caveats.
Part of #5094