
[data] Add sample-capped public diagnostic log sourcing #5121

Open

dlwh wants to merge 1 commit into main from agent/20260423-fix-5094

Conversation

@dlwh (Member) commented Apr 23, 2026

Add #5094 training-data wiring for public diagnostic logs with a verified source inventory, license/provenance gating, deterministic train/dev/test plus issue_5093 holdout splitting, and sanitization rules. Add an opt-in experiment entry point that only runs against capped samples from pre-staged corpora and never triggers full-corpus pulls. Includes focused tests and docs for sizing and governance caveats.

Part of #5094

@dlwh dlwh added the agent-generated Created by automation/agent label Apr 23, 2026
@claude claude Bot (Contributor) commented Apr 23, 2026

Claude finished @dlwh's task in 6m 59s (View job)


Code review

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

What was reviewed (4 parallel agents: 2× bug scan, 2× CLAUDE.md compliance):

  • docs/reports/diagnostic_logs_training_sources_5094.md
  • experiments/exp5094_public_diagnostic_logs.py
  • lib/marin/src/marin/datakit/download/diagnostic_logs.py
  • tests/datakit/download/test_diagnostic_logs.py

Spot-checks that came back clean

  • Redaction regex ordering: specific patterns (GitHub ghp_/github_pat_, AWS AKIA) precede the generic key=value secret rule; the KV rule uses \g<key> to preserve the key name.
  • Partition math: BLAKE2b → 10,000 buckets, carve-outs 100 / 100 / 100 / 9,700 = 1% issue_5093_holdout / 1% dev / 1% test / 97% train, matching the docstring and the policy doc.
  • Sample-cap enforcement: counters["seen_rows"] >= max_rows check is pre-increment, so exactly max_rows rows are admitted before breaking; outer loop exits cleanly when the inner budget is exhausted.
  • Imports all resolve in-tree: load_parquet_batched (marin.datakit.download.rollout_transforms), StepSpec (marin.execution.step_spec), fsspec_mkdirs (marin.utils), tokenize/TokenizeConfig (marin.processing.tokenize), llama3_tokenizer (experiments.llama).
  • executor_main(steps=..., description=...) — the first positional config arg is supplied by the @draccus.wrap() decorator at CLI parse time, not the caller.
  • InputName / str is a supported operator via InputName.__truediv__, so extracted_step.as_input_name() / "train/*.jsonl" is valid.
  • The three hidden CLI options (--dry_run, --executor_info_base_path, --prefix) follow prior art in experiments/pretraining_datasets/main.py for the experiment test harness.
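The redaction-ordering spot-check above can be illustrated with a minimal sketch. The patterns and placeholder strings here are assumptions, not the PR's actual rules; the point is the ordering: specific token formats are rewritten first, and the generic key=value rule runs last, using `\g<key>` so only the value is dropped.

```python
import re

# Specific token patterns run before the generic key=value rule, so a
# recognizable token inside a key=value pair gets the specific placeholder.
RULES = [
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[REDACTED_GITHUB_TOKEN]"),
    (re.compile(r"github_pat_[A-Za-z0-9_]{22,}"), "[REDACTED_GITHUB_TOKEN]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    # Generic rule last; \g<key> preserves the key name, dropping only the value.
    (re.compile(r"(?P<key>\b(?:api_key|secret|token|password)\b)\s*=\s*\S+", re.IGNORECASE),
     r"\g<key>=[REDACTED]"),
]

def redact(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```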
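The partition math can likewise be sketched with plain hashlib. The digest size and the bucket-to-split ordering below are assumptions; they merely reproduce the 100 / 100 / 100 / 9,700 carve-out described above.

```python
import hashlib

NUM_BUCKETS = 10_000

def bucket_for(doc_id: str) -> int:
    # Deterministic: BLAKE2b digest of the document id, reduced mod 10,000.
    digest = hashlib.blake2b(doc_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_BUCKETS

def split_for(doc_id: str) -> str:
    b = bucket_for(doc_id)
    if b < 100:
        return "issue_5093_holdout"  # 1%
    if b < 200:
        return "dev"                 # 1%
    if b < 300:
        return "test"                # 1%
    return "train"                   # 97%
```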
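And the pre-increment cap check can be sketched generically (`take_capped` and its signature are hypothetical, not the PR's names):

```python
from typing import Iterable, Iterator

def take_capped(rows: Iterable[dict], max_rows: int) -> Iterator[dict]:
    # Pre-increment check: the budget is tested *before* admitting the row,
    # so exactly max_rows rows are yielded before breaking.
    seen = 0
    for row in rows:
        if seen >= max_rows:
            break
        seen += 1
        yield row
```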

Style-level items surfaced but not flagged (below the high-signal bar for this review, judgment calls you may still want to revisit):

  • The inventory tests (test_source_inventory_contains_required_candidates, test_training_ready_sources_are_opt_in_and_narrow) assert the license-gating policy. They look tautological in isolation but they are what catches a future regression where ghalogs/logchunks/loghub accidentally get flipped to TRAINING_READY — a governance contract, not a value-mirror.
  • counters / partition_counts as plain dicts, and starcoder_fixture_row_to_record returning dict[str, str]: AGENTS.md prefers dataclasses over raw dicts, but these are ephemeral local accumulators / JSONL payloads where a dataclass would only add indirection. Not a clear violation.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 794b09b4f2


fn=tokenize,
config=TokenizeConfig(
    train_paths=[extracted_step.as_input_name() / "train/*.jsonl"],
    validation_paths=versioned([extracted_step.as_input_name() / "dev/*.jsonl"]),

P1: Remove versioned wrapper around InputName validation path

validation_paths is wrapped in versioned(...) even though it contains an InputName. During executor config instantiation, VersionedValue payloads are not recursively resolved, so this leaves an InputName object in TokenizeConfig.validation_paths at runtime. In tokenize, _expand_tokenize_paths expects strings and calls .endswith(...), so tokenize/all will fail once validation processing begins for this pipeline.
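A minimal sketch of the suggested fix, assuming the TokenizeConfig call is otherwise unchanged: pass the InputName path list directly instead of wrapping it.

```python
config=TokenizeConfig(
    train_paths=[extracted_step.as_input_name() / "train/*.jsonl"],
    # InputName resolves during executor config instantiation; versioned(...)
    # payloads are not recursively resolved, so the wrapper is dropped here.
    validation_paths=[extracted_step.as_input_name() / "dev/*.jsonl"],
```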


Comment on lines +47 to +48
"/log/",
"/logs/",

P2: Match root-level log directories in path signal regex

The path signals include "/log/" and "/logs/", which only match when those segments are preceded by another /. Paths like logs/build.txt (repo-root log folders) will not match, so looks_like_diagnostic_log_row will incorrectly drop valid diagnostic rows even when content contains tracebacks/errors, reducing extraction coverage and skewing the sample.
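A quick sketch of the distinction (pattern names are illustrative; looks_like_diagnostic_log_row's actual implementation is not shown here): anchoring with `(^|/)` lets the signal match both nested and repo-root log directories.

```python
import re

# "/logs?/" requires a preceding slash, so repo-root "logs/..." never matches.
NESTED_ONLY = re.compile(r"/logs?/")
# "(^|/)logs?/" also accepts the segment at the start of the path.
ROOT_AWARE = re.compile(r"(^|/)logs?/")

print(bool(NESTED_ONLY.search("logs/build.txt")))  # False: root-level dir missed
print(bool(ROOT_AWARE.search("logs/build.txt")))   # True
print(bool(ROOT_AWARE.search("ci/log/run1.txt")))  # True
```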

