
[data] Add sample-capped public diagnostic log sourcing #5121

Open

dlwh wants to merge 1 commit into main from agent/20260423-fix-5094

Conversation

@dlwh (Member) commented Apr 23, 2026

Add #5094 training-data wiring for public diagnostic logs with a verified source inventory, license/provenance gating, deterministic train/dev/test plus issue_5093 holdout splitting, and sanitization rules. Add an opt-in experiment entry point that only runs against capped samples from pre-staged corpora and never triggers full-corpus pulls. Includes focused tests and docs for sizing and governance caveats.

Part of #5094

@dlwh dlwh added the agent-generated Created by automation/agent label Apr 23, 2026
@claude claude Bot (Contributor) commented Apr 23, 2026

Claude finished @dlwh's task in 6m 59s (View job)


Code review

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

What was reviewed (4 parallel agents: 2× bug scan, 2× CLAUDE.md compliance):

  • docs/reports/diagnostic_logs_training_sources_5094.md
  • experiments/exp5094_public_diagnostic_logs.py
  • lib/marin/src/marin/datakit/download/diagnostic_logs.py
  • tests/datakit/download/test_diagnostic_logs.py

Spot-checks that came back clean

  • Redaction regex ordering: specific patterns (GitHub ghp_/github_pat_, AWS AKIA) precede the generic key=value secret rule; the KV rule uses \g<key> to preserve the key name.
  • Partition math: BLAKE2b → 10,000 buckets, carve-outs 100 / 100 / 100 / 9,700 = 1% issue_5093_holdout / 1% dev / 1% test / 97% train, matching the docstring and the policy doc.
  • Sample-cap enforcement: counters["seen_rows"] >= max_rows check is pre-increment, so exactly max_rows rows are admitted before breaking; outer loop exits cleanly when the inner budget is exhausted.
  • Imports all resolve in-tree: load_parquet_batched (marin.datakit.download.rollout_transforms), StepSpec (marin.execution.step_spec), fsspec_mkdirs (marin.utils), tokenize/TokenizeConfig (marin.processing.tokenize), llama3_tokenizer (experiments.llama).
  • executor_main(steps=..., description=...) — the first positional config arg is supplied by the @draccus.wrap() decorator at CLI parse time, not the caller.
  • InputName / str is a supported operator via InputName.__truediv__, so extracted_step.as_input_name() / "train/*.jsonl" is valid.
  • The three hidden CLI options (--dry_run, --executor_info_base_path, --prefix) follow prior art in experiments/pretraining_datasets/main.py for the experiment test harness.
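The redaction-ordering spot-check above can be illustrated with a minimal sketch. The patterns and placeholder strings here are assumptions, not the PR's actual rules; the point is the ordering: specific token formats are rewritten first, and the generic key=value rule runs last, using `\g<key>` so only the value is dropped.

```python
import re

# Specific token patterns run before the generic key=value rule, so a
# recognizable token inside a key=value pair gets the specific placeholder.
RULES = [
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[REDACTED_GITHUB_TOKEN]"),
    (re.compile(r"github_pat_[A-Za-z0-9_]{22,}"), "[REDACTED_GITHUB_TOKEN]"),
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    # Generic rule last; \g<key> preserves the key name, dropping only the value.
    (re.compile(r"(?P<key>\b(?:api_key|secret|token|password)\b)\s*=\s*\S+", re.IGNORECASE),
     r"\g<key>=[REDACTED]"),
]

def redact(text: str) -> str:
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```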
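The partition math can likewise be sketched with plain hashlib. The digest size and the bucket-to-split ordering below are assumptions; they merely reproduce the 100 / 100 / 100 / 9,700 carve-out described above.

```python
import hashlib

NUM_BUCKETS = 10_000

def bucket_for(doc_id: str) -> int:
    # Deterministic: BLAKE2b digest of the document id, reduced mod 10,000.
    digest = hashlib.blake2b(doc_id.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_BUCKETS

def split_for(doc_id: str) -> str:
    b = bucket_for(doc_id)
    if b < 100:
        return "issue_5093_holdout"  # 1%
    if b < 200:
        return "dev"                 # 1%
    if b < 300:
        return "test"                # 1%
    return "train"                   # 97%
```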
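And the pre-increment cap check can be sketched generically (`take_capped` and its signature are hypothetical, not the PR's names):

```python
from typing import Iterable, Iterator

def take_capped(rows: Iterable[dict], max_rows: int) -> Iterator[dict]:
    # Pre-increment check: the budget is tested *before* admitting the row,
    # so exactly max_rows rows are yielded before breaking.
    seen = 0
    for row in rows:
        if seen >= max_rows:
            break
        seen += 1
        yield row
```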

Style-level items surfaced but not flagged (below the high-signal bar for this review, judgment calls you may still want to revisit):

  • The inventory tests (test_source_inventory_contains_required_candidates, test_training_ready_sources_are_opt_in_and_narrow) assert the license-gating policy. They look tautological in isolation but they are what catches a future regression where ghalogs/logchunks/loghub accidentally get flipped to TRAINING_READY — a governance contract, not a value-mirror.
  • counters / partition_counts as plain dicts, and starcoder_fixture_row_to_record returning dict[str, str]: AGENTS.md prefers dataclasses over raw dicts, but these are ephemeral local accumulators / JSONL payloads where a dataclass would only add indirection. Not a clear violation.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 794b09b4f2


fn=tokenize,
config=TokenizeConfig(
    train_paths=[extracted_step.as_input_name() / "train/*.jsonl"],
    validation_paths=versioned([extracted_step.as_input_name() / "dev/*.jsonl"]),

P1: Remove versioned wrapper around InputName validation path

validation_paths is wrapped in versioned(...) even though it contains an InputName. During executor config instantiation, VersionedValue payloads are not recursively resolved, so this leaves an InputName object in TokenizeConfig.validation_paths at runtime. In tokenize, _expand_tokenize_paths expects strings and calls .endswith(...), so tokenize/all will fail once validation processing begins for this pipeline.
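A minimal sketch of the suggested fix, assuming the TokenizeConfig call is otherwise unchanged: pass the InputName path list directly instead of wrapping it.

```python
config=TokenizeConfig(
    train_paths=[extracted_step.as_input_name() / "train/*.jsonl"],
    # InputName resolves during executor config instantiation; versioned(...)
    # payloads are not recursively resolved, so the wrapper is dropped here.
    validation_paths=[extracted_step.as_input_name() / "dev/*.jsonl"],
```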


Comment on lines +47 to +48
"/log/",
"/logs/",

P2: Match root-level log directories in path signal regex

The path signals include "/log/" and "/logs/", which only match when those segments are preceded by another /. Paths like logs/build.txt (repo-root log folders) will not match, so looks_like_diagnostic_log_row will incorrectly drop valid diagnostic rows even when content contains tracebacks/errors, reducing extraction coverage and skewing the sample.
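A quick sketch of the distinction (pattern names are illustrative; looks_like_diagnostic_log_row's actual implementation is not shown here): anchoring with `(^|/)` lets the signal match both nested and repo-root log directories.

```python
import re

# "/logs?/" requires a preceding slash, so repo-root "logs/..." never matches.
NESTED_ONLY = re.compile(r"/logs?/")
# "(^|/)logs?/" also accepts the segment at the start of the path.
ROOT_AWARE = re.compile(r"(^|/)logs?/")

print(bool(NESTED_ONLY.search("logs/build.txt")))  # False: root-level dir missed
print(bool(ROOT_AWARE.search("logs/build.txt")))   # True
print(bool(ROOT_AWARE.search("ci/log/run1.txt")))  # True
```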

