datakit: canonical source registry#5105

Merged
ravwojdyla merged 4 commits into main from rav-datakit-sources
Apr 23, 2026

Conversation

Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 22, 2026

  • adds marin.datakit.sources mirroring the 102 datasets in marin-community/token-counts; 3 entries currently commented out, leaving 99 active
  • DatakitSource frozen dataclass per entry: hf_dataset_id, revision, staged_path, data_subdir, id_field, text_field, file_extensions, rough_token_count_b
  • all_sources() returns the {name: DatakitSource} map over the 99 active entries; pinned_sources() filters to the 97 entries with both revision and repo set — the subset the ferry can materialize [1]
  • staged paths cross-checked against gs://marin-us-central1/raw/: every active entry's path is present
  • common-pile entries use the _filtered HF variants Marin actually downloads (e.g. common-pile/arxiv_abstracts_filtered), not the user-facing display names in the token-counts CSV
  • no unit tests in this PR — every candidate assertion was tautological against the data definition
  • 3 sources are commented out until their staging is verified clean [2]; finetranslations/* additionally carries a TODO about splitting the two accounting halves by text_field/hf_urls_glob/data_subdir so they don't normalize to identical rows

Footnotes

  1. 2 of the 99 active entries are intentionally unpinned (hplt_v3, nsf_awards) with inline TODOs: hplt_v3 has staged data but no provenance.json to recover the revision, and its download_hplt_v3_step function was removed from the tree; nsf_awards is API-sourced and needs a bespoke download step.

  2. finetranslations/{multilingual,web} — staging at raw/finetranslations_d17a789b is still running, no provenance.json and no .executor_status=SUCCESS. common_corpus/englishraw/common_corpus_english-b78a5c1 is missing its .executor_status marker, so the run completion can't be confirmed. The integrity check that keeps this invariant enforced is in a follow-up workflow PR — it needs GCS auth so can't be a plain unit test.

marin.datakit.sources defines DatakitSource (frozen dataclass capturing
hf_dataset_id, revision, optional staged_path, schema hints) plus two
cached factories: all_sources() for the full 102-entry set mirrored from
marin-community/token-counts, and pinned_sources() for the 98 entries
with a pinned revision and non-empty HF repo (the subset the ferry can
actually materialize today).

Four entries remain unpinned and carry inline TODOs:
* finetranslations/{multilingual,web} — HF download module missing; the
  staged dir has no provenance.json to recover from.
* hplt_v3 — download_hplt_v3_step was removed from the tree; staged dir
  has no provenance.json.
* nsf_awards — API-sourced; needs a bespoke download step.

Staging paths were cross-checked against gs://marin-us-central1/raw/:
all 72 unique paths verified present. A follow-up PR will add a CI check
to the datakit smoke workflow to keep this invariant enforced.

Coverage: 109 offline tests exercise the schema, uniqueness of names,
hex-SHA revision format, pinned-subset filtering, and dict-lookup
semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ravwojdyla-agent ravwojdyla-agent added the agent-generated (Created by automation/agent) label Apr 22, 2026
The tests asserted invariants the data structure already guarantees: dict
keys match source.name because the factory literally builds it that way,
pinned_sources() contains only pinned entries because that's its filter,
dataclass fields are non-empty because they're defaulted, etc. Per the
repo's testing guidance, tests must fail on wrong behavior, not on
implementation changes — these failed on either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
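As a hedged illustration of the distinction this commit draws (the functions below are toy stand-ins, not code from the repo): the first test re-asserts how the factory is built and can only fail on an implementation change, while the second checks an observable property of the data that would fail on genuinely wrong behavior:

```python
import re


def all_sources():
    # Toy stand-in for the registry factory under discussion.
    return {"example": {"name": "example", "revision": "f1d7a9a"}}


def test_keys_match_names():
    # Tautological: the factory literally builds keys from name, so this
    # assertion cannot catch a data bug — only a refactor.
    for name, src in all_sources().items():
        assert name == src["name"]


def test_revisions_are_hex_shas():
    # Behavioral: fails if any revision is not a short hex SHA,
    # regardless of how the factory happens to be implemented.
    for src in all_sources().values():
        assert re.fullmatch(r"[0-9a-f]{7,40}", src["revision"])
```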
@ravwojdyla-agent ravwojdyla-agent changed the title Add canonical datakit source registry datakit: canonical source registry Apr 22, 2026
@ravwojdyla ravwojdyla requested a review from Helw150 April 22, 2026 23:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7a17dbd79


```python
    ``revision``. Others (e.g. ``nsf_awards``, ``hplt_v3``) are carried
    for completeness but need custom wiring before they'll ferry.
    """
    return {name: src for name, src in all_sources().items() if src.revision and src.hf_dataset_id}
```


P1: Exclude transform-dependent sources from pinned list

pinned_sources() currently treats any entry with revision and hf_dataset_id as ferry-runnable, but several pinned entries in this registry (e.g. coderforge, gpt-oss-rollouts, superior-reasoning, swe-rebench-openhands, synthetic-1) require bespoke row_to_doc transforms in their download modules before text/id exist. Because these entries keep the default text_field="text" and point to raw staged paths, a generic normalize flow that trusts pinned_sources() will hit record[text_field] in normalize._make_normalize_fn and fail on raw schema. This means the function overstates what can be materialized "today" and should filter these out or carry explicit preprocessing/schema metadata.


```python
hf_dataset_id="common-pile/arxiv_abstracts_filtered",
revision="f1d7a9a",
staged_path="raw/common_pile/arxiv_abstracts_filtered-f1d7a9a",
rough_token_count_b=0.54,
```
Member


What are the rough token counts drawn from here?

Contributor

@ravwojdyla ravwojdyla Apr 23, 2026


(totally stolen) from your awesome report: https://huggingface.co/spaces/marin-community/token-count-viewer

ravwojdyla and others added 2 commits April 22, 2026 17:18
* finetranslations/{multilingual,web} — staging at raw/finetranslations_d17a789b
  is still running; no provenance.json and no .executor_status=SUCCESS
* common_corpus/english — raw/common_corpus_english-b78a5c1 is missing its
  .executor_status marker, so we can't confirm the run completed cleanly

Commented out with TODOs to re-enable once staging is verified; keeps the
registry structurally intact so the diff shows exactly which entries are
paused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both /multilingual and /web point at the same parallel-corpus dump. They
can't be re-enabled as-is without a distinguishing pattern (different
text_field, hf_urls_glob, or data_subdir) or they'll normalize to identical
rows and double-count the mixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ravwojdyla
Contributor

@Helw150 I updated this a little bit and disabled 3 sources:

  • common_corpus_english-b78a5c1 — missing its .executor_status marker
  • 2 finetranslations — staging status is still RUNNING

@ravwojdyla ravwojdyla merged commit f377f98 into main Apr 23, 2026
43 checks passed
@ravwojdyla ravwojdyla deleted the rav-datakit-sources branch April 23, 2026 00:26
ravwojdyla added a commit that referenced this pull request Apr 25, 2026
Builds the per-source ferry DAG and training wrapper on top of the
merged marin.datakit.sources registry (#5105):

* experiments/datakit_testbed/settings.py — testbed-wide constants
  (TESTBED_TOKENIZER, TESTBED_SEQ_LEN, TESTBED_STAGING_REGION,
  RAW_TARGET_TOTAL_TOKENS_B)
* experiments/datakit_testbed/noop_dedup.py — metadata-only stand-in for
  fuzzy-dup marking; emits empty attr parquets so consolidate's 1:1
  attr-file invariant holds without reading data
* experiments/datakit_testbed/sampler.py — post-normalize by-provenance
  sampler (first K shards by filename; normalize's uniform partitioning
  makes this byte-fair and content-fair)
* experiments/datakit_testbed/dag.py — wires
  download -> normalize -> sample -> noop_dedup -> consolidate
  with downloads grouped by (hf_id, revision, staged_path, urls_glob)
* experiments/datakit_testbed/mixture.py — proportional mixture builder
  over tokenized caches, weighting by rough_token_count_b
* experiments/datakit_testbed/train.py — Grug-MoE harness with simulated
  epoching (target_budget / experiment_budget on LmDataConfig)
* experiments/ferries/datakit_testbed_ferry.py — entry point with
  us-central1 region guard

42 offline tests across DAG shape, sampler behavior, noop dedup
end-to-end with consolidate, mixture arithmetic, and simulated-epoching
budget math.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
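The proportional mixture builder described for mixture.py reduces to weight normalization over rough_token_count_b. A sketch under that assumption — the function name and sample counts are illustrative, not the testbed's actual API:

```python
def mixture_weights(token_counts_b: dict[str, float]) -> dict[str, float]:
    """Proportional weights: each source in ratio to its rough token count."""
    total = sum(token_counts_b.values())
    return {name: count / total for name, count in token_counts_b.items()}


# Illustrative counts (in billions of tokens):
weights = mixture_weights({"arxiv_abstracts": 0.54, "web_crawl": 5.46})
# arxiv_abstracts gets 0.54 / 6.0 = 0.09 of the mixture, web_crawl 0.91
```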