datakit: canonical source registry#5105

Merged
ravwojdyla merged 4 commits into main from rav-datakit-sources
Apr 23, 2026

Conversation

Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 22, 2026

  • adds marin.datakit.sources mirroring the 102 datasets in marin-community/token-counts; 3 entries currently commented out, leaving 99 active
  • DatakitSource frozen dataclass per entry: hf_dataset_id, revision, staged_path, data_subdir, id_field, text_field, file_extensions, rough_token_count_b
  • all_sources() returns the {name: DatakitSource} map over the 99 active entries; pinned_sources() filters to the 97 entries with both revision and repo set — the subset the ferry can materialize [1]
  • staged paths cross-checked against gs://marin-us-central1/raw/: every active entry's path is present
  • common-pile entries use the _filtered HF variants Marin actually downloads (e.g. common-pile/arxiv_abstracts_filtered), not the user-facing display names in the token-counts CSV
  • no unit tests in this PR — every candidate assertion was tautological against the data definition
  • 3 sources are commented out until their staging is verified clean [2]; finetranslations/* additionally carries a TODO about splitting the two accounting halves by text_field/hf_urls_glob/data_subdir so they don't normalize to identical rows

Footnotes

  1. 2 of the 99 active entries are intentionally unpinned (hplt_v3, nsf_awards) with inline TODOs: hplt_v3 has staged data but no provenance.json to recover the revision, and its download_hplt_v3_step function was removed from the tree; nsf_awards is API-sourced and needs a bespoke download step.

  2. finetranslations/{multilingual,web} — staging at raw/finetranslations_d17a789b is still running, no provenance.json and no .executor_status=SUCCESS. common_corpus/englishraw/common_corpus_english-b78a5c1 is missing its .executor_status marker, so the run completion can't be confirmed. The integrity check that keeps this invariant enforced is in a follow-up workflow PR — it needs GCS auth so can't be a plain unit test.

marin.datakit.sources defines DatakitSource (frozen dataclass capturing
hf_dataset_id, revision, optional staged_path, schema hints) plus two
cached factories: all_sources() for the full 102-entry set mirrored from
marin-community/token-counts, and pinned_sources() for the 98 entries
with a pinned revision and non-empty HF repo (the subset the ferry can
actually materialize today).

Four entries remain unpinned and carry inline TODOs:
* finetranslations/{multilingual,web} — HF download module missing; the
  staged dir has no provenance.json to recover from.
* hplt_v3 — download_hplt_v3_step was removed from the tree; staged dir
  has no provenance.json.
* nsf_awards — API-sourced; needs a bespoke download step.

Staging paths were cross-checked against gs://marin-us-central1/raw/:
all 72 unique paths verified present. A follow-up PR will add a CI check
to the datakit smoke workflow to keep this invariant enforced.

Coverage: 109 offline tests exercise the schema, uniqueness of names,
hex-SHA revision format, pinned-subset filtering, and dict-lookup
semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ravwojdyla-agent ravwojdyla-agent added the agent-generated (Created by automation/agent) label Apr 22, 2026
The tests asserted invariants the data structure already guarantees: dict
keys match source.name because the factory literally builds it that way,
pinned_sources() contains only pinned entries because that's its filter,
dataclass fields are non-empty because they're defaulted, etc. Per the
repo's testing guidance, tests must fail on wrong behavior, not on
implementation changes — these failed on either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
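As a hedged illustration of the distinction this commit draws (the functions below are toy stand-ins, not code from the repo): the first test re-asserts how the factory is built and can only fail on an implementation change, while the second checks an observable property of the data that would fail on genuinely wrong behavior:

```python
import re


def all_sources():
    # Toy stand-in for the registry factory under discussion.
    return {"example": {"name": "example", "revision": "f1d7a9a"}}


def test_keys_match_names():
    # Tautological: the factory literally builds keys from name, so this
    # assertion cannot catch a data bug — only a refactor.
    for name, src in all_sources().items():
        assert name == src["name"]


def test_revisions_are_hex_shas():
    # Behavioral: fails if any revision is not a short hex SHA,
    # regardless of how the factory happens to be implemented.
    for src in all_sources().values():
        assert re.fullmatch(r"[0-9a-f]{7,40}", src["revision"])
```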
@ravwojdyla-agent ravwojdyla-agent changed the title Add canonical datakit source registry datakit: canonical source registry Apr 22, 2026
@ravwojdyla ravwojdyla requested a review from Helw150 April 22, 2026 23:52

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7a17dbd79


```python
    ``revision``. Others (e.g. ``nsf_awards``, ``hplt_v3``) are carried
    for completeness but need custom wiring before they'll ferry.
    """
    return {name: src for name, src in all_sources().items() if src.revision and src.hf_dataset_id}
```


P1: Exclude transform-dependent sources from pinned list

pinned_sources() currently treats any entry with revision and hf_dataset_id as ferry-runnable, but several pinned entries in this registry (e.g. coderforge, gpt-oss-rollouts, superior-reasoning, swe-rebench-openhands, synthetic-1) require bespoke row_to_doc transforms in their download modules before text/id exist. Because these entries keep the default text_field="text" and point to raw staged paths, a generic normalize flow that trusts pinned_sources() will hit record[text_field] in normalize._make_normalize_fn and fail on raw schema. This means the function overstates what can be materialized "today" and should filter these out or carry explicit preprocessing/schema metadata.


```python
hf_dataset_id="common-pile/arxiv_abstracts_filtered",
revision="f1d7a9a",
staged_path="raw/common_pile/arxiv_abstracts_filtered-f1d7a9a",
rough_token_count_b=0.54,
```
Member


What are the rough token counts drawn from here?

Contributor

@ravwojdyla ravwojdyla Apr 23, 2026


(totally stolen) from your awesome report: https://huggingface.co/spaces/marin-community/token-count-viewer

ravwojdyla and others added 2 commits April 22, 2026 17:18
* finetranslations/{multilingual,web} — staging at raw/finetranslations_d17a789b
  is still running; no provenance.json and no .executor_status=SUCCESS
* common_corpus/english — raw/common_corpus_english-b78a5c1 is missing its
  .executor_status marker, so we can't confirm the run completed cleanly

Commented out with TODOs to re-enable once staging is verified; keeps the
registry structurally intact so the diff shows exactly which entries are
paused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both /multilingual and /web point at the same parallel-corpus dump. They
can't be re-enabled as-is without a distinguishing pattern (different
text_field, hf_urls_glob, or data_subdir) or they'll normalize to identical
rows and double-count the mixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ravwojdyla
Contributor

@Helw150 I updated this a little bit and disabled 3 sources:

  • common_corpus_english-b78a5c1 — missing its .executor_status marker
  • 2 finetranslations — staging status is still RUNNING

@ravwojdyla ravwojdyla merged commit f377f98 into main Apr 23, 2026
43 checks passed
@ravwojdyla ravwojdyla deleted the rav-datakit-sources branch April 23, 2026 00:26
ravwojdyla added a commit that referenced this pull request Apr 25, 2026
Builds the per-source ferry DAG and training wrapper on top of the
merged marin.datakit.sources registry (#5105):

* experiments/datakit_testbed/settings.py — testbed-wide constants
  (TESTBED_TOKENIZER, TESTBED_SEQ_LEN, TESTBED_STAGING_REGION,
  RAW_TARGET_TOTAL_TOKENS_B)
* experiments/datakit_testbed/noop_dedup.py — metadata-only stand-in for
  fuzzy-dup marking; emits empty attr parquets so consolidate's 1:1
  attr-file invariant holds without reading data
* experiments/datakit_testbed/sampler.py — post-normalize by-provenance
  sampler (first K shards by filename; normalize's uniform partitioning
  makes this byte-fair and content-fair)
* experiments/datakit_testbed/dag.py — wires
  download -> normalize -> sample -> noop_dedup -> consolidate
  with downloads grouped by (hf_id, revision, staged_path, urls_glob)
* experiments/datakit_testbed/mixture.py — proportional mixture builder
  over tokenized caches, weighting by rough_token_count_b
* experiments/datakit_testbed/train.py — Grug-MoE harness with simulated
  epoching (target_budget / experiment_budget on LmDataConfig)
* experiments/ferries/datakit_testbed_ferry.py — entry point with
  us-central1 region guard

42 offline tests across DAG shape, sampler behavior, noop dedup
end-to-end with consolidate, mixture arithmetic, and simulated-epoching
budget math.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
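The proportional mixture builder described for mixture.py reduces to weight normalization over rough_token_count_b. A sketch under that assumption — the function name and sample counts are illustrative, not the testbed's actual API:

```python
def mixture_weights(token_counts_b: dict[str, float]) -> dict[str, float]:
    """Proportional weights: each source in ratio to its rough token count."""
    total = sum(token_counts_b.values())
    return {name: count / total for name, count in token_counts_b.items()}


# Illustrative counts (in billions of tokens):
weights = mixture_weights({"arxiv_abstracts": 0.54, "web_crawl": 5.46})
# arxiv_abstracts gets 0.54 / 6.0 = 0.09 of the mixture, web_crawl 0.91
```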