Conversation
marin.datakit.sources defines DatakitSource (frozen dataclass capturing
hf_dataset_id, revision, optional staged_path, schema hints) plus two
cached factories: all_sources() for the full 102-entry set mirrored from
marin-community/token-counts, and pinned_sources() for the 98 entries
with a pinned revision and non-empty HF repo (the subset the ferry can
actually materialize today).
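For orientation, a minimal sketch of the registry's shape as described above. The lru_cache detail, the name field, and the field defaults are assumptions rather than the actual marin.datakit.sources code; the literal entry values are taken from the diff excerpted further down:

```python
# Sketch only: field defaults, the `name` field, and lru_cache are assumptions.
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DatakitSource:
    name: str
    hf_dataset_id: str               # HF repo the ferry downloads from ("" if API-sourced)
    revision: str = ""               # pinned hex SHA; empty while an inline TODO is open
    staged_path: str = ""            # e.g. "raw/common_pile/arxiv_abstracts_filtered-f1d7a9a"
    text_field: str = "text"         # schema hint consumed by normalize
    rough_token_count_b: float = 0.0


@lru_cache(maxsize=1)
def all_sources() -> dict[str, DatakitSource]:
    """Full registry mirrored from marin-community/token-counts (abridged)."""
    sources = [
        DatakitSource(
            name="common_pile/arxiv_abstracts_filtered",
            hf_dataset_id="common-pile/arxiv_abstracts_filtered",
            revision="f1d7a9a",
            staged_path="raw/common_pile/arxiv_abstracts_filtered-f1d7a9a",
            rough_token_count_b=0.54,
        ),
        # Unpinned: API-sourced, needs a bespoke download step (inline TODO).
        DatakitSource(name="nsf_awards", hf_dataset_id=""),
    ]
    return {src.name: src for src in sources}


@lru_cache(maxsize=1)
def pinned_sources() -> dict[str, DatakitSource]:
    """Entries with a pinned revision and a non-empty HF repo."""
    return {name: src for name, src in all_sources().items() if src.revision and src.hf_dataset_id}
```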
Four entries remain unpinned and carry inline TODOs:
* finetranslations/{multilingual,web} — HF download module missing; the
staged dir has no provenance.json to recover the revision from.
* hplt_v3 — download_hplt_v3_step was removed from the tree; staged dir
has no provenance.json.
* nsf_awards — API-sourced; needs a bespoke download step.
Staging paths were cross-checked against gs://marin-us-central1/raw/:
all 72 unique paths verified present. A follow-up PR will add a CI check
to the datakit smoke workflow to keep this invariant enforced.
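Pending that PR, the check could be as small as an existence probe over the unique staged paths. A sketch assuming gcsfs/fsspec and ambient GCS credentials (which is exactly why it can't run as a plain unit test):

```python
# Sketch of the planned smoke check. Assumes gcsfs is installed and GCS
# credentials are ambient; not the actual workflow code.
import fsspec

from marin.datakit.sources import all_sources

BUCKET = "gs://marin-us-central1"


def check_staged_paths() -> None:
    fs = fsspec.filesystem("gcs")
    staged = {src.staged_path for src in all_sources().values() if src.staged_path}
    missing = [path for path in sorted(staged) if not fs.exists(f"{BUCKET}/{path}")]
    assert not missing, f"staged paths absent from {BUCKET}: {missing}"
```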
Coverage: 109 offline tests exercise the schema, uniqueness of names,
hex-SHA revision format, pinned-subset filtering, and dict-lookup
semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tests asserted invariants the data structure already guarantees: dict keys match source.name because the factory literally builds the map that way, pinned_sources() contains only pinned entries because that is exactly its filter, and dataclass fields are non-empty because they are defaulted. Per the repo's testing guidance, tests must fail on wrong behavior, not on implementation changes — these would fail on either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
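To make the distinction concrete, a hypothetical pair (not tests from the repo): the first re-asserts the factory's own construction; the second checks the registry data itself and still catches a bad pin after any refactor:

```python
# Hypothetical illustration of the testing guidance, not repo code.
import re

from marin.datakit.sources import pinned_sources


def test_keys_match_names():  # tautological: the factory builds {src.name: src}
    assert all(name == src.name for name, src in pinned_sources().items())


def test_revisions_are_hex_shas():  # behavioral: validates the data itself
    for src in pinned_sources().values():
        assert re.fullmatch(r"[0-9a-f]{7,40}", src.revision), src.name
```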
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7a17dbd79
    ``revision``. Others (e.g. ``nsf_awards``, ``hplt_v3``) are carried
    for completeness but need custom wiring before they'll ferry.
    """
    return {name: src for name, src in all_sources().items() if src.revision and src.hf_dataset_id}
Exclude transform-dependent sources from pinned list
pinned_sources() currently treats any entry with revision and hf_dataset_id as ferry-runnable, but several pinned entries in this registry (e.g. coderforge, gpt-oss-rollouts, superior-reasoning, swe-rebench-openhands, synthetic-1) require bespoke row_to_doc transforms in their download modules before text/id exist. Because these entries keep the default text_field="text" and point to raw staged paths, a generic normalize flow that trusts pinned_sources() will hit record[text_field] in normalize._make_normalize_fn and fail on the raw schema. The function therefore overstates what can be materialized "today"; it should either filter these entries out or carry explicit preprocessing/schema metadata.
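One way to encode that suggestion: a sketch in which a hypothetical needs_transform flag (or equivalent schema metadata) gates the ferry-runnable subset. The field name and gating mechanism are assumptions:

```python
# Sketch of the reviewer's suggestion. `needs_transform` is a hypothetical
# DatakitSource field, not part of the current schema.
from marin.datakit.sources import DatakitSource, pinned_sources


def ferry_runnable_sources() -> dict[str, DatakitSource]:
    """Pinned entries whose staged rows already expose text/id.

    Excludes pinned sources (coderforge, gpt-oss-rollouts, ...) whose raw
    schema needs a bespoke row_to_doc transform before normalize can index
    record[text_field].
    """
    return {
        name: src
        for name, src in pinned_sources().items()
        if not getattr(src, "needs_transform", False)
    }
```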
    hf_dataset_id="common-pile/arxiv_abstracts_filtered",
    revision="f1d7a9a",
    staged_path="raw/common_pile/arxiv_abstracts_filtered-f1d7a9a",
    rough_token_count_b=0.54,
Where are the rough token counts here drawn from?
(totally stolen) from your awesome report: https://huggingface.co/spaces/marin-community/token-count-viewer
* finetranslations/{multilingual,web} — staging at raw/finetranslations_d17a789b
is still running; no provenance.json and no .executor_status=SUCCESS
* common_corpus/english — raw/common_corpus_english-b78a5c1 is missing its
.executor_status marker, so we can't confirm the run completed cleanly
Commented out with TODOs to re-enable once staging is verified; keeps the
registry structurally intact so the diff shows exactly which entries are
paused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both /multilingual and /web point at the same parallel-corpus dump. They can't be re-enabled as-is without a distinguishing pattern (different text_field, hf_urls_glob, or data_subdir) or they'll normalize to identical rows and double-count the mixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
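When the entries are re-enabled, a distinguishing pattern might look like the sketch below; the data_subdir values (and the name field) are illustrative guesses, not the dump's verified layout:

```python
# Illustrative sketch only: the data_subdir values are guessed distinguishing
# patterns, not the dump's verified layout.
from marin.datakit.sources import DatakitSource

PAUSED_FINETRANSLATIONS = [
    DatakitSource(
        name="finetranslations/multilingual",
        hf_dataset_id="",  # same parallel-corpus dump as /web; repo id elided here
        data_subdir="multilingual",  # keeps the halves from normalizing to identical rows
    ),
    DatakitSource(
        name="finetranslations/web",
        hf_dataset_id="",
        data_subdir="web",
    ),
]
```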
@Helw150 I updated this a little bit and disabled 3 sources:
Builds the per-source ferry DAG and training wrapper on top of the merged marin.datakit.sources registry (#5105):
* experiments/datakit_testbed/settings.py — testbed-wide constants (TESTBED_TOKENIZER, TESTBED_SEQ_LEN, TESTBED_STAGING_REGION, RAW_TARGET_TOTAL_TOKENS_B)
* experiments/datakit_testbed/noop_dedup.py — metadata-only stand-in for fuzzy-dup marking; emits empty attr parquets so consolidate's 1:1 attr-file invariant holds without reading data
* experiments/datakit_testbed/sampler.py — post-normalize by-provenance sampler (first K shards by filename; normalize's uniform partitioning makes this byte-fair and content-fair)
* experiments/datakit_testbed/dag.py — wires download -> normalize -> sample -> noop_dedup -> consolidate with downloads grouped by (hf_id, revision, staged_path, urls_glob)
* experiments/datakit_testbed/mixture.py — proportional mixture builder over tokenized caches, weighting by rough_token_count_b
* experiments/datakit_testbed/train.py — Grug-MoE harness with simulated epoching (target_budget / experiment_budget on LmDataConfig)
* experiments/ferries/datakit_testbed_ferry.py — entry point with us-central1 region guard

42 offline tests across DAG shape, sampler behavior, noop dedup end-to-end with consolidate, mixture arithmetic, and simulated-epoching budget math.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
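For the mixture arithmetic specifically, proportional weighting reduces to normalizing rough_token_count_b over the selected sources. A sketch (the function name and source selection are assumptions, not the mixture.py API):

```python
# Sketch of proportional mixture weighting by rough_token_count_b; names are
# assumptions, not the experiments/datakit_testbed/mixture.py API.
from marin.datakit.sources import pinned_sources


def mixture_weights() -> dict[str, float]:
    counts = {
        name: src.rough_token_count_b
        for name, src in pinned_sources().items()
        if src.rough_token_count_b > 0
    }
    total = sum(counts.values())
    return {name: b / total for name, b in counts.items()}
```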
* marin.datakit.sources mirroring the 102 datasets in marin-community/token-counts; 3 entries currently commented out, leaving 99 active
* DatakitSource frozen dataclass per entry: hf_dataset_id, revision, staged_path, data_subdir, id_field, text_field, file_extensions, rough_token_count_b
* all_sources() returns the {name: DatakitSource} map over the active 99; pinned_sources() filters to the 97 entries with both revision and repo set — the subset the ferry can materialize[^1]
* Staging paths cross-checked against gs://marin-us-central1/raw/: every active entry's path is present[^2]
* Names use the _filtered HF variants Marin actually downloads (e.g. common-pile/arxiv_abstracts_filtered), not the user-facing display names in the token-counts CSV
* finetranslations/* additionally has a TODO about splitting the two accounting halves by text_field / hf_urls_glob / data_subdir so they don't normalize to identical rows

Footnotes
[^1]: 2 of the 99 active entries are intentionally unpinned (hplt_v3, nsf_awards) with inline TODOs: hplt_v3 has staged data but no provenance.json to recover the revision, and its download_hplt_v3_step function was removed from the tree; nsf_awards is API-sourced and needs a bespoke download step. ↩
[^2]: finetranslations/{multilingual,web} — staging at raw/finetranslations_d17a789b is still running, with no provenance.json and no .executor_status=SUCCESS; common_corpus/english — raw/common_corpus_english-b78a5c1 is missing its .executor_status marker, so run completion can't be confirmed. The integrity check that keeps this invariant enforced is in a follow-up workflow PR — it needs GCS auth so can't be a plain unit test. ↩