Bootstrap datakit - consolidate and cleanup downloads #4142
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e99b40bb47
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
gcs_output_path=this_output_path(),
wait_for_completion=True,
"slimpajama_6b": slimpajama_6b_download().as_executor_step().cd("data"),
"dolma3_mix_150b_1025": dolma3_mix_150b_1025_download().as_executor_step().cd("15d04ee"),
Restore append_sha behavior for dolma3 download
This path now points into a revision subdirectory (.cd("15d04ee")), but the new dolma3_mix_150b_1025_download() StepSpec no longer sets append_sha_to_path=True on DownloadConfig, as the old inline ExecutorStep did. Files are therefore written directly under the base output directory instead of under <output>/15d04ee, so downstream tokenization in the dolma3 pipeline will read from a non-existent input path and find no data.
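The mismatch can be sketched in a few lines. The function below is a simplified stand-in for the downloader's path resolution, not the real marin API; only the flag's effect on the output directory is modeled:

```python
def resolve_download_dir(base: str, revision: str, append_sha_to_path: bool) -> str:
    """Where the downloader writes files: <base>/<revision> only when the flag is set."""
    return f"{base}/{revision}" if append_sha_to_path else base


base = "gs://bucket/raw/dolma3_mix_150b_1025"
consumer_path = f"{base}/15d04ee"  # what .cd("15d04ee") reads downstream

# Old inline ExecutorStep: flag set, writer and reader agree.
assert resolve_download_dir(base, "15d04ee", append_sha_to_path=True) == consumer_path

# New StepSpec without the flag: files land one level up, the reader finds nothing.
assert resolve_download_dir(base, "15d04ee", append_sha_to_path=False) != consumer_path
```

The fix restores the flag so the writer's directory matches the .cd("15d04ee") consumer again.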
datakit - consolidate and cleanup downloads
rjpower left a comment
looks like a nice cleanup!
(at some point i think we discussed moving the steps into some kind of pipeline definition file but i think that's best to do in a separate PR anyway)
this seems unused, remove?
I actually have no idea what this file is, but the ar5iv data is definitely used, so chesterton's fence decided to keep it
nit: ar5iv/download -> ar5iv.py
haha no worries, it looked like an old test case file, but feel free to keep it if it makes something stay working
Implements the first three stages of the datakit pipeline per the design doc (#2355): download_step wraps download_hf, normalize converts raw files to sorted/deduped Parquet with content-hash IDs, and tokenize_step wraps the existing tokenizer for Levanter cache output. Integration test exercises the full DAG (download → normalize → tokenize) via StepRunner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When override_output_path is a relative path (no URL scheme, doesn't start with /), StepSpec.output_path now automatically prepends output_path_prefix or marin_prefix(). This matches the existing Executor behavior and enables datasets to use short relative paths like "raw/fineweb" in StepSpec definitions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
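The relative-path rule described in this commit can be sketched as follows. The helper name resolve_output_path and the prefix handling are hypothetical simplifications of the StepSpec behavior, not the actual implementation:

```python
from urllib.parse import urlparse


def resolve_output_path(override: str, prefix: str) -> str:
    """Prepend the prefix only for relative overrides: no URL scheme, no leading '/'."""
    if urlparse(override).scheme or override.startswith("/"):
        return override  # absolute URL or filesystem path: use as-is
    return f"{prefix.rstrip('/')}/{override}"


# Short relative paths get auto-prefixed, mirroring Executor behavior.
assert resolve_output_path("raw/fineweb", "gs://marin-data") == "gs://marin-data/raw/fineweb"

# Fully-qualified URLs and absolute paths pass through untouched.
assert resolve_output_path("gs://other/raw/fineweb", "gs://marin-data") == "gs://other/raw/fineweb"
assert resolve_output_path("/tmp/raw/fineweb", "gs://marin-data") == "/tmp/raw/fineweb"
```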
Converts the single datakit/download.py file into a datakit/download/ package. Moves the HuggingFace download modules (download_hf, stream_remove_columns, upload_gcs_to_hf) into datakit/download/ as their canonical location. Adds download_hf_step() as the StepSpec factory function. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ries Moves nemotron_cc, uncheatable_eval, ar5iv, dclm_hq, wikipedia, and filesystem modules into datakit/download/. Each module gains a *_step() factory returning StepSpec. Renames ambiguous DownloadConfig classes to Ar5ivDownloadConfig and WikipediaDownloadConfig. The uncheatable_eval make_uncheatable_eval_step() is preserved as a compat wrapper around the new uncheatable_eval_step(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All download module implementations now live in marin.datakit.download.*. The old marin.download.* files are replaced with explicit re-exports from the canonical locations. Renamed configs (Ar5ivDownloadConfig, WikipediaDownloadConfig) are re-exported under their original names for backward compat. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ining.py Creates canonical StepSpec factory functions for all pretraining dataset downloads (fineweb, dclm, slimpajama, etc.) in pretraining.py. Updates simple.py to import from there and build the backward-compat downloads dict via _build_downloads(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update mock/patch targets in test_huggingface.py, test_nemotron_cc.py, and test_dclm_hq.py to point at the canonical marin.datakit.download.* locations. Add _relative_path_in_source to the HF download shim since the test imports it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updates 23 files across experiments/, tests/, and lib/ to import from the canonical marin.datakit.download.* paths. Removes the stale datakit/download.py file left over from the package conversion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updates all 23 consumer files to import from marin.datakit.download.* instead of marin.download.*. Refactors download functions (transfer_files, download_nemotron_cc, extract_dclm_hq_dump) to accept plain parameters instead of requiring config dataclass construction. Config classes are kept for backward compat with ExecutorStep callers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes NemotronIngressConfig, DCLMHQDownloadConfig, and TransferConfig. The underlying functions (download_nemotron_cc, extract_dclm_hq_dump, transfer_files) now take plain parameters directly. Updates tests and nemotron.py experiment to use the flat-param API or *_step() functions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Switches the standard format from Vortex to Parquet throughout the design doc. Notes vortex#6905 as the blocking issue that motivated the change. Parquet provides the same columnar benefits with a mature ecosystem. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the module-level hf_fs = HfFileSystem() with per-call construction to avoid side effects at import time. Update the one external consumer (transform_dclm_hq.py) and test mock target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
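The pattern behind this change is a sketch worth spelling out: move client construction from module scope into each call site so importing the module has no side effects. The class below is a counting stand-in for huggingface_hub's HfFileSystem, and list_repo_files is a hypothetical consumer:

```python
class HfFileSystem:
    """Stand-in for huggingface_hub.HfFileSystem; only construction is modeled."""
    instances = 0

    def __init__(self):
        HfFileSystem.instances += 1  # stands in for network/auth side effects


# Before: a module-level `hf_fs = HfFileSystem()` would run at import time.
# After: each caller constructs its own instance only when actually needed.
def list_repo_files(repo: str) -> list[str]:
    fs = HfFileSystem()  # per-call construction, nothing happens at import
    return []  # real code would list files via the filesystem object


assert HfFileSystem.instances == 0  # defining the function costs nothing
list_repo_files("allenai/dolma")
assert HfFileSystem.instances == 1  # the client exists only once it is used
```

Per-call construction also makes the test mock target straightforward, since there is no cached module-level instance to patch around.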
Removes the single-call wrapper functions in pretraining.py and inlines download_hf_step calls directly in simple.py via a _dl() helper. This eliminates the indirection of one function per dataset. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All imports have been migrated to marin.datakit.download.*. The shim re-export layer has zero consumers and is now removed. Data files (ar5iv JSON, stackexchange TSV) moved to datakit/download/data/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The TSV and README were not referenced by any code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This reverts commit ced1f4f.
Not a download step — it uploads checkpoints from GCS to HuggingFace. Belongs in utilities. Updates the one consumer (exp1063_upload_tootsie). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves ar5iv.py → ar5iv/__init__.py and places ar5iv-v04-2024.json in the same package directory. Removes the now-empty data/ directory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
transfer_step and transfer_files have zero consumers in the codebase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Not invoked anywhere in the codebase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean up the download_step alias and __all__. The one consumer (test_datakit.py) now imports download_hf_step directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Simplify the API — override_output_path with relative path support (auto-prefixed by marin_prefix) is sufficient. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only consumer was the integration test, which now uses StepSpec with TokenizeConfig directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No production consumers. Simplify integration test to download → tokenize only. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the source code location at marin.datakit.download.*. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous commit accidentally removed these when moving the download tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The single-entry dict was unnecessary indirection. All consumers updated to reference the variable directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inline _nemotron_cc_path into its only caller. All consumers now call nemotron_cc_download() instead of referencing a global. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updates 18 files with import paths and mock targets to reflect the rename from huggingface to huggingface_utils. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire override_output_path through NemotronV2Dataset to download_nemotron_v2_step. Fix missing raw/ prefix on nemotron_cc_math_v1. Add overrides for code_v2, specialized_v1, and sft_v1 to pin existing output paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The _utils suffix was misleading — this is the core HF download module, not a utility helper. Reverts all 19 import paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
datakit/download/dclm_hq.py (CC HTML fetcher) and transform/dolmino/transform_dclm_hq.py (HTML→text converter) have zero experiment consumers. Removes their test as well. The DCLM mixture config in experiments/pretraining_datasets/dclm.py is unrelated and kept. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves the download definition and DOLMINO_DATASETS split metadata into a datakit module. The experiment file now imports from there and only handles tokenization wiring. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves download_dolma_step(), DOLMA_DATASETS, and DOLMA_OLMO_MIXTURE_WEIGHTS into a datakit module. The experiment file now imports from there and only handles tokenization wiring. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mixture weights are experiment config, not download metadata. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defaults to the enwiki 20241201 dump with override_output_path pointing at raw/wikipedia-a7dad0 where the data already lives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop the revision nesting from download_wikipedia. The step uses override_output_path to point at existing data when no input_urls are provided. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Wikipedia transform step is now a StepSpec with the download
step as a dep, replacing the hardcoded mirrored() path. Converted
to .as_executor_step().cd("20241201") for backward compat with
downstream consumers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The revision creates a subdirectory under output_path for the dump data, matching the original behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both input_urls and revision must be explicitly provided for new downloads. Existing data is still accessed via override_output_path when neither is set. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same pattern as Wikipedia: download step with override pointing at existing data, transform step as StepSpec with download as dep. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fineweb (never downloaded), the_stack_dedup, and the_pile_openwebtext2 have no consumers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add append_sha_to_path to download_hf_step, fix dolma3 download
which writes files under {output_path}/{revision} (P1 fix)
- Flatten ar5iv from package to single ar5iv.py, delete unused
ar5iv-v04-2024.json data file
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move all download modules from marin.download to marin.datakit.download
- Delete old marin.download directory entirely (no shims)
- Add StepSpec factory functions (download_hf_step, ar5iv_step, etc.) for each downloader
- StepSpec.output_path auto-prefixes relative override paths with marin_prefix
- Extract pretraining downloads (simple.py) to use download_hf_step directly
- Extract nemotron_v2 dataset definitions into datakit/download/nemotron_v2.py
- Extract dolma and dolmino downloads into datakit/download/dolma.py and dolmino.py
- Add download_wikipedia_step with override pointing at existing data
- Wire Wikipedia and ar5iv download+transform in exp934 as StepSpec deps
- Remove unused modules: stream_remove_columns, filesystem, dclm_hq, normalize, tokenize
- Remove unused config dataclasses (NemotronIngressConfig, DCLMHQDownloadConfig, TransferConfig)
- Remove draccus CLI wrappers with zero callers
- Remove global HfFileSystem() instance
- Move upload_gcs_to_hf to marin.utilities (not a download step)
- Move tests to tests/datakit/download/ mirroring source layout
- Update all imports across experiments and lib to canonical paths
- Update design doc to use Parquet instead of Vortex (notes vortex#6905)
Part of #2355
Co-authored-by: Rafal Wojdyla <ravwojdyla@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>