Bootstrap datakit - consolidate and cleanup downloads #4142

Merged

ravwojdyla merged 56 commits into main from rav-datakit-norm on Mar 26, 2026

Conversation

ravwojdyla-agent (Contributor) commented Mar 25, 2026

  • Move all download modules from marin.download to marin.datakit.download
  • Delete old marin.download directory entirely (no shims)
  • Add StepSpec factory functions (download_hf_step, ar5iv_step, etc.) for each downloader
  • StepSpec.output_path auto-prefixes relative override paths with marin_prefix
  • Extract pretraining downloads (simple.py) to use download_hf_step directly
  • Extract nemotron_v2 dataset definitions into datakit/download/nemotron_v2.py
  • Extract dolma and dolmino downloads into datakit/download/dolma.py and dolmino.py
  • Add download_wikipedia_step with override pointing at existing data
  • Wire Wikipedia and ar5iv download+transform in exp934 as StepSpec deps
  • Remove unused modules: stream_remove_columns, filesystem, dclm_hq, normalize, tokenize
  • Remove unused config dataclasses (NemotronIngressConfig, DCLMHQDownloadConfig, TransferConfig)
  • Remove draccus CLI wrappers with zero callers
  • Remove global HfFileSystem() instance
  • Move upload_gcs_to_hf to marin.utilities (not a download step)
  • Move tests to tests/datakit/download/ mirroring source layout
  • Update all imports across experiments and lib to canonical paths
  • Update design doc to use Parquet instead of Vortex (notes vortex#6905)

Part of #2355

@ravwojdyla-agent ravwojdyla-agent added the agent-generated Created by automation/agent label Mar 25, 2026
@ravwojdyla ravwojdyla marked this pull request as ready for review March 25, 2026 21:19
chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e99b40bb47


gcs_output_path=this_output_path(),
wait_for_completion=True,
"slimpajama_6b": slimpajama_6b_download().as_executor_step().cd("data"),
"dolma3_mix_150b_1025": dolma3_mix_150b_1025_download().as_executor_step().cd("15d04ee"),


P1: Restore append_sha behavior for dolma3 download

This path now points into a revision subdirectory (.cd("15d04ee")), but the new dolma3_mix_150b_1025_download() StepSpec no longer sets append_sha_to_path=True on DownloadConfig (it used to in the old inline ExecutorStep), so files are written directly under the base output directory instead of <output>/15d04ee. In the dolma3 pipeline, downstream tokenization will read from a non-existent input path and fail to find data.
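A minimal sketch of the path mismatch the review describes. The flag name append_sha_to_path comes from the comment above; the helper and its exact semantics are illustrative assumptions, not the real marin API.

```python
from pathlib import PurePosixPath

def resolved_download_dir(output_path: str, sha: str, append_sha_to_path: bool) -> str:
    # With append_sha_to_path=True the downloader writes under <output>/<sha>,
    # which is the directory a downstream .cd("15d04ee") consumer reads from.
    if append_sha_to_path:
        return str(PurePosixPath(output_path) / sha)
    return output_path

# The regression: the writer uses the flat path, the reader expects the nested one.
writer = resolved_download_dir("raw/dolma3", "15d04ee", append_sha_to_path=False)
reader = "raw/dolma3/15d04ee"
print(writer == reader)  # False -> downstream tokenization finds no data
```

With the flag restored, writer and reader agree on the nested path, which is the fix applied in a later commit on this PR.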


@ravwojdyla ravwojdyla requested review from Helw150 and rjpower March 26, 2026 02:51
@ravwojdyla ravwojdyla changed the title [WIP] [datakit] Add download, normalize, tokenize modules and consolidate download code Bootstrap datakit - consolidate and cleanup downloads Mar 26, 2026
rjpower (Collaborator) left a comment


looks like a nice cleanup!

(At some point I think we discussed moving the steps into some kind of pipeline definition file, but I think that's best done in a separate PR anyway.)

Collaborator


this seems unused, remove?

Contributor


I actually have no idea what this file is, but the ar5iv data is definitely used, so Chesterton's fence decided to keep it.

Collaborator


nit: ar5iv/download -> ar5iv.py


rjpower commented Mar 26, 2026 via email

ravwojdyla and others added 20 commits March 26, 2026 16:32
Implements the first three stages of the datakit pipeline per the design
doc (#2355): download_step wraps download_hf, normalize converts raw
files to sorted/deduped Parquet with content-hash IDs, and tokenize_step
wraps the existing tokenizer for Levanter cache output. Integration test
exercises the full DAG (download → normalize → tokenize) via StepRunner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When override_output_path is a relative path (no URL scheme, doesn't
start with /), StepSpec.output_path now automatically prepends
output_path_prefix or marin_prefix(). This matches the existing
Executor behavior and enables datasets to use short relative paths
like "raw/fineweb" in StepSpec definitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
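The prefixing rule in the commit above can be sketched like this. Scheme detection via urlparse is an assumption on my part; the real StepSpec.output_path logic may differ in detail.

```python
from urllib.parse import urlparse

def resolve_output_path(override: str, prefix: str) -> str:
    # Paths with a URL scheme (gs://, s3://, ...) or a leading "/" pass
    # through untouched; relative paths like "raw/fineweb" get the
    # configured prefix (output_path_prefix or marin_prefix()) prepended.
    if override.startswith("/") or urlparse(override).scheme:
        return override
    return f"{prefix.rstrip('/')}/{override}"

print(resolve_output_path("raw/fineweb", "gs://marin-data"))     # gs://marin-data/raw/fineweb
print(resolve_output_path("gs://elsewhere/raw", "gs://marin-data"))  # gs://elsewhere/raw
```

This is what lets dataset definitions use short relative paths while absolute overrides keep pinning existing data in place.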
Converts the single datakit/download.py file into a
datakit/download/ package. Moves the HuggingFace download modules
(download_hf, stream_remove_columns, upload_gcs_to_hf) into
datakit/download/ as their canonical location. Adds download_hf_step()
as the StepSpec factory function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ries

Moves nemotron_cc, uncheatable_eval, ar5iv, dclm_hq, wikipedia, and
filesystem modules into datakit/download/. Each module gains a *_step()
factory returning StepSpec. Renames ambiguous DownloadConfig classes to
Ar5ivDownloadConfig and WikipediaDownloadConfig. The uncheatable_eval
make_uncheatable_eval_step() is preserved as a compat wrapper around
the new uncheatable_eval_step().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All download module implementations now live in marin.datakit.download.*.
The old marin.download.* files are replaced with explicit re-exports from
the canonical locations. Renamed configs (Ar5ivDownloadConfig,
WikipediaDownloadConfig) are re-exported under their original names
for backward compat.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ining.py

Creates canonical StepSpec factory functions for all pretraining dataset
downloads (fineweb, dclm, slimpajama, etc.) in pretraining.py. Updates
simple.py to import from there and build the backward-compat downloads
dict via _build_downloads().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update mock/patch targets in test_huggingface.py, test_nemotron_cc.py,
and test_dclm_hq.py to point at the canonical marin.datakit.download.*
locations. Add _relative_path_in_source to the HF download shim since
the test imports it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updates 23 files across experiments/, tests/, and lib/ to import from
the canonical marin.datakit.download.* paths. Removes the stale
datakit/download.py file left over from the package conversion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updates all 23 consumer files to import from marin.datakit.download.*
instead of marin.download.*. Refactors download functions (transfer_files,
download_nemotron_cc, extract_dclm_hq_dump) to accept plain parameters
instead of requiring config dataclass construction. Config classes are
kept for backward compat with ExecutorStep callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removes NemotronIngressConfig, DCLMHQDownloadConfig, and TransferConfig.
The underlying functions (download_nemotron_cc, extract_dclm_hq_dump,
transfer_files) now take plain parameters directly. Updates tests and
nemotron.py experiment to use the flat-param API or *_step() functions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
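The flat-param refactor above can be illustrated with a toy version of transfer_files. The function and config names come from the commit; the bodies here are hypothetical stand-ins, and the compat shim shown was itself removed shortly after.

```python
from dataclasses import dataclass

def transfer_files(input_path: str, output_path: str, overwrite: bool = False) -> dict:
    # Flat-parameter API: callers pass arguments directly, no config object.
    return {"input": input_path, "output": output_path, "overwrite": overwrite}

@dataclass
class TransferConfig:
    # Thin backward-compat shim for ExecutorStep callers, forwarding to the
    # flat-param function (illustrative; the real shim differed in detail).
    input_path: str
    output_path: str
    overwrite: bool = False

    def run(self) -> dict:
        return transfer_files(self.input_path, self.output_path, self.overwrite)

direct = transfer_files("gs://src", "gs://dst")
via_config = TransferConfig("gs://src", "gs://dst").run()
print(direct == via_config)  # True
```

Callers that only need the function no longer construct a dataclass first, while legacy step wiring keeps working through the shim.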
Switches the standard format from Vortex to Parquet throughout the
design doc. Notes vortex#6905 as the blocking issue that motivated
the change. Parquet provides the same columnar benefits with a
mature ecosystem.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the module-level hf_fs = HfFileSystem() with per-call
construction to avoid side effects at import time. Update the
one external consumer (transform_dclm_hq.py) and test mock target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
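The per-call construction pattern can be sketched as below. HfFileSystem here is a local fake standing in for huggingface_hub.HfFileSystem (the real class talks to the Hub, which is exactly why building it at import time was undesirable); list_repo_files and fs_factory are illustrative names.

```python
class HfFileSystem:
    # Stand-in for huggingface_hub.HfFileSystem; the real one hits the network.
    def ls(self, path: str) -> list[str]:
        return [f"{path}/data.parquet"]

def list_repo_files(repo_id: str, fs_factory=HfFileSystem) -> list[str]:
    # Construct the filesystem per call instead of at module import time:
    # importing the module has no side effects, and tests can inject a fake
    # via fs_factory instead of patching a module-level global.
    fs = fs_factory()
    return fs.ls(repo_id)

print(list_repo_files("datasets/allenai/dolma"))
```

The cost of one extra construction per call is negligible next to the download itself, and the import-time behavior becomes predictable.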
Removes the single-call wrapper functions in pretraining.py and inlines
download_hf_step calls directly in simple.py via a _dl() helper. This
eliminates the indirection of one function per dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All imports have been migrated to marin.datakit.download.*. The shim
re-export layer has zero consumers and is now removed. Data files
(ar5iv JSON, stackexchange TSV) moved to datakit/download/data/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The TSV and README were not referenced by any code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Not a download step — it uploads checkpoints from GCS to HuggingFace.
Belongs in utilities. Updates the one consumer (exp1063_upload_tootsie).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves ar5iv.py → ar5iv/__init__.py and places ar5iv-v04-2024.json
in the same package directory. Removes the now-empty data/ directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
transfer_step and transfer_files have zero consumers in the codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ravwojdyla and others added 26 commits March 26, 2026 16:32
Not invoked anywhere in the codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean up the download_step alias and __all__. The one consumer
(test_datakit.py) now imports download_hf_step directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Simplify the API — override_output_path with relative path support
(auto-prefixed by marin_prefix) is sufficient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only consumer was the integration test, which now uses StepSpec
with TokenizeConfig directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No production consumers. Simplify integration test to download →
tokenize only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirrors the source code location at marin.datakit.download.*.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous commit accidentally removed these when moving the
download tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The single-entry dict was unnecessary indirection. All consumers
updated to reference the variable directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Inline _nemotron_cc_path into its only caller. All consumers now
call nemotron_cc_download() instead of referencing a global.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Updates 18 files with import paths and mock targets to reflect
the rename from huggingface to huggingface_utils.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Wire override_output_path through NemotronV2Dataset to
download_nemotron_v2_step. Fix missing raw/ prefix on
nemotron_cc_math_v1. Add overrides for code_v2, specialized_v1,
and sft_v1 to pin existing output paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The _utils suffix was misleading — this is the core HF download
module, not a utility helper. Reverts all 19 import paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
datakit/download/dclm_hq.py (CC HTML fetcher) and
transform/dolmino/transform_dclm_hq.py (HTML→text converter) have
zero experiment consumers. Removes their test as well. The DCLM
mixture config in experiments/pretraining_datasets/dclm.py is
unrelated and kept.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves the download definition and DOLMINO_DATASETS split metadata
into a datakit module. The experiment file now imports from there
and only handles tokenization wiring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Moves download_dolma_step(), DOLMA_DATASETS, and
DOLMA_OLMO_MIXTURE_WEIGHTS into a datakit module. The experiment
file now imports from there and only handles tokenization wiring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mixture weights are experiment config, not download metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Defaults to the enwiki 20241201 dump with override_output_path
pointing at raw/wikipedia-a7dad0 where the data already lives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop the revision nesting from download_wikipedia. The step uses
override_output_path to point at existing data when no input_urls
are provided.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Wikipedia transform step is now a StepSpec with the download
step as a dep, replacing the hardcoded mirrored() path. Converted
to .as_executor_step().cd("20241201") for backward compat with
downstream consumers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
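The .cd("20241201") backward-compat trick mentioned above can be sketched like this. ExecutorStep is reduced to a single field for illustration; the real class carries much more state, and this path-joining behavior is an assumption based on how .cd() is used in this PR.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutorStep:
    # Illustrative sketch; the real ExecutorStep is far richer than this.
    output_path: str

    def cd(self, subdir: str) -> "ExecutorStep":
        # Point downstream consumers into a subdirectory of this step's
        # output, e.g. pinning the "20241201" dump revision.
        return ExecutorStep(output_path=f"{self.output_path}/{subdir}")

step = ExecutorStep("raw/wikipedia-a7dad0")
print(step.cd("20241201").output_path)  # raw/wikipedia-a7dad0/20241201
```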
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The revision creates a subdirectory under output_path for the dump
data, matching the original behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both input_urls and revision must be explicitly provided for new
downloads. Existing data is still accessed via override_output_path
when neither is set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Same pattern as Wikipedia: download step with override pointing at
existing data, transform step as StepSpec with download as dep.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fineweb (never downloaded), the_stack_dedup, and
the_pile_openwebtext2 have no consumers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add append_sha_to_path to download_hf_step, fix dolma3 download
  which writes files under {output_path}/{revision} (P1 fix)
- Flatten ar5iv from package to single ar5iv.py, delete unused
  ar5iv-v04-2024.json data file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ravwojdyla ravwojdyla merged commit 44fe6ee into main Mar 26, 2026
39 checks passed
@ravwojdyla ravwojdyla deleted the rav-datakit-norm branch March 26, 2026 23:36
Helw150 pushed a commit that referenced this pull request Apr 8, 2026
Co-authored-by: Rafal Wojdyla <ravwojdyla@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 participants