GPU canary: switch from SlimPajama to Nemotron #3704
yonromai wants to merge 6 commits into codex/fix-tokenize-input-paths
Test run update (2026-03-18)
Rebased onto main (#3796) + #3822. Tested on CoreWeave CPU cluster (
Results
Changes since draft
Dependencies
Known issues (not blocking)
with atomic_rename(output_file_path) as temp_path:
with open_url(temp_path, "w", compression="zstd") as out:
# Write locally then fs.put(); R2 rejects streaming multipart with unequal part sizes.
with tempfile.NamedTemporaryFile(suffix=".jsonl.zst", delete=True) as local_tmp:
@rjpower: streaming zstd through atomic_rename/fs.open produces unequal multipart parts that R2 rejects (see details in PR body). Do you think we should move this fix to zephyr / atomic_rename?
👍 In theory we could replace this explicit write with a zephyr.write() call in the future, but this is fine to keep as a one-off for now.
Local disk can be kind of tiny on TPU workers, but other than fixing fsspec, there's not much to do about it.
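The write-locally-then-upload pattern discussed in this thread can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `write_then_put` and its `put` argument are hypothetical names, `put` stands in for an fsspec-style `fs.put(local_path, remote_path)`, and gzip is swapped in for zstd only so the sketch runs on the standard library alone.

```python
import gzip
import json
import os
import shutil
import tempfile
from typing import Callable, Iterable


def write_then_put(
    records: Iterable[dict],
    remote_path: str,
    put: Callable[[str, str], None],
) -> int:
    """Compress records into a local temp file, then upload the finished file.

    `put` is a hypothetical stand-in for fs.put(local_path, remote_path).
    gzip replaces zstd here purely to avoid third-party dependencies.
    Returns the number of records written.
    """
    count = 0
    # The temp file is removed when this block exits, after the upload is done.
    with tempfile.NamedTemporaryFile(suffix=".jsonl.gz", delete=True) as local_tmp:
        with gzip.open(local_tmp, "wt", encoding="utf-8") as out:
            for record in records:
                out.write(json.dumps(record) + "\n")
                count += 1
        # Make sure every compressed byte is on disk before uploading by name.
        local_tmp.flush()
        put(local_tmp.name, remote_path)
    return count


# Usage: a plain file copy plays the role of the remote upload.
dest = os.path.join(tempfile.mkdtemp(), "part-0.jsonl.gz")
written = write_then_put([{"id": 1}, {"id": 2}], dest, shutil.copyfile)
```

Because the upload starts only after the compressed file is complete and seekable, the uploader knows the total size up front, which is what the thread relies on for uniform parts. The cost is the local disk usage mentioned above. Reopening the temp file by name while it is still open works on POSIX systems but not on Windows.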
Both canaries now use the same Nemotron CC dataset (NEMOTRON_MIX_WITH_DEFAULT_VALIDATION), matching production MoE runs.

Fixes #3635
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each Nemotron CC download worker decompresses a ~350MB zstd file to ~1.5-2GB in memory. The default ZephyrContext resources (1GB) caused OOMKill when workers exceeded their memory limit. Set 4GB to give sufficient headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Streaming zstd compression via open_url produces unequal S3 multipart parts that R2 rejects. Write to a local temp file first, then fs.put() to R2, which uploads from a seekable file with uniform parts.

Also bumps Common Crawl retry from total=5 to total=10 with backoff_factor=2.0 to handle transient 503s under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
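For readers unfamiliar with how `total` and `backoff_factor` interact, here is a generic sketch of retrying with exponential backoff. It is not the retry code used in this PR. `call_with_retries` and `TransientError` are hypothetical names, the parameter names simply mirror the ones quoted above, and the delay schedule (`backoff_factor * 2 ** (n - 1)` before retry `n`, capped at `max_delay`) is an assumption for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as an HTTP 503."""


def call_with_retries(
    fn: Callable[[], T],
    total: int = 10,
    backoff_factor: float = 2.0,
    max_delay: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying up to `total` times when it raises TransientError.

    The wait before retry n (1-based) is backoff_factor * 2 ** (n - 1),
    capped at max_delay. This schedule is assumed, not taken from the PR.
    """
    for attempt in range(total + 1):
        try:
            return fn()
        except TransientError:
            if attempt == total:
                raise  # retries exhausted, surface the last error
            sleep(min(max_delay, backoff_factor * 2 ** attempt))
    raise ValueError("total must be non-negative")


# Usage: fail three times with a simulated 503, then succeed.
delays: List[float] = []
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise TransientError("503")
    return "ok"


# Recording delays instead of sleeping keeps the example instantaneous.
result = call_with_retries(flaky, sleep=delays.append)
```

Under this assumed schedule, raising `total` from 5 to 10 mostly adds retries that wait at the cap, so the request tolerates a much longer stretch of 503s before giving up.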
CoreWeave controller restarts and delayed worker materialization can leave Zephyr timing out after an hour, then crash-looping late workers. Raise the controller default memory and keep startup waits and cleanup aligned with the broader no-workers budget so future runs fail with actionable diagnostics.
Add a dedicated executor entrypoint to prebuild the Nemotron canary data dependencies and check in the CoreWeave Iris config used to run it.
Switch the GPU canary ferry to Nemotron and harden the supporting data path. This adds the Nemotron CC upload fix, larger download-worker memory, a helper to prebuild the upstream data steps, the romain-nt CoreWeave Iris config, and the Zephyr/Iris startup-wait mitigation discovered while debugging.