GPU canary: switch from SlimPajama to Nemotron by yonromai · Pull Request #3704 · marin-community/marin

yonromai · 2026-03-15T20:58:53Z

Switch the GPU canary ferry to Nemotron and harden the supporting data path. This adds the Nemotron CC upload fix, larger download-worker memory, a helper to prebuild the upstream data steps, the romain-nt CoreWeave Iris config, and the Zephyr/Iris startup-wait mitigation discovered while debugging.

yonromai · 2026-03-18T18:09:28Z

Test run update (2026-03-18)

Rebased onto main (#3796) + #3822. Tested on CoreWeave CPU cluster (nemo-canary, 4 nodes, 128 workers).

Results

148 Nemotron CC files landed on R2 (~385MB each, paths with = signs). No OOM, no multipart errors, no GCS egress. Coordinator stable for 10+ min.

Changes since draft

R2 multipart fix — open_url with streaming zstd produces unequal multipart parts that R2 rejects. Switched to local temp file + fs.put().
Retry bump — total=5 → 10, backoff_factor=1.0 → 2.0 for Common Crawl 503s.
Dropped atomic_rename — fs.put() from a local file is atomic on R2.
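
The local-temp-file approach from the R2 multipart fix can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `write_jsonl_then_put` and its parameters are hypothetical, `fs` is any object exposing an fsspec-style `put(lpath, rpath)`, and `gzip.open` stands in for a zstd opener so the sketch runs on the standard library alone.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_then_put(records, fs, remote_path, open_compressed=gzip.open, suffix=".jsonl.gz"):
    """Compress records to a local temp file, then upload it in one fs.put() call.

    Hypothetical helper: `fs` only needs an fsspec-style put(lpath, rpath).
    `open_compressed` is assumed to accept (path, "wt", encoding=...).
    """
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # reopen by path through the compressing opener
    try:
        with open_compressed(local_path, "wt", encoding="utf-8") as out:
            for record in records:
                out.write(json.dumps(record) + "\n")
        # The upload now reads from a complete, seekable file instead of a stream.
        fs.put(local_path, remote_path)
    finally:
        os.remove(local_path)  # clean up whether or not the upload succeeded
```

With the `zstandard` package installed, passing `open_compressed=zstandard.open` and `suffix=".jsonl.zst"` should produce a real zstd file. The trade-off is that the whole compressed file must fit on local disk before the upload starts.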

Dependencies

[iris] Terminate replaced coordinators and delete stale endpoints #3822 (coordinator cleanup) — required. Without it, 128-worker fan-out triggers autoscaler node additions that cause the coordinator pod to be rescheduled and the pipeline to hang. See stress test evidence on #3822.

Known issues (not blocking)

write_jsonl_file in zephyr/writers.py has the same R2 multipart vulnerability for large .zst files — worth a follow-up.
bigcode/starcoderdata is HF-gated — canary HF_TOKEN lacks access. Independent of Nemotron CC.
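
For the `write_jsonl_file` follow-up, a quick check of multipart part sizes may help reproduce the rejection. The sketch assumes R2's documented constraint that every part except the last has the same size (at least 5 MiB) and that the last part is no larger; `r2_part_sizes_ok` is a hypothetical helper, not part of zephyr or fsspec.

```python
MIN_PART_BYTES = 5 * 1024 * 1024  # assumed 5 MiB minimum for non-final parts


def r2_part_sizes_ok(part_sizes):
    """Return True if the part sizes follow the assumed R2 multipart rule."""
    if not part_sizes:
        return False
    if len(part_sizes) == 1:
        return True  # a single part is also the final part
    *body, last = part_sizes
    first = body[0]
    uniform = all(size == first for size in body)
    return uniform and first >= MIN_PART_BYTES and last <= first


MIB = 1024 * 1024
print(r2_part_sizes_ok([8 * MIB, 8 * MIB, 3 * MIB]))  # uniform parts, smaller tail
print(r2_part_sizes_ok([8 * MIB, 6 * MIB, 3 * MIB]))  # unequal parts, as a stream may emit
```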

yonromai · 2026-03-18T22:36:43Z

-    with atomic_rename(output_file_path) as temp_path:
-        with open_url(temp_path, "w", compression="zstd") as out:
+    # Write locally then fs.put() — R2 rejects streaming multipart with unequal part sizes.
+    with tempfile.NamedTemporaryFile(suffix=".jsonl.zst", delete=True) as local_tmp:


@rjpower: streaming zstd through atomic_rename/fs.open produces unequal multipart parts that R2 rejects (see details in PR body). Do you think we should move this fix to zephyr / atomic_rename?

👍 in theory we could replace this explicit write with a zephyr.write() call in the future but this is fine to keep as a one off for now.

local disk can be kind of tiny on TPU workers, but other than fixing fsspec, there's not too much to do about it...

rjpower · 2026-03-18T23:18:28Z

🚢

Both canaries now use the same Nemotron CC dataset (NEMOTRON_MIX_WITH_DEFAULT_VALIDATION), matching production MoE runs.

Fixes #3635

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each Nemotron CC download worker decompresses a ~350MB zstd file to ~1.5-2GB in memory. With the default ZephyrContext resources (1GB), workers exceeded their memory limit and were OOMKilled. Raise the limit to 4GB for sufficient headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Streaming zstd compression via open_url produces unequal S3 multipart parts that R2 rejects. Write to a local temp file first, then fs.put() it to R2, which uploads from a seekable file with uniform parts. Also bump the Common Crawl retry from total=5 to total=10 and backoff_factor from 1.0 to 2.0 to handle transient 503s under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
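
To get a feel for what the retry bump buys, the sketch below computes a backoff schedule under an assumed exponential formula: the n-th retry sleeps `backoff_factor * 2**(n-1)` seconds, capped at a maximum. The formula, the 120-second cap, and the function name are assumptions for illustration; the HTTP library actually in use may skip the first delay or add jitter.

```python
def backoff_schedule(total, backoff_factor, backoff_max=120.0):
    """Return the assumed sleep in seconds before each of `total` retries."""
    return [min(backoff_max, backoff_factor * 2 ** (n - 1)) for n in range(1, total + 1)]


old = backoff_schedule(total=5, backoff_factor=1.0)
new = backoff_schedule(total=10, backoff_factor=2.0)
print(sum(old))  # 31.0
print(sum(new))  # 606.0
```

Under these assumptions the worst-case cumulative wait grows from about half a minute to roughly ten minutes, which gives transient 503s more time to clear.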

CoreWeave controller restarts and delayed worker materialization can leave Zephyr timing out after an hour, then crash-looping late workers. Raise the controller default memory and keep startup waits and cleanup aligned with the broader no-workers budget so future runs fail with actionable diagnostics.

Add a dedicated executor entrypoint to prebuild the Nemotron canary data dependencies and check in the CoreWeave Iris config used to run it.

yonromai added the agent-generated label (Created by automation/agent) Mar 15, 2026

yonromai force-pushed the iris-cw-namespace-isolation branch 5 times, most recently from 6af1b20 to fa9c454 on March 15, 2026 23:28

Base automatically changed from iris-cw-namespace-isolation to main March 15, 2026 23:38

yonromai force-pushed the romain/gpu-canary-nemotron branch 2 times, most recently from c6cfb15 to d4922bc on March 16, 2026 00:14

yonromai changed the base branch from main to romain/fix-iris-query-pb2 March 16, 2026 00:14

Base automatically changed from romain/fix-iris-query-pb2 to main March 16, 2026 00:28

yonromai force-pushed the romain/gpu-canary-nemotron branch from d4922bc to d41887d on March 16, 2026 00:39

yonromai changed the title ~~GPU canary: switch from SlimPajama to Nemotron~~ WIP: GPU canary: switch from SlimPajama to Nemotron Mar 16, 2026

rjpower approved these changes Mar 16, 2026

yonromai force-pushed the romain/gpu-canary-nemotron branch from 44d38e1 to c55c11b on March 18, 2026 22:19

yonromai changed the title ~~WIP: GPU canary: switch from SlimPajama to Nemotron~~ GPU canary: switch from SlimPajama to Nemotron Mar 18, 2026

yonromai changed the base branch from main to agent/20260318-fix-3705-v2 March 18, 2026 22:20

yonromai force-pushed the romain/gpu-canary-nemotron branch 2 times, most recently from 44c1257 to 4eb4ff5 on March 18, 2026 22:32

yonromai commented Mar 18, 2026

yonromai force-pushed the romain/gpu-canary-nemotron branch from 4eb4ff5 to 7862630 on March 18, 2026 23:01

Base automatically changed from agent/20260318-fix-3705-v2 to main March 18, 2026 23:13

rjpower approved these changes Mar 18, 2026

yonromai force-pushed the romain/gpu-canary-nemotron branch 2 times, most recently from 5c7f94d to 22f9af5 on March 20, 2026 21:37

yonromai changed the base branch from main to configmap-limit March 20, 2026 21:37

yonromai force-pushed the configmap-limit branch from 4ca8892 to 5ba672c on March 20, 2026 23:30

yonromai force-pushed the romain/gpu-canary-nemotron branch from 22f9af5 to 01e4694 on March 20, 2026 23:33

yonromai force-pushed the configmap-limit branch 6 times, most recently from 3268aad to 21f34e1 on March 23, 2026 22:59

Base automatically changed from configmap-limit to main March 23, 2026 23:10

yonromai force-pushed the romain/gpu-canary-nemotron branch 12 times, most recently from 72131a9 to c951de4 on March 30, 2026 18:41

yonromai force-pushed the romain/gpu-canary-nemotron branch 3 times, most recently from 2d8f565 to 8375859 on April 1, 2026 01:15

yonromai and others added 5 commits April 1, 2026 09:09

GPU canary: switch from SlimPajama to Nemotron

16dca09

[ferries] Add Nemotron data helper and CoreWeave config

defa05c

yonromai changed the base branch from main to codex/fix-tokenize-input-paths April 1, 2026 16:09

yonromai force-pushed the romain/gpu-canary-nemotron branch from 8375859 to defa05c on April 1, 2026 16:09

[ferries] Cap Nemotron cache copy fanout on CW

59f0c42

GPU canary: switch from SlimPajama to Nemotron #3704
yonromai wants to merge 6 commits into codex/fix-tokenize-input-paths from romain/gpu-canary-nemotron

3 participants
