GPU canary: switch from SlimPajama to Nemotron #3704

Draft
yonromai wants to merge 6 commits into codex/fix-tokenize-input-paths from romain/gpu-canary-nemotron

Conversation

@yonromai
Contributor

@yonromai yonromai commented Mar 15, 2026

Switch the GPU canary ferry to Nemotron and harden the supporting data path. This adds the Nemotron CC upload fix, larger download-worker memory, a helper to prebuild the upstream data steps, the romain-nt CoreWeave Iris config, and the Zephyr/Iris startup-wait mitigation discovered while debugging.

@yonromai yonromai added the agent-generated Created by automation/agent label Mar 15, 2026
@yonromai yonromai force-pushed the iris-cw-namespace-isolation branch 5 times, most recently from 6af1b20 to fa9c454 Compare March 15, 2026 23:28
Base automatically changed from iris-cw-namespace-isolation to main March 15, 2026 23:38
@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch 2 times, most recently from c6cfb15 to d4922bc Compare March 16, 2026 00:14
@yonromai yonromai changed the base branch from main to romain/fix-iris-query-pb2 March 16, 2026 00:14
Base automatically changed from romain/fix-iris-query-pb2 to main March 16, 2026 00:28
@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch from d4922bc to d41887d Compare March 16, 2026 00:39
@yonromai yonromai changed the title GPU canary: switch from SlimPajama to Nemotron WIP: GPU canary: switch from SlimPajama to Nemotron Mar 16, 2026
@yonromai
Contributor Author

Test run update (2026-03-18)

Rebased onto main (#3796) + #3822. Tested on CoreWeave CPU cluster (nemo-canary, 4 nodes, 128 workers).

Results

148 Nemotron CC files landed on R2 (~385MB each, paths with = signs). No OOM, no multipart errors, no GCS egress. Coordinator stable for 10+ min.

Changes since draft

  1. R2 multipart fix: open_url with streaming zstd produces unequal multipart parts that R2 rejects. Switched to local temp file + fs.put().
  2. Retry bump: total=5 → 10, backoff_factor=1.0 → 2.0 for Common Crawl 503s.
  3. Dropped atomic_rename: fs.put() from a local file is atomic on R2.
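
To give a feel for what the retry bump in item 2 buys, here is a rough sketch of the worst-case wait before and after the change. It assumes that `total` and `backoff_factor` are passed to urllib3's `Retry` (the PR does not say which retry helper is used) and that the sleep follows urllib3's exponential backoff with no sleep before the first retry and a 120-second cap. The function `backoff_schedule` is a hypothetical helper written for this illustration, not code from the PR.

```python
def backoff_schedule(total: int, backoff_factor: float, backoff_max: float = 120.0) -> list[float]:
    """Sleep (in seconds) before each retry, assuming urllib3-style exponential backoff."""
    sleeps = []
    for attempt in range(1, total + 1):
        if attempt <= 1:
            # Assumption: the first retry is issued immediately.
            sleeps.append(0.0)
        else:
            # Assumption: backoff_factor * 2^(attempt - 1), capped at backoff_max.
            sleeps.append(min(backoff_max, backoff_factor * 2 ** (attempt - 1)))
    return sleeps


old = backoff_schedule(total=5, backoff_factor=1.0)
new = backoff_schedule(total=10, backoff_factor=2.0)
print(f"old worst-case wait: {sum(old):.0f}s, new worst-case wait: {sum(new):.0f}s")
```

Under those assumptions the old settings give up after about half a minute of cumulative backoff, while the new ones keep retrying for roughly ten minutes, which is a better match for 503s that persist while Common Crawl is under load.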

Dependencies

Known issues (not blocking)

  • write_jsonl_file in zephyr/writers.py has the same R2 multipart vulnerability for large .zst files — worth a follow-up.
  • bigcode/starcoderdata is HF-gated — canary HF_TOKEN lacks access. Independent of Nemotron CC.

@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch from 44d38e1 to c55c11b Compare March 18, 2026 22:19
@yonromai yonromai changed the title WIP: GPU canary: switch from SlimPajama to Nemotron GPU canary: switch from SlimPajama to Nemotron Mar 18, 2026
@yonromai yonromai changed the base branch from main to agent/20260318-fix-3705-v2 March 18, 2026 22:20
@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch 2 times, most recently from 44c1257 to 4eb4ff5 Compare March 18, 2026 22:32
with atomic_rename(output_file_path) as temp_path:
with open_url(temp_path, "w", compression="zstd") as out:
# Write locally then fs.put() — R2 rejects streaming multipart with unequal part sizes.
with tempfile.NamedTemporaryFile(suffix=".jsonl.zst", delete=True) as local_tmp:
Contributor Author

@rjpower: streaming zstd through atomic_rename/fs.open produces unequal multipart parts that R2 rejects (see details in PR body). Do you think we should move this fix to zephyr / atomic_rename?

Collaborator

👍 in theory we could replace this explicit write with a zephyr.write() call in the future but this is fine to keep as a one off for now.

local disk can be kind of tiny on TPU workers, but other than fixing fsspec, there's not too much to do about it...
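
For readers following this thread, the write-locally-then-upload pattern under discussion can be sketched roughly as below. This is a minimal illustration, not the PR's code: `write_then_put` and `local_put` are hypothetical names, `gzip` stands in for zstd so the sketch runs on the standard library alone, and the `put` callable stands in for `fs.put()` on an fsspec filesystem.

```python
import gzip
import json
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def write_then_put(
    records: Iterable[dict],
    remote_path: str,
    put: Callable[[str, str], None],
) -> None:
    """Compress records into a local temp file, then upload the finished file in one call."""
    with tempfile.NamedTemporaryFile(suffix=".jsonl.gz", delete=True) as local_tmp:
        # gzip is a stand-in for zstd here (assumption made to stay stdlib-only).
        with gzip.open(local_tmp, "wt", encoding="utf-8") as out:
            for record in records:
                out.write(json.dumps(record) + "\n")
        # Closing the gzip writer leaves local_tmp open; flush so all bytes are on disk.
        local_tmp.flush()
        # The uploader now sees a complete, seekable file of known size.
        put(local_tmp.name, remote_path)


def local_put(src: str, dst: str) -> None:
    """Hypothetical stand-in for fs.put(): copy into a local directory acting as the remote."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
```

The trade-off is the one raised above: the whole compressed file has to fit on local disk before the upload starts, which is the price of handing the uploader a finished file instead of a stream. Reopening a `NamedTemporaryFile` by name while it is still open works on POSIX systems but not on Windows.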

@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch from 4eb4ff5 to 7862630 Compare March 18, 2026 23:01
Base automatically changed from agent/20260318-fix-3705-v2 to main March 18, 2026 23:13
@rjpower
Collaborator

rjpower commented Mar 18, 2026

:shipit: 🚢

@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch 2 times, most recently from 5c7f94d to 22f9af5 Compare March 20, 2026 21:37
@yonromai yonromai changed the base branch from main to configmap-limit March 20, 2026 21:37
@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch from 22f9af5 to 01e4694 Compare March 20, 2026 23:33
@yonromai yonromai force-pushed the configmap-limit branch 6 times, most recently from 3268aad to 21f34e1 Compare March 23, 2026 22:59
Base automatically changed from configmap-limit to main March 23, 2026 23:10
@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch 12 times, most recently from 72131a9 to c951de4 Compare March 30, 2026 18:41
@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch 3 times, most recently from 2d8f565 to 8375859 Compare April 1, 2026 01:15
yonromai and others added 5 commits April 1, 2026 09:09
Both canaries now use the same Nemotron CC dataset
(NEMOTRON_MIX_WITH_DEFAULT_VALIDATION), matching production MoE runs.

Fixes #3635

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Each Nemotron CC download worker decompresses a ~350MB zstd file to
~1.5-2GB in memory. The default ZephyrContext resources (1GB) caused
OOMKill when workers exceeded their memory limit. Set 4GB to give
sufficient headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Streaming zstd compression via open_url produces unequal S3 multipart
parts that R2 rejects. Write to a local temp file first, then fs.put()
to R2 — this uploads from a seekable file with uniform parts.

Also bumps Common Crawl retry from total=5 to total=10 with
backoff_factor=2.0 to handle transient 503s under load.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CoreWeave controller restarts and delayed worker materialization can leave Zephyr timing out after an hour, then crash-looping late workers. Raise the controller default memory and keep startup waits and cleanup aligned with the broader no-workers budget so future runs fail with actionable diagnostics.

Add a dedicated executor entrypoint to prebuild the Nemotron canary data dependencies and check in the CoreWeave Iris config used to run it.
@yonromai yonromai changed the base branch from main to codex/fix-tokenize-input-paths April 1, 2026 16:09
@yonromai yonromai force-pushed the romain/gpu-canary-nemotron branch from 8375859 to defa05c Compare April 1, 2026 16:09
Labels

agent-generated Created by automation/agent