
Magic dtensor patch #193

Closed · luciaquirke wants to merge 39 commits into magic from magic-dtensor-patch

Conversation

@luciaquirke
Collaborator

No description provided.

norabelrose and others added 30 commits September 11, 2025 20:44
…ight support

- Add bergson/magic_patch.py: runtime monkey-patch for twice-differentiable
  DTensor redistribution (pytorch/pytorch#160509), replacing the old
  magic_wmdp_setup.sh that modified torch source files on disk
- Add per_token mode to DataStream for [n_examples, max_length] weight tensors
- Support 2D [B, T] per-token weights in weighted_causal_lm_ce
- Fix backward weight_grads accumulation when autograd returns None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
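A minimal sketch of what the per-token weighted loss described above might look like. The name `weighted_causal_lm_ce` and the `[B, T]` weight layout come from the commit; the exact implementation details (shifting, reduction) are assumptions:

```python
import torch
import torch.nn.functional as F

def weighted_causal_lm_ce(logits, labels, weights):
    """Causal-LM cross-entropy with per-example or per-token weights.

    logits:  [B, T, V]       model outputs
    labels:  [B, T]          token ids
    weights: [B] or [B, T]   loss weights
    """
    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    per_tok = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)  # [B, T-1]
    if weights.dim() == 1:
        weights = weights[:, None]   # per-example: broadcast over tokens
    else:
        weights = weights[:, 1:]     # per-token: align with shifted targets
    return (per_tok * weights).mean()
```

With all-ones weights, both the 1D and 2D paths reduce to the plain mean cross-entropy, which is a useful sanity check.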
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The weight gradient from autograd.grad should always be a tensor since
data.weights participates in the computation graph via weighted_causal_lm_ce.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
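The invariant above is easy to check in isolation: when a weight tensor participates in the loss, `torch.autograd.grad` returns a real tensor for it rather than `None`. A small stand-in for the actual loss (the per-example values here are arbitrary, not the real `weighted_causal_lm_ce`):

```python
import torch

# weights participate in the computation graph, so autograd.grad
# produces a gradient tensor of the same shape, never None.
weights = torch.ones(4, requires_grad=True)
per_example_loss = torch.arange(4.0)      # stand-in for per-example CE
loss = (per_example_loss * weights).mean()
(wgrad,) = torch.autograd.grad(loss, weights)
```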
Multiple concurrent DCP async_save calls each create their own Gloo
process group. With consecutive saves at steps 20-24 (last_start logic),
up to 5 saves were in-flight simultaneously. Background threads from these
saves may call distributed operations that conflict, causing all ranks to
deadlock in fut.result() until the NCCL watchdog times out.

Limit to one concurrent save at a time: wait for the previous save to
complete before starting the next one. Each save still overlaps with at
least one training step, so async I/O benefit is preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
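The one-save-in-flight pattern described above can be sketched generically. DCP's `async_save` returns a future-like object; this stand-in uses a single-worker thread pool instead of DCP so the serialization logic is visible on its own (all names here are hypothetical, not the repo's API):

```python
from concurrent.futures import ThreadPoolExecutor

class OneAtATimeSaver:
    """Allow only one in-flight async save: drain the previous
    future before launching the next background save."""

    def __init__(self, save_fn):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._save_fn = save_fn
        self._pending = None

    def save(self, *args):
        if self._pending is not None:
            # Block until the previous save finishes; this is what
            # prevents multiple saves (and their process groups)
            # from being alive simultaneously.
            self._pending.result()
        self._pending = self._pool.submit(self._save_fn, *args)
        return self._pending

    def finish(self):
        if self._pending is not None:
            self._pending.result()
```

Each `save` still overlaps with the caller's ongoing work, so the async I/O benefit mentioned in the commit is preserved; only back-to-back saves serialize.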
Raises a clear ValueError at init time when the dataset doesn't have
enough examples for the requested number of batches, instead of crashing
with an IndexError mid-training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
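The fail-fast check might look like the following. `DataStream` is the class named in the commit, but the constructor parameters here are assumptions for illustration:

```python
class DataStream:
    """Sketch of init-time validation (parameter names assumed)."""

    def __init__(self, n_examples: int, batch_size: int, n_batches: int):
        needed = batch_size * n_batches
        if n_examples < needed:
            # Fail at construction instead of an IndexError mid-training.
            raise ValueError(
                f"Dataset has {n_examples} examples, but {n_batches} "
                f"batches of size {batch_size} require {needed}."
            )
        self.n_examples = n_examples
        self.batch_size = batch_size
        self.n_batches = n_batches
```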
PyTorch's Future.result() waits for done callbacks to complete before
returning. The destroy_process_group callback was invoked from DCP's
background thread after each save, but destroy_process_group may do
a barrier on the Gloo group. Since ranks complete their I/O at different
times, the fast rank would deadlock waiting for the slow rank to also
call destroy_process_group, while the slow rank was still in fut.result().

DCP holds its own reference to the process group, keeping it alive for
the duration of the background I/O. GC will clean it up afterwards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
luciaquirke and others added 9 commits March 7, 2026 16:52
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip per_token parameter from DataStream and 2D weight path from
weighted_causal_lm_ce to keep the merge scope minimal. The per-token
code is preserved on the magic-per-token branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ norabelrose
✅ luciaquirke
❌ root


root does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Have you already signed the CLA but the status is still pending? Let us recheck it.

