
Magic dtensor patch #193

Closed · luciaquirke wants to merge 39 commits into magic from magic-dtensor-patch

Conversation

@luciaquirke
Collaborator

No description provided.

norabelrose and others added 30 commits September 11, 2025 20:44
…ight support

- Add bergson/magic_patch.py: runtime monkey-patch for twice-differentiable
  DTensor redistribution (pytorch/pytorch#160509), replacing the old
  magic_wmdp_setup.sh that modified torch source files on disk
- Add per_token mode to DataStream for [n_examples, max_length] weight tensors
- Support 2D [B, T] per-token weights in weighted_causal_lm_ce
- Fix backward weight_grads accumulation when autograd returns None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
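A minimal sketch of what the per-token weighted loss described above might look like. The name `weighted_causal_lm_ce` and the `[B, T]` weight layout come from the commit; the exact implementation details (shifting, reduction) are assumptions:

```python
import torch
import torch.nn.functional as F

def weighted_causal_lm_ce(logits, labels, weights):
    """Causal-LM cross-entropy with per-example or per-token weights.

    logits:  [B, T, V]       model outputs
    labels:  [B, T]          token ids
    weights: [B] or [B, T]   loss weights
    """
    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    per_tok = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)  # [B, T-1]
    if weights.dim() == 1:
        weights = weights[:, None]   # per-example: broadcast over tokens
    else:
        weights = weights[:, 1:]     # per-token: align with shifted targets
    return (per_tok * weights).mean()
```

With all-ones weights, both the 1D and 2D paths reduce to the plain mean cross-entropy, which is a useful sanity check.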
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The weight gradient from autograd.grad should always be a tensor since
data.weights participates in the computation graph via weighted_causal_lm_ce.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
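The invariant above is easy to check in isolation: when a weight tensor participates in the loss, `torch.autograd.grad` returns a real tensor for it rather than `None`. A small stand-in for the actual loss (the per-example values here are arbitrary, not the real `weighted_causal_lm_ce`):

```python
import torch

# weights participate in the computation graph, so autograd.grad
# produces a gradient tensor of the same shape, never None.
weights = torch.ones(4, requires_grad=True)
per_example_loss = torch.arange(4.0)      # stand-in for per-example CE
loss = (per_example_loss * weights).mean()
(wgrad,) = torch.autograd.grad(loss, weights)
```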
Multiple concurrent DCP async_save calls each create their own Gloo
process group. With consecutive saves at steps 20-24 (last_start logic),
up to 5 saves were in-flight simultaneously. Background threads from these
saves may call distributed operations that conflict, causing all ranks to
deadlock in fut.result() until the NCCL watchdog times out.

Limit to one concurrent save at a time: wait for the previous save to
complete before starting the next one. Each save still overlaps with at
least one training step, so async I/O benefit is preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
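The one-save-in-flight pattern described above can be sketched generically. DCP's `async_save` returns a future-like object; this stand-in uses a single-worker thread pool instead of DCP so the serialization logic is visible on its own (all names here are hypothetical, not the repo's API):

```python
from concurrent.futures import ThreadPoolExecutor

class OneAtATimeSaver:
    """Allow only one in-flight async save: drain the previous
    future before launching the next background save."""

    def __init__(self, save_fn):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._save_fn = save_fn
        self._pending = None

    def save(self, *args):
        if self._pending is not None:
            # Block until the previous save finishes; this is what
            # prevents multiple saves (and their process groups)
            # from being alive simultaneously.
            self._pending.result()
        self._pending = self._pool.submit(self._save_fn, *args)
        return self._pending

    def finish(self):
        if self._pending is not None:
            self._pending.result()
```

Each `save` still overlaps with the caller's ongoing work, so the async I/O benefit mentioned in the commit is preserved; only back-to-back saves serialize.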
Raises a clear ValueError at init time when the dataset doesn't have
enough examples for the requested number of batches, instead of crashing
with an IndexError mid-training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
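The fail-fast check might look like the following. `DataStream` is the class named in the commit, but the constructor parameters here are assumptions for illustration:

```python
class DataStream:
    """Sketch of init-time validation (parameter names assumed)."""

    def __init__(self, n_examples: int, batch_size: int, n_batches: int):
        needed = batch_size * n_batches
        if n_examples < needed:
            # Fail at construction instead of an IndexError mid-training.
            raise ValueError(
                f"Dataset has {n_examples} examples, but {n_batches} "
                f"batches of size {batch_size} require {needed}."
            )
        self.n_examples = n_examples
        self.batch_size = batch_size
        self.n_batches = n_batches
```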
PyTorch's Future.result() waits for done callbacks to complete before
returning. The destroy_process_group callback was invoked from DCP's
background thread after each save, but destroy_process_group may do
a barrier on the Gloo group. Since ranks complete their I/O at different
times, the fast rank would deadlock waiting for the slow rank to also
call destroy_process_group, while the slow rank was still in fut.result().

DCP holds its own reference to the process group, keeping it alive for
the duration of the background I/O. GC will clean it up afterwards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
luciaquirke and others added 9 commits March 7, 2026 16:52
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip per_token parameter from DataStream and 2D weight path from
weighted_causal_lm_ce to keep the merge scope minimal. The per-token
code is preserved on the magic-per-token branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ norabelrose
✅ luciaquirke
❌ root


root does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Have you already signed the CLA but the status is still pending? Let us recheck it.

