Add MAGIC CLI with runtime DTensor double-backward patch #174
luciaquirke merged 17 commits into magic
Conversation
@dataclass
class DoubleBackwardConfig:
This is the original RunConfig, but it uses a DataConfig for the query rather than the first item of the dataset, renames save_dir to run_path, and uses a DataConfig for the training data.
"""Random seed for subset permutation."""

def compute_query_gradients(
We could technically use build here with all the Trackstar bells and whistles turned off, but this seems more readable. Technically not DRY. It currently lacks TRL-style tokenization/masking support.
Duplication is fine for now (and maybe forever)
77497a6 to 1be9172
…ight support

- Add bergson/magic_patch.py: runtime monkey-patch for twice-differentiable DTensor redistribution (pytorch/pytorch#160509), replacing the old magic_wmdp_setup.sh that modified torch source files on disk
- Add per_token mode to DataStream for [n_examples, max_length] weight tensors
- Support 2D [B, T] per-token weights in weighted_causal_lm_ce
- Fix backward weight_grads accumulation when autograd returns None
1be9172 to 97fe18f
for more information, see https://pre-commit.ci
The weight gradient from autograd.grad should always be a tensor, since data.weights participates in the computation graph via weighted_causal_lm_ce.
Multiple concurrent DCP async_save calls each create their own Gloo process group. With consecutive saves at steps 20-24 (last_start logic), up to 5 saves were in-flight simultaneously. Background threads from these saves may call distributed operations that conflict, causing all ranks to deadlock in fut.result() until the NCCL watchdog times out.

Limit to one concurrent save at a time: wait for the previous save to complete before starting the next one. Each save still overlaps with at least one training step, so the async I/O benefit is preserved.
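The one-at-a-time logic can be sketched generically. `SerializedSaver` and its names are illustrative stand-ins, not the PR's actual DCP code:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional

class SerializedSaver:
    """At most one in-flight async save: a new save first waits for the
    previous one to finish, then runs in the background so it can still
    overlap with a training step. Illustrative sketch, not the PR code."""

    def __init__(self) -> None:
        # Two workers so that serialization comes from the explicit wait
        # below, not from the pool size.
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._pending: Optional[Future] = None

    def async_save(self, save_fn: Callable[[], None]) -> Future:
        if self._pending is not None:
            self._pending.result()  # block until the previous save completes
        self._pending = self._pool.submit(save_fn)
        return self._pending
```

The explicit `result()` call before submitting is the whole fix: the caller still returns quickly relative to I/O, but there is never more than one save thread doing distributed work at once.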
Raises a clear ValueError at init time when the dataset doesn't have enough examples for the requested number of batches, instead of crashing with an IndexError mid-training.
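A minimal version of that guard (the class and argument names here are hypothetical, not the actual DataStream signature):

```python
class BatchPlan:
    """Validate sizes at construction so a too-small dataset fails with a
    clear ValueError instead of an IndexError mid-training. Hypothetical
    sketch of the guard, not the PR's actual code."""

    def __init__(self, n_examples: int, batch_size: int, num_batches: int) -> None:
        needed = batch_size * num_batches
        if n_examples < needed:
            raise ValueError(
                f"Dataset has {n_examples} examples, but {num_batches} "
                f"batches of size {batch_size} require {needed}."
            )
        self.n_examples = n_examples
        self.batch_size = batch_size
        self.num_batches = num_batches
```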
PyTorch's Future.result() waits for done callbacks to complete before returning. The destroy_process_group callback was invoked from DCP's background thread after each save, but destroy_process_group may do a barrier on the Gloo group. Since ranks complete their I/O at different times, the fast rank would deadlock waiting for the slow rank to also call destroy_process_group, while the slow rank was still in fut.result().

DCP holds its own reference to the process group, keeping it alive for the duration of the background I/O. GC will clean it up afterwards.
Strip the per_token parameter from DataStream and the 2D weight path from weighted_causal_lm_ce to keep the merge scope minimal. The per-token code is preserved on the magic-per-token branch.
"""Runtime monkey-patch for twice-differentiable DTensor redistribution.

Implements pytorch/pytorch#160509 at runtime, avoiding the need to modify
torch source files on disk. Call `apply_dtensor_patch()` before any DTensor
Ah, good idea using monkey patching rather than actually changing the files
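For readers unfamiliar with the approach, an idempotent runtime monkey-patch looks roughly like this. `Redistribute` here is a stand-in class, not the real DTensor autograd function, and `apply_patch` is a generic helper, not the PR's `apply_dtensor_patch`:

```python
def apply_patch(cls, attr: str, new_fn) -> None:
    """Replace cls.attr with new_fn, stashing the original under
    _orig_<attr>. A second call is a no-op (idempotent)."""
    if getattr(cls, attr, None) is new_fn:
        return  # already applied
    setattr(cls, f"_orig_{attr}", getattr(cls, attr))
    setattr(cls, attr, new_fn)

class Redistribute:
    """Stand-in for the real twice-differentiable patch target."""
    @staticmethod
    def backward(grad):
        return grad

def patched_backward(grad):
    # Illustrative replacement: delegate to the stashed original, then modify.
    return Redistribute._orig_backward(grad) * 2

apply_patch(Redistribute, "backward", patched_backward)
apply_patch(Redistribute, "backward", patched_backward)  # safe no-op
```

The idempotence check matters because the patch may be applied from several entry points (CLI startup, tests) without double-wrapping the original method.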
data: DataConfig = field(default_factory=DataConfig)
"""Training dataset."""

query: DataConfig = field(default_factory=lambda: DataConfig())
nitpick: you should write field(default_factory=DataConfig) like you did above lol
@norabelrose could you please update the query handling so it works for numbers of queries that aren't divisible by the world size, without dropping data? This should enable the double backward example script to replicate. |
I think once we can replicate the good spearman correlations with the CLI it's basically mergeable |
Actually I'm going to merge now so we can get the wandb logging up, can we please add the query logic in a follow up work? |
My Changes Summary
Things which could go in this PR or a follow-up
Claude Changes Summary
bergson/magic_patch.py: Monkey-patches Redistribute.backward and _ToTorchTensor.backward at runtime to make FSDP redistribution twice-differentiable (Add support for twice-differentiable DTensor redistribution, pytorch/pytorch#160509). Replaces the old magic_wmdp_setup.sh that modified torch source files on disk. Idempotent: call apply_dtensor_patch() before any DTensor double-backward operations.

Test plan