
Conversation

@zhengchenyu
Contributor

The current train_ddp.py has two problems:

  • It cannot guarantee that every sample is read exactly once in order. For example, if the replica group world size is 3 but only 2 replicas are working, some samples will be skipped.
  • When the replica group world size changes, the total batch size used for gradient aggregation changes with it, so the computation is no longer reproducible across restarts.

The following modifications were made:

  • A SkipDistributedSampler is provided so that training can resume from any sample offset (a sketch follows this list).
  • The dataloader is rebuilt whenever the quorum changes.
  • For training rounds that have just been initialized or in which the quorum changed, the commit is abandoned because a dirty flag is set.
  • Added the example train_ddp2.py.
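
The sketch below illustrates the sampler idea from the first bullet. It is a simplified example under my own assumptions (the real SkipDistributedSampler in this PR may take different arguments); the point is that indices are dealt round-robin across replicas after a globally committed skip offset, so a restarted or rescaled quorum resumes exactly where the previous one left off.

from torch.utils.data import Sampler

class SkipDistributedSampler(Sampler):
    # Deals dataset indices round-robin across replicas, starting after a
    # globally committed offset (skip), so no sample is lost or duplicated
    # when the replica world size changes between quorums.
    def __init__(self, dataset_len, num_replicas, rank, skip=0):
        self.dataset_len = dataset_len
        self.num_replicas = num_replicas
        self.rank = rank
        self.skip = skip  # samples already consumed by all replicas combined

    def __iter__(self):
        # Fixed, deterministic order: replica `rank` takes every
        # num_replicas-th index after the skipped prefix.
        return iter(range(self.skip + self.rank, self.dataset_len, self.num_replicas))

    def __len__(self):
        remaining = max(self.dataset_len - self.skip - self.rank, 0)
        return (remaining + self.num_replicas - 1) // self.num_replicas

On a quorum change, the dataloader is rebuilt with the new num_replicas and rank, and skip is set to the number of samples already committed, which is what the second bullet refers to.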

@meta-cla meta-cla bot added the CLA Signed label Nov 18, 2025
@zhengchenyu
Contributor Author

zhengchenyu commented Nov 28, 2025

To supplement the experimental results, adjustments were made to train_ddp_fix_batch.py: NUM_EPOCHS was set to 1 and BATCH_SIZE to 4 to produce more steps, and the same initial model was loaded for every run. In the figure below, the base run trains with a single worker, while the ft run frequently starts and stops workers, keeping the worker count between 1 and 3. The loss curves are almost identical.

[Figure: DDP loss curves, base vs. ft]

Using the same experimental conditions, fsdp2 yielded the following results:

[Figure: FSDP2 loss curves, base vs. ft]

I also applied the same method to DeepSpeed stage 3 and obtained almost identical curves. However, this PR does not involve DeepSpeed, so those results are not shown here.

Note: A slight inconsistency appears in the latter half of the curve. After debugging, I found that this is due to a loss of precision.

@zhengchenyu zhengchenyu marked this pull request as draft December 4, 2025 07:23
@zhengchenyu zhengchenyu closed this Dec 6, 2025
@zhengchenyu zhengchenyu deleted the fix.totat.batch branch December 6, 2025 01:57
@zhengchenyu zhengchenyu reopened this Dec 6, 2025
@zhengchenyu zhengchenyu marked this pull request as ready for review December 6, 2025 02:15
@zhengchenyu
Contributor Author

zhengchenyu commented Dec 6, 2025

In the fsdp2 experiment, I found that get_optimizer_state_dict and set_optimizer_state_dict may call optim.step, which increases the step counter used by the Adam optimizer. An unexpected extra step causes inconsistencies. Therefore, in the fsdp2 experiment, I commented out the call to optim.step in _init_optim_state.
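
For context, here is a minimal, self-contained sketch (an illustration under my own assumptions, not torchft or torch.distributed.checkpoint code) of why that extra call matters: stepping the optimizer with zero gradients still advances Adam's internal step counter, which changes the bias correction applied to later updates.

import torch

model = torch.nn.Linear(4, 1)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
weight = next(model.parameters())

# One real training step populates Adam's per-parameter state.
model(torch.randn(2, 4)).sum().backward()
optim.step()
step_before = float(optim.state[weight]["step"])

# An extra step() with zero gradients (similar in spirit to what state-dict
# helpers do when materializing optimizer state) still bumps the counter.
optim.zero_grad()
for p in model.parameters():
    p.grad = torch.zeros_like(p)
optim.step()
step_after = float(optim.state[weight]["step"])

print(step_before, step_after)  # e.g. 1.0 then 2.0: the counter advanced without real training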

@d4l3k
Member

d4l3k commented Dec 6, 2025

@zhengchenyu this is super cool! I'll take a deeper look when I'm back on Monday

@zhengchenyu
Contributor Author

zhengchenyu commented Dec 6, 2025

I added a sequence diagram to illustrate the start of training and the scale-up from a replica world size of 1 to 2, with local batch = 2 and total batch = 4. Each color represents one iteration of the training loop: the yellow iteration is the start of training, the orange iteration is normal training with a replica world size of 1, the green iteration is the scale-up from 1 to 2, and the blue iteration is normal training with a replica world size of 2.

[Sequence diagram: training start and scale-up from 1 replica to 2]
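
To make the batch arithmetic in the diagram concrete, here is a tiny sketch (variable names are mine, not the PR's) of how the number of gradient-accumulation micro-batches can be derived so that the total batch stays fixed while the replica world size changes:

TOTAL_BATCH = 4
LOCAL_BATCH = 2

for replica_world_size in (1, 2):
    accumulation_steps = TOTAL_BATCH // (LOCAL_BATCH * replica_world_size)
    print(replica_world_size, accumulation_steps)
    # 1 replica  -> accumulate over 2 micro-batches of size 2 (total 4)
    # 2 replicas -> 1 micro-batch of size 2 per replica (total 4 after all-reduce)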

loss.backward()
total_loss += loss.item()

if accumulation_steps > 1:
Contributor Author

@zhengchenyu zhengchenyu Dec 9, 2025


TODO: use loss * (1 / accumulation_steps)?
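
A minimal, self-contained sketch of the scaling the TODO suggests (illustrative names, not the code from this PR): dividing the loss by accumulation_steps before backward() makes the accumulated gradient equal to the gradient of the mean loss over the whole effective batch.

import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.MSELoss()

accumulation_steps = 4
data = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(8)]

total_loss = 0.0
for step, (inputs, labels) in enumerate(data):
    loss = criterion(model(inputs), labels)
    # Scaling before backward() makes the summed micro-batch gradients equal
    # to the gradient of the mean loss over all accumulation_steps batches.
    (loss / accumulation_steps).backward()
    total_loss += loss.item()

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()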
