GEA training improvements: DDP batch size, sampler epoch, smooth head #575

Open

pjreddie wants to merge 1 commit into master from gea-training-improvements

Conversation

@pjreddie (Contributor)

Summary

Three small fixes that came up during GEA ecosystem segmentation finetuning:

  • data_module: batch_size config now means global batch size — automatically divided by world_size in multi-GPU training so the config means the same thing regardless of GPU count
  • lightning_module: Fix distributed sampler epoch shuffling — also call set_epoch on the sampler (not just batch_sampler) so data shuffling varies each epoch in DDP
  • segmentation: Add smooth_sigma option to SegmentationHead — applies differentiable Gaussian blur to logits before loss/softmax (used in GEA smooth head experiments)

Test plan

  • Used in GEA hyperparameter search experiments (p6_smooth_unfreeze, p8_smooth_train, p9_smooth_unfreeze)
  • Multi-GPU training verified with correct per-GPU batch sizes
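
The per-GPU batch sizes checked in the test plan follow from integer division of the configured global batch size by the world size. A minimal sketch (the helper name and the divisibility error are assumptions, not the PR's code):

```python
def per_gpu_batch_size(global_batch_size: int, world_size: int) -> int:
    """Split a global (config-level) batch size evenly across processes."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if global_batch_size % world_size != 0:
        # An uneven split would silently change the effective batch size.
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible "
            f"by world size {world_size}"
        )
    return global_batch_size // world_size
```

Under this convention, `batch_size: 64` in the config yields 16 per GPU on 4 GPUs and 64 on a single GPU.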

🤖 Generated with Claude Code

- data_module: treat batch_size as global and divide by world_size in
  multi-GPU training so config means the same thing regardless of GPU count
- lightning_module: also call set_epoch on the sampler (not just
  batch_sampler) so shuffling varies each epoch in distributed training
- segmentation: add smooth_sigma option to SegmentationHead for
  differentiable Gaussian blur on logits before loss computation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
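
The sampler fix above can be illustrated without torch: a `DistributedSampler`-style sampler derives its shuffle order from `(seed, epoch)`, so a sampler that never receives `set_epoch` repeats the epoch-0 order forever. A minimal stand-in (not the PR's code):

```python
import random

class EpochSeededSampler:
    """Minimal stand-in for a distributed sampler: the shuffle order is a
    deterministic function of (seed, epoch), as in DistributedSampler."""

    def __init__(self, num_samples: int, seed: int = 0):
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # Without this call, every epoch reuses the epoch-0 order.
        self.epoch = epoch

    def __iter__(self):
        order = list(range(self.num_samples))
        random.Random(self.seed + self.epoch).shuffle(order)
        return iter(order)
```

The fix amounts to forwarding `set_epoch` to `train_dataloader.sampler` as well as `train_dataloader.batch_sampler`, since whichever one the dataloader actually uses must see the new epoch.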
Docstring diff under review (`data_module`):

```diff
     path: the dataset path
     path_options: additional options for path to pass to fsspec.
-    batch_size: the batch size
+    batch_size: the total batch size across all GPUs. In multi-GPU
```

Collaborator:

Currently we often set batch_size based on available GPU memory. I don't think the existing option's behavior should change; if desired, you could deprecate it and add per_gpu_batch_size and global_batch_size options to replace it, which should raise an error if neither or both are set.

favyen2 previously approved these changes Mar 30, 2026

@favyen2 (Collaborator) left a comment:

The segmentation and train_dataloader.sampler changes look good to me, but I think the batch_size behavior should either remain the same, or batch_size should be deprecated in favor of local_batch_size and global_batch_size options (with the deprecated batch_size option setting the local batch size).

@favyen2 dismissed their stale review March 30, 2026 17:41

meant to comment not approve
