
Fix DeepSpeed auto batch size crash for DataLoader(batch_size=None) #21669

Open

trentisiete wants to merge 2 commits into Lightning-AI:master from trentisiete:bugfix/19460_deepspeed-batch-sampler-none

Conversation


trentisiete commented Apr 19, 2026

What does this PR do?

Fixes #19460.

DeepSpeedStrategy._auto_select_batch_size checks hasattr(train_dataloader, "batch_sampler") and then dereferences train_dataloader.batch_sampler.batch_size. When the user passes DataLoader(batch_size=None) (common for iterable datasets that yield pre-batched tensors), PyTorch sets batch_sampler to None rather than omitting the attribute, so hasattr still returns True and the dereference raises AttributeError: 'NoneType' object has no attribute 'batch_size'.
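For reference, the crash condition reproduces with plain PyTorch, no Lightning required (a minimal sketch; the dataset is a stand-in for any iterable source that yields pre-batched tensors):

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class PreBatched(IterableDataset):
    """Yields already-batched tensors, so automatic batching is disabled."""

    def __iter__(self):
        yield torch.zeros(8, 4)


loader = DataLoader(PreBatched(), batch_size=None)

assert hasattr(loader, "batch_sampler")  # True: the attribute exists...
assert loader.batch_sampler is None      # ...but its value is None
loader.batch_sampler.batch_size          # AttributeError: 'NoneType' object has no attribute 'batch_size'
```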

Before #19209, the code was wrapped in a broad try/except that silently swallowed the error and fell back to 1. Per @awaelchli's suggestion on the issue, this PR replaces that with an explicit None check and a rank_zero_warn pointing the user to DeepSpeedStrategy(logging_batch_size_per_gpu=...) so they can set the logging batch size themselves if the default of 1 is not appropriate.
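A sketch of the shape of the change (paraphrased from the description above, not the literal diff; train_dataloader is assumed to be already bound inside _auto_select_batch_size, and the warning text is illustrative):

```python
from lightning.pytorch.utilities import rank_zero_warn

batch_size = 1
batch_sampler = getattr(train_dataloader, "batch_sampler", None)
if batch_sampler is not None:
    batch_size = batch_sampler.batch_size
else:
    # DataLoader(batch_size=None) disables automatic batching, so there is
    # nothing to infer; fall back to 1 and tell the user how to override it.
    rank_zero_warn(
        "Could not infer the batch size from the train dataloader because it has no"
        " batch sampler (e.g. `DataLoader(batch_size=None)`). DeepSpeed will log a"
        " batch size of 1; pass `DeepSpeedStrategy(logging_batch_size_per_gpu=...)`"
        " to set it explicitly."
    )
```

And the escape hatch the warning points at, for users who know their effective per-GPU batch size (the value 8 is illustrative):

```python
import lightning.pytorch as pl
from lightning.pytorch.strategies import DeepSpeedStrategy

trainer = pl.Trainer(strategy=DeepSpeedStrategy(logging_batch_size_per_gpu=8))
```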

Added a unit test that mocks the data source to return DataLoader(batch_size=None) and asserts both the returned value and the warning. The test is gated on @RunIf(deepspeed=True) like the rest of the file.
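A hedged sketch of the test's shape; the private attribute chain used to reach the train dataloader (trainer.fit_loop._data_source) and the mock seam are assumptions about Lightning internals, and the actual test in the PR may differ:

```python
from unittest.mock import MagicMock

import pytest
from torch.utils.data import DataLoader

from lightning.pytorch.strategies import DeepSpeedStrategy
from tests_pytorch.helpers.runif import RunIf  # assumption: the test lives alongside the other DeepSpeed tests


@RunIf(deepspeed=True)
def test_deepspeed_auto_batch_size_none_batch_sampler():
    strategy = DeepSpeedStrategy()
    strategy._lightning_module = MagicMock()  # MagicMock auto-resolves the trainer attribute chain

    # Hand the strategy a DataLoader whose batch_sampler is None
    source = strategy.lightning_module.trainer.fit_loop._data_source  # assumed internal path
    source.is_defined.return_value = True
    source.dataloader.return_value = DataLoader(range(8), batch_size=None)

    with pytest.warns(UserWarning, match="logging_batch_size_per_gpu"):
        assert strategy._auto_select_batch_size() == 1
```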

Before submitting
  • Was this discussed/agreed via a GitHub issue? Yes: batch_sampler.batch_size is None with deepspeed and DataLoader(batch_size=None) #19460 (a maintainer invited a PR).
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? N/A — internal method, no public API change.
  • Did you write any new necessary tests? Yes, test_deepspeed_auto_batch_size_none_batch_sampler.
  • Did you verify new and existing tests pass locally with your changes? Partially: I ran ruff/mypy on the changed files. The new test is skipped on my machine (Windows, no deepspeed), like the rest of the file, so I also reproduced the regression and verified the fix path by executing _auto_select_batch_size directly on a DataLoader(batch_size=None).
  • Did you list all the breaking changes introduced by this pull request? None.
  • Did you update the CHANGELOG? Yes, under ### Fixed in src/lightning/pytorch/CHANGELOG.md.

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--21669.org.readthedocs.build/en/21669/

DataLoader(batch_size=None) sets batch_sampler to None, but the previous hasattr check still returned True and dereferencing .batch_size raised AttributeError. Use an explicit None check and warn when the batch size cannot be inferred, falling back to 1 (matching the previous behavior before Lightning-AI#19209).

Fixes Lightning-AI#19460
github-actions bot added the pl (Generic label for PyTorch Lightning package) label Apr 19, 2026

codecov Bot commented Apr 19, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (efb7328) to head (cdc8468).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (efb7328) and HEAD (cdc8468). Click for more details.

HEAD has 2418 uploads less than BASE
Flag                BASE (efb7328)   HEAD (cdc8468)
cpu                 584              42
python              42               3
lightning_fabric    187              0
pytest              292              0
python3.12          167              12
lightning           210              15
python3.11          84               6
python3.10          42               3
python3.13          123              9
python3.12.7        126              9
pytorch2.1          42               6
pytorch_lightning   187              27
pytest-full         292              42
pytorch2.7          21               3
pytorch2.10         41               6
pytorch2.8          41               6
pytorch2.2.2        21               3
pytorch2.3          21               3
pytorch2.5.1        21               3
pytorch2.9          42               6
pytorch2.4.1        21               3
pytorch2.6          21               3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21669     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         270      267      -3     
  Lines       23973    23916     -57     
=========================================
- Hits        20751    18810   -1941     
- Misses       3222     5106   +1884     


Labels

pl (Generic label for PyTorch Lightning package)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

batch_sampler.batch_size is None with deepspeed and DataLoader(batch_size=None)

1 participant