Commit af8259a

Merge branch 'master' into feat/device-stats-monitor-non-expert-mode
2 parents 7e2643d + 2849907 commit af8259a

14 files changed

Lines changed: 181 additions & 14 deletions

.github/workflows/_legacy-checkpoints.yml

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ jobs:
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

       - name: Install uv and set Python version
-        uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0
+        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0
         with:
           python-version: "3.10"
           # TODO: Avoid activating environment like this

.github/workflows/ci-tests-fabric.yml

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ jobs:
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

       - name: Install uv and set Python version
-        uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0
+        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0
         with:
           python-version: ${{ matrix.config.python-version || '3.10' }}
           # TODO: Avoid activating environment like this

.github/workflows/ci-tests-pytorch.yml

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ jobs:
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

       - name: Install uv and set Python version
-        uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0
+        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0
         with:
           python-version: ${{ matrix.config.python-version || '3.10' }}
           # TODO: Avoid activating environment like this

.github/workflows/code-checks.yml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ jobs:
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

       - name: Install uv and set Python version
-        uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0
+        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0
         with:
           python-version: "3.11"
           # TODO: Avoid activating environment like this

.github/workflows/docs-build.yml

Lines changed: 1 addition & 1 deletion
@@ -74,7 +74,7 @@ jobs:
           lfs: ${{ matrix.pkg-name == 'pytorch' }}

       - name: Install uv and set Python version
-        uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0
+        uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0
         with:
           python-version: "3.10"
           # TODO: Avoid activating environment like this

.github/workflows/release-pkg.yml

Lines changed: 1 addition & 1 deletion
@@ -154,7 +154,7 @@ jobs:

       - name: Publish distribution 📦 to PyPI
         # pypa/gh-action-pypi-publish v1.13.0
-        uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e
+        uses: pypa/gh-action-pypi-publish@cef221092ed1bacb1cc03d23a2d87d1d172e277b
         with:
           packages_dir: dist/${{ steps.folder.outputs.pkg }}
           verbose: true

src/lightning/pytorch/CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -24,8 +24,12 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 ### Fixed

+- Fixed non-zero process exits in `CombinedLoader.reset()` with large tensors and persistent spawned workers by avoiding explicit `_shutdown_workers()` calls and relying on iterator cleanup via `del` [#21708](https://github.com/Lightning-AI/pytorch-lightning/issues/21708)
+
 - Fixed `SIGTERMException` producing a zero exit code instead of 143 (128 + SIGTERM) ([#21623](https://github.com/Lightning-AI/pytorch-lightning/issues/21623))

+- Fixed `LightningModule.toggle_optimizer` / `untoggle_optimizer` breaking under `torch.compile` by disabling Dynamo tracing on these bookkeeping helpers ([#21513](https://github.com/Lightning-AI/pytorch-lightning/issues/21513))
+
 ---

 ## [2.6.4] - 2026-05-20

src/lightning/pytorch/callbacks/device_stats_monitor.py

Lines changed: 41 additions & 1 deletion
@@ -27,6 +27,7 @@
 from lightning.pytorch.accelerators.cpu import _PSUTIL_AVAILABLE
 from lightning.pytorch.callbacks.callback import Callback
 from lightning.pytorch.utilities.exceptions import MisconfigurationException
+from lightning.pytorch.utilities.rank_zero import rank_zero_warn
 from lightning.pytorch.utilities.types import STEP_OUTPUT

 _CORE_DEVICE_STATS_KEYS = frozenset([
@@ -120,6 +121,12 @@ class DeviceStatsMonitor(Callback):
         verbose: if ``True``, logs all available device stats returned by the accelerator.
             If ``False``, logs only a core set of metrics (memory usage, CPU utilization)
             that are most relevant for monitoring training health. Defaults to ``True``.
+        filter_keys: if ``None``, all stats returned by the accelerator are logged.
+            If a ``set`` of strings is provided, only the keys present in the set will be logged.
+            Keys are matched against the base metric names before prefixing (e.g.,
+            ``"cpu_percent"`` not ``"DeviceStatsMonitor.on_train_batch_end/cpu_percent"``).
+            A ``rank_zero_warn`` is emitted for any key in ``filter_keys`` not found in the
+            collected stats, which helps catch typos early.

     Raises:
         MisconfigurationException:
@@ -131,13 +138,29 @@ class DeviceStatsMonitor(Callback):

         from lightning import Trainer
         from lightning.pytorch.callbacks import DeviceStatsMonitor
+
+        # log all stats (default behaviour)
         device_stats = DeviceStatsMonitor()
         trainer = Trainer(callbacks=[device_stats])

+        # log only peak and current allocated GPU memory
+        device_stats = DeviceStatsMonitor(
+            filter_keys={"allocated_bytes.all.current", "allocated_bytes.all.peak"}
+        )
+        trainer = Trainer(callbacks=[device_stats])
+
+        # log CPU stats alongside a subset of GPU memory stats
+        device_stats = DeviceStatsMonitor(
+            cpu_stats=True,
+            filter_keys={"cpu_percent", "allocated_bytes.all.current"},
+        )
+        trainer = Trainer(callbacks=[device_stats])
+
     """

-    def __init__(self, cpu_stats: Optional[bool] = None, verbose: bool = False) -> None:
+    def __init__(self, cpu_stats: Optional[bool] = None, filter_keys: Optional[set[str]] = None, verbose: bool = False) -> None:
         self._cpu_stats = cpu_stats
+        self._filter_keys = filter_keys
         self._verbose = verbose

     @override
@@ -160,6 +183,21 @@ def setup(
                 f"`DeviceStatsMonitor` cannot log CPU stats as `psutil` is not installed. {str(_PSUTIL_AVAILABLE)} "
             )

+        if self._filter_keys is not None:
+            device_stats = trainer.accelerator.get_device_stats(device)
+            if self._cpu_stats and device.type != "cpu":
+                from lightning.pytorch.accelerators.cpu import get_cpu_stats
+
+                device_stats.update(get_cpu_stats())
+
+            unrecognized = self._filter_keys - device_stats.keys()
+            if unrecognized:
+                rank_zero_warn(
+                    f"`DeviceStatsMonitor` filter_keys contains keys not found in device stats and will be ignored:"
+                    f" {unrecognized}"
+                )
+
+
     @staticmethod
     def _filter_core_device_stats(stats: dict[str, float]) -> dict[str, float]:
         return {
@@ -187,6 +225,8 @@ def _get_and_log_device_stats(self, trainer: "pl.Trainer", key: str) -> None:

         if not self._verbose:
             device_stats = self._filter_core_device_stats(device_stats)
+        if self._filter_keys is not None:
+            device_stats = {k: v for k, v in device_stats.items() if k in self._filter_keys}

         for logger in trainer.loggers:
             separator = logger.group_separator
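
Reviewer sketch: the `filter_keys` logic in this file can be tried out without Lightning. This is only an illustration under assumptions: `filter_stats` and `unrecognized_keys` are hypothetical helper names that do not exist in the library, and `sample_stats` is made-up data rather than real accelerator output.

```python
from typing import Optional


def unrecognized_keys(filter_keys: Optional[set[str]], stats: dict[str, float]) -> set[str]:
    """Return requested keys that are absent from the collected stats (hypothetical helper)."""
    if filter_keys is None:
        return set()
    # Same set difference the diff computes in ``setup`` before emitting the warning.
    return filter_keys - stats.keys()


def filter_stats(filter_keys: Optional[set[str]], stats: dict[str, float]) -> dict[str, float]:
    """Keep only the requested keys; ``None`` means no filtering (hypothetical helper)."""
    if filter_keys is None:
        return dict(stats)
    return {k: v for k, v in stats.items() if k in filter_keys}


# Made-up sample data, not real accelerator output.
sample_stats = {
    "cpu_percent": 12.5,
    "allocated_bytes.all.current": 1024.0,
    "allocated_bytes.all.peak": 2048.0,
}
# "cpu_precent" is a deliberate typo to show what the warning would catch.
requested = {"cpu_percent", "allocated_bytes.all.current", "cpu_precent"}

kept = filter_stats(requested, sample_stats)
missing = unrecognized_keys(requested, sample_stats)
print(kept)
print(missing)
```

Note that in the diff this filter runs after the `verbose=False` core-metric filter, so a key dropped by that first filter is not logged even if it appears in `filter_keys`.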

src/lightning/pytorch/core/module.py

Lines changed: 16 additions & 0 deletions
@@ -1136,12 +1136,24 @@ def backward(self, loss):
         else:
             loss.backward(*args, **kwargs)

+    @torch.compiler.disable
     def toggle_optimizer(self, optimizer: Union[Optimizer, LightningOptimizer]) -> None:
         """Makes sure only the gradients of the current optimizer's parameters are calculated in the training step to
         prevent dangling gradients in multiple-optimizer setup.

         It works with :meth:`untoggle_optimizer` to make sure ``param_requires_grad_state`` is properly reset.

+        .. note::
+            This method is decorated with :func:`torch.compiler.disable` so that it is executed as regular
+            Python when the ``LightningModule`` is wrapped with :func:`torch.compile`. Mutating
+            ``requires_grad`` on parameters is not supported by Dynamo/AOTAutograd (it can change a
+            tensor's leaf-ness mid-graph), so tracing this bookkeeping helper would either fail with
+            ``Unsupported: setattr() on Tensor.requires_grad`` or produce a ``KeyError`` on the
+            internal ``param_requires_grad_state`` mapping when the traced parameter references diverge
+            from those held by ``trainer.optimizers``. Disabling the compiler on this method keeps the
+            behavior identical for eager users while making it safe to call from a compiled
+            ``training_step``.
+
         Args:
             optimizer: The optimizer to toggle.

@@ -1165,9 +1177,13 @@ def toggle_optimizer(self, optimizer: Union[Optimizer, LightningOptimizer]) -> N
                     param.requires_grad = param_requires_grad_state[param]
         self._param_requires_grad_state = param_requires_grad_state

+    @torch.compiler.disable
     def untoggle_optimizer(self, optimizer: Union[Optimizer, LightningOptimizer]) -> None:
         """Resets the state of required gradients that were toggled with :meth:`toggle_optimizer`.

+        See :meth:`toggle_optimizer` for details on why this method is decorated with
+        :func:`torch.compiler.disable`.
+
         Args:
             optimizer: The optimizer to untoggle.

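
Reviewer sketch: the note above says these helpers mutate `requires_grad` as bookkeeping. A torch-free model of that save-and-restore pattern may make it clearer what is being kept out of the compiled graph. Everything here is an assumption for illustration: `FakeParam`, `toggle` and `untoggle` are hypothetical names, each optimizer is modeled as a plain list of parameters, and this is not Lightning's implementation.

```python
class FakeParam:
    """Stand-in for a tensor parameter; it only tracks ``requires_grad`` (hypothetical)."""

    def __init__(self, requires_grad: bool = True) -> None:
        self.requires_grad = requires_grad


def toggle(all_optimizers: list, current: list) -> dict:
    """Freeze every parameter not owned by ``current`` and return the saved flags."""
    saved = {}
    for params in all_optimizers:
        for p in params:
            if p in saved:
                continue  # parameter shared between optimizers, already recorded
            saved[p] = p.requires_grad
            p.requires_grad = False
    # Re-enable the current optimizer's parameters according to their saved state.
    for p in current:
        p.requires_grad = saved[p]
    return saved


def untoggle(saved: dict) -> None:
    """Restore the flags recorded by ``toggle``."""
    for p, flag in saved.items():
        p.requires_grad = flag


# Two "optimizers": a generator with one frozen parameter, and a discriminator.
gen = [FakeParam(True), FakeParam(False)]
disc = [FakeParam(True)]

saved = toggle([gen, disc], gen)
after_toggle = [p.requires_grad for p in gen + disc]
untoggle(saved)
after_untoggle = [p.requires_grad for p in gen + disc]
print(after_toggle, after_untoggle)
```

The attribute writes on `requires_grad` are the part that, per the docstring note, Dynamo cannot trace, which is why the commit keeps both helpers in eager Python.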

src/lightning/pytorch/utilities/combined_loader.py

Lines changed: 1 addition & 1 deletion
@@ -397,7 +397,7 @@ def _load_state_dicts(self, states: list[dict[str, Any]]) -> None:
 def _shutdown_workers_and_reset_iterator(dataloader: object) -> None:
     if hasattr(dataloader, "_iterator"):
         if isinstance(dataloader._iterator, _MultiProcessingDataLoaderIter):
-            dataloader._iterator._shutdown_workers()
+            del dataloader._iterator
         dataloader._iterator = None


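
Reviewer sketch: the change above swaps an explicit `_shutdown_workers()` call for `del`, relying on the iterator's own finalizer. The stdlib-only model below shows that pattern. It assumes CPython reference counting (the finalizer runs as soon as the last reference is dropped) and an iterator whose `__del__` performs the shutdown. `FakeWorkerIterator`, `FakeLoader` and `reset_iterator` are hypothetical stand-ins, not PyTorch or Lightning classes.

```python
class FakeWorkerIterator:
    """Hypothetical stand-in for a multi-process iterator that cleans up in ``__del__``."""

    def __init__(self, log: list) -> None:
        self._log = log

    def __del__(self) -> None:
        # In the real iterator this is where worker shutdown would happen.
        self._log.append("workers shut down")


class FakeLoader:
    """Hypothetical stand-in for a dataloader holding a cached ``_iterator``."""


def reset_iterator(loader: object) -> None:
    if hasattr(loader, "_iterator"):
        if isinstance(loader._iterator, FakeWorkerIterator):
            # Drop the only reference; on CPython this triggers ``__del__`` immediately.
            del loader._iterator
        # Re-create the attribute so later ``hasattr`` checks still succeed.
        loader._iterator = None


events = []
loader = FakeLoader()
loader._iterator = FakeWorkerIterator(events)
reset_iterator(loader)
print(events, loader._iterator)
```

On interpreters without reference counting the finalizer may run later, so this only demonstrates the ordering on CPython.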
