feat: add verbose flag to DeviceStatsMonitor for non-expert mode #21706
deependujha wants to merge 2 commits into Lightning-AI:master
Pull request overview
Adds a verbose flag to DeviceStatsMonitor to support a “non-expert” logging mode that reduces the amount of device statistics emitted to loggers, addressing issue #18652.
Changes:
- Added a `verbose` parameter to `DeviceStatsMonitor` (default `True`) and core-metric filtering logic for `verbose=False`.
- Introduced module-level constants defining the “core” metrics and a helper to filter stats dictionaries.
- Added parameterized tests covering `verbose=True/False` behavior for CPU and CUDA.
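As a rough illustration of the filtering idea described above, the following standalone sketch keeps only "core" entries from a stats dictionary when `verbose` is false. It is not the PR's code: the names `CORE_KEYS`, `CORE_PREFIXES`, `filter_core_stats`, and `select_stats` are hypothetical, and the metric names in the sets are illustrative assumptions, since the actual contents of `_CORE_DEVICE_STATS_KEYS` and `_CORE_TPU_STATS_PREFIXES` are not shown in this conversation.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical stand-ins for the PR's module-level constants.
# The metric names below are illustrative assumptions only.
CORE_KEYS: FrozenSet[str] = frozenset({"cpu_percent", "allocated_bytes.all.current"})
CORE_PREFIXES: Tuple[str, ...] = ("memory.",)


def filter_core_stats(
    stats: Dict[str, float],
    core_keys: FrozenSet[str] = CORE_KEYS,
    core_prefixes: Tuple[str, ...] = CORE_PREFIXES,
) -> Dict[str, float]:
    """Keep entries whose name is a core key or starts with a core prefix."""
    return {
        name: value
        for name, value in stats.items()
        if name in core_keys or name.startswith(core_prefixes)
    }


def select_stats(stats: Dict[str, float], verbose: bool = True) -> Dict[str, float]:
    """Return every stat when verbose, otherwise only the core subset."""
    return dict(stats) if verbose else filter_core_stats(stats)


raw_stats = {
    "cpu_percent": 12.5,
    "allocated_bytes.all.current": 1024.0,
    "num_alloc_retries": 0.0,
    "memory.hbm_used": 2048.0,
}
core_only = select_stats(raw_stats, verbose=False)
everything = select_stats(raw_stats, verbose=True)
```

Defaulting `verbose` to `True` in the sketch mirrors the PR's stated goal of preserving existing behavior: callers that do not pass the flag still receive the full dictionary.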
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/lightning/pytorch/callbacks/device_stats_monitor.py | Adds verbose flag and core-metric filtering for device stats logging. |
| tests/tests_pytorch/callbacks/test_device_stats_monitor.py | Adds new parameterized tests validating verbose vs non-verbose logging on CPU and CUDA. |
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #21706 +/- ##
=========================================
- Coverage 87% 79% -8%
=========================================
Files 270 267 -3
Lines 23973 23922 -51
=========================================
- Hits 20748 18809 -1939
- Misses 3225 5113 +1888
What does this PR do?
Fixes #18652
`DeviceStatsMonitor` previously dumped all device stats unconditionally, overwhelming beginner users unfamiliar with PyTorch allocator internals.
Added a `verbose` flag (default `True` to preserve existing behavior).
When `verbose=False`, only core metrics are logged: memory usage and CPU utilization for CUDA/CPU, and HBM memory metrics for TPU.
Changes:
- `_CORE_DEVICE_STATS_KEYS` and `_CORE_TPU_STATS_PREFIXES` module-level constants
- `_filter_core_device_stats` static method to `DeviceStatsMonitor`
- `verbose` parameter to `__init__` with updated docstring
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--21706.org.readthedocs.build/en/21706/