
Conversation

@hexinw-nvidia (Contributor)

It can be enabled via "--ft-enable-gpu-memory-check=true".

rhewett-nv previously approved these changes Oct 21, 2025
rhewett-nv previously approved these changes Oct 22, 2025
@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label Oct 29, 2025
Reads from SLURM_PROCID (in SLURM environments) or GROUP_RANK (set by the launcher). Previous
rank assignments are ignored to ensure consistency with the infrastructure's rank assignment.
Note: Hot spare/redundancy is NOT supported with this setting. Default: True.
* `enable_gpu_memory_check` - If True, log GPU memory usage after worker shutdown to detect
  potential GPU memory leaks across restarts.
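As a point of reference, a minimal sketch of what such a post-shutdown memory dump could look like, assuming the nvidia-ml-py (pynvml) bindings; the function name and log format are illustrative and not the PR's actual code:

```python
# Sketch only: per-GPU used-memory logging after worker shutdown.
# Assumes nvidia-ml-py (pynvml); names and log format are illustrative.
import logging

import pynvml

logger = logging.getLogger(__name__)


def log_gpu_memory_usage(tag: str) -> list[int]:
    """Log per-GPU used memory (bytes) and return it for later comparison."""
    pynvml.nvmlInit()
    try:
        used = []
        for idx in range(pynvml.nvmlDeviceGetCount()):
            info = pynvml.nvmlDeviceGetMemoryInfo(
                pynvml.nvmlDeviceGetHandleByIndex(idx)
            )
            used.append(info.used)
            logger.info(
                "[%s] GPU %d memory used: %d MiB", tag, idx, info.used // (1024 * 1024)
            )
        return used
    finally:
        pynvml.nvmlShutdown()
```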
@apaithankar (Contributor) commented on Oct 29, 2025:

  • Should we also log (and keep track of) GPU memory before the start of the first cycle?
  • If used GPU memory differs between the start and the end of the training process, how is the user alerted? I see that a log line is printed, but is that sufficient?
  • Should this be integrated into alerting?

@hexinw-nvidia (Contributor, Author) replied:

The GPU memory dump happens on every restart, so an offline log-processing tool can parse the logs and track GPU memory changes over time; there is no need to log GPU memory before the start of the first cycle.
An alert can be added to notify the user of a potential GPU memory leak during in-job restarts if the memory stats change over time.
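To make the offline-processing idea concrete, here is a hypothetical sketch that scans a launcher log for such per-restart dumps and flags growth; the log line format, threshold, and script structure are assumptions, not part of this PR:

```python
# Hypothetical offline log scan: collect per-restart GPU memory dumps and
# flag sustained growth. The log line format below is assumed.
import re
import sys
from collections import defaultdict

LINE_RE = re.compile(r"\[post-shutdown\] GPU (\d+) memory used: (\d+) MiB")


def scan(log_path: str, growth_threshold_mib: int = 256) -> None:
    history = defaultdict(list)  # GPU index -> used MiB, one entry per restart
    with open(log_path) as fh:
        for line in fh:
            match = LINE_RE.search(line)
            if match:
                history[int(match.group(1))].append(int(match.group(2)))

    for gpu, samples in sorted(history.items()):
        if len(samples) >= 2 and samples[-1] - samples[0] > growth_threshold_mib:
            print(
                f"GPU {gpu}: possible leak, used memory grew "
                f"{samples[0]} -> {samples[-1]} MiB over {len(samples)} restarts"
            )


if __name__ == "__main__":
    scan(sys.argv[1])
```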

Contributor replied:

> There is no need to log the GPU memory before the start of the first cycle.

I disagree. What will you compare against to know whether there was a leak? The first-cycle memory dump records the vanilla state.

@hexinw-nvidia (Contributor, Author) replied:

Please review the latest change, specifically _capture_baseline_gpu_memory(). A logger.debug message prints the memory stats on the first cycle.
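For readers without the diff open, a rough sketch of a first-cycle baseline capture in the spirit of _capture_baseline_gpu_memory(); apart from that method name, every identifier here is an assumption:

```python
# Rough sketch of a one-time baseline capture; only the method name
# _capture_baseline_gpu_memory comes from the PR discussion.
import logging

import pynvml

logger = logging.getLogger(__name__)


class GPUMemoryMonitor:
    def __init__(self) -> None:
        self._baseline: list[int] | None = None

    def _capture_baseline_gpu_memory(self) -> None:
        """Record per-GPU used memory once, before the first cycle starts."""
        if self._baseline is not None:
            return  # baseline is captured only on the first cycle
        pynvml.nvmlInit()
        try:
            self._baseline = [
                pynvml.nvmlDeviceGetMemoryInfo(
                    pynvml.nvmlDeviceGetHandleByIndex(i)
                ).used
                for i in range(pynvml.nvmlDeviceGetCount())
            ]
        finally:
            pynvml.nvmlShutdown()
        logger.debug("Baseline GPU memory (bytes): %s", self._baseline)
```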

Commit: …kers.

Wait for GPU memory to return to baseline levels before starting new workers
after restart. Controlled by --ft-gpu-memory-reclaim-timeout (default 10 s).

This helps prevent OOM errors caused by incomplete memory cleanup between restarts.
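A sketch of the reclaim wait this commit describes; only the 10 s default mirrors --ft-gpu-memory-reclaim-timeout, while the function name, slack margin, and polling interval are assumptions:

```python
# Illustrative polling loop: wait (up to a timeout) for per-GPU used memory
# to drop back near a previously captured baseline before restarting workers.
import logging
import time

import pynvml

logger = logging.getLogger(__name__)


def wait_for_gpu_memory_reclaim(
    baseline_used: list[int],
    timeout_s: float = 10.0,              # mirrors the default reclaim timeout
    poll_interval_s: float = 0.5,         # assumed polling cadence
    slack_bytes: int = 64 * 1024 * 1024,  # assumed tolerance above baseline
) -> bool:
    """Return True once every GPU is within `slack_bytes` of its baseline."""
    deadline = time.monotonic() + timeout_s
    pynvml.nvmlInit()
    try:
        while time.monotonic() < deadline:
            current = [
                pynvml.nvmlDeviceGetMemoryInfo(
                    pynvml.nvmlDeviceGetHandleByIndex(i)
                ).used
                for i in range(len(baseline_used))
            ]
            if all(c <= b + slack_bytes for c, b in zip(current, baseline_used)):
                return True
            time.sleep(poll_interval_s)
        logger.warning("GPU memory did not return to baseline within %.1f s", timeout_s)
        return False
    finally:
        pynvml.nvmlShutdown()
```

On timeout this sketch only warns and proceeds; whether the launcher should instead abort or raise an alert ties back to the alerting question discussed above.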