
Conversation

@hexinw-nvidia (Contributor)

It can be enabled via "--ft-enable-gpu-memory-check=true".

rhewett-nv previously approved these changes Oct 21, 2025
rhewett-nv previously approved these changes Oct 22, 2025
@hexinw-nvidia hexinw-nvidia added the ci-approved Approved to run CI label Oct 29, 2025
Reads from SLURM_PROCID (in SLURM environments) or GROUP_RANK (set by the launcher). Previous
rank assignments are ignored to ensure consistency with the infrastructure's rank assignment.
Note: Hot spare/redundancy is NOT supported with this setting. Default: True.
* `enable_gpu_memory_check` - If True, log GPU memory usage after worker shutdown to detect
  potential GPU memory leaks across restarts.
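As a point of reference, a minimal sketch of what such a post-shutdown memory dump could look like, assuming the nvidia-ml-py (pynvml) bindings; the function name and log format are illustrative and not the PR's actual code:

```python
# Sketch only: per-GPU used-memory logging after worker shutdown.
# Assumes nvidia-ml-py (pynvml); names and log format are illustrative.
import logging

import pynvml

logger = logging.getLogger(__name__)


def log_gpu_memory_usage(tag: str) -> list[int]:
    """Log per-GPU used memory (bytes) and return it for later comparison."""
    pynvml.nvmlInit()
    try:
        used = []
        for idx in range(pynvml.nvmlDeviceGetCount()):
            info = pynvml.nvmlDeviceGetMemoryInfo(
                pynvml.nvmlDeviceGetHandleByIndex(idx)
            )
            used.append(info.used)
            logger.info(
                "[%s] GPU %d memory used: %d MiB", tag, idx, info.used // (1024 * 1024)
            )
        return used
    finally:
        pynvml.nvmlShutdown()
```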
@apaithankar (Contributor) commented on Oct 29, 2025:

  • Should we also log (and keep track of) GPU memory before the start of the first cycle?
  • If used GPU memory differs between the start and the end of the training process, how is the user alerted? I see that a log line is printed, but is that sufficient?
  • Should this be integrated into alerting?

@hexinw-nvidia (Contributor, Author) replied:

The GPU memory dump happens on every restart, so an offline log-processing tool can parse the logs and track GPU memory changes over time; there is no need to log GPU memory before the start of the first cycle.
An alert can be added to notify the user of a potential GPU memory leak during in-job restarts if the memory stats change over time.
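To make the offline-processing idea concrete, here is a hypothetical sketch that scans a launcher log for such per-restart dumps and flags growth; the log line format, threshold, and script structure are assumptions, not part of this PR:

```python
# Hypothetical offline log scan: collect per-restart GPU memory dumps and
# flag sustained growth. The log line format below is assumed.
import re
import sys
from collections import defaultdict

LINE_RE = re.compile(r"\[post-shutdown\] GPU (\d+) memory used: (\d+) MiB")


def scan(log_path: str, growth_threshold_mib: int = 256) -> None:
    history = defaultdict(list)  # GPU index -> used MiB, one entry per restart
    with open(log_path) as fh:
        for line in fh:
            match = LINE_RE.search(line)
            if match:
                history[int(match.group(1))].append(int(match.group(2)))

    for gpu, samples in sorted(history.items()):
        if len(samples) >= 2 and samples[-1] - samples[0] > growth_threshold_mib:
            print(
                f"GPU {gpu}: possible leak, used memory grew "
                f"{samples[0]} -> {samples[-1]} MiB over {len(samples)} restarts"
            )


if __name__ == "__main__":
    scan(sys.argv[1])
```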

Contributor replied:

> There is no need to log the GPU memory before the start of the first cycle.

I disagree. What will you compare against to know whether there was a leak? The first-cycle memory dump records the vanilla state.

@hexinw-nvidia (Contributor, Author) replied:

Please review the latest change, specifically _capture_baseline_gpu_memory(). A logger.debug message prints the memory stats on the first cycle.
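For readers without the diff open, a rough sketch of a first-cycle baseline capture in the spirit of _capture_baseline_gpu_memory(); apart from that method name, every identifier here is an assumption:

```python
# Rough sketch of a one-time baseline capture; only the method name
# _capture_baseline_gpu_memory comes from the PR discussion.
import logging

import pynvml

logger = logging.getLogger(__name__)


class GPUMemoryMonitor:
    def __init__(self) -> None:
        self._baseline: list[int] | None = None

    def _capture_baseline_gpu_memory(self) -> None:
        """Record per-GPU used memory once, before the first cycle starts."""
        if self._baseline is not None:
            return  # baseline is captured only on the first cycle
        pynvml.nvmlInit()
        try:
            self._baseline = [
                pynvml.nvmlDeviceGetMemoryInfo(
                    pynvml.nvmlDeviceGetHandleByIndex(i)
                ).used
                for i in range(pynvml.nvmlDeviceGetCount())
            ]
        finally:
            pynvml.nvmlShutdown()
        logger.debug("Baseline GPU memory (bytes): %s", self._baseline)
```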

Commit: …kers.

Wait for GPU memory to return to baseline levels before starting new workers
after restart. Controlled by --ft-gpu-memory-reclaim-timeout (default 10 s).

This helps prevent OOM errors caused by incomplete memory cleanup between restarts.
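A sketch of the reclaim wait this commit describes; only the 10 s default mirrors --ft-gpu-memory-reclaim-timeout, while the function name, slack margin, and polling interval are assumptions:

```python
# Illustrative polling loop: wait (up to a timeout) for per-GPU used memory
# to drop back near a previously captured baseline before restarting workers.
import logging
import time

import pynvml

logger = logging.getLogger(__name__)


def wait_for_gpu_memory_reclaim(
    baseline_used: list[int],
    timeout_s: float = 10.0,              # mirrors the default reclaim timeout
    poll_interval_s: float = 0.5,         # assumed polling cadence
    slack_bytes: int = 64 * 1024 * 1024,  # assumed tolerance above baseline
) -> bool:
    """Return True once every GPU is within `slack_bytes` of its baseline."""
    deadline = time.monotonic() + timeout_s
    pynvml.nvmlInit()
    try:
        while time.monotonic() < deadline:
            current = [
                pynvml.nvmlDeviceGetMemoryInfo(
                    pynvml.nvmlDeviceGetHandleByIndex(i)
                ).used
                for i in range(len(baseline_used))
            ]
            if all(c <= b + slack_bytes for c, b in zip(current, baseline_used)):
                return True
            time.sleep(poll_interval_s)
        logger.warning("GPU memory did not return to baseline within %.1f s", timeout_s)
        return False
    finally:
        pynvml.nvmlShutdown()
```

On timeout this sketch only warns and proceeds; whether the launcher should instead abort or raise an alert ties back to the alerting question discussed above.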