Added GPU memory logger. #206
base: main
Conversation
Reads from SLURM_PROCID (in SLURM environments) or GROUP_RANK (set by launcher). Previous
rank assignments are ignored to ensure consistency with infrastructure's rank assignment.
Note: Hot spare/redundancy is NOT supported with this setting. Default: True.
* `enable_gpu_memory_check` - If True, log GPU memory usage after worker shutdown to detect
- Should we also log (and keep track of) the GPU memory before the start of the first cycle?
- If there is a difference between the GPU memory used before the start and after the end of the training process, how does this alert the user? I see that it prints a log, but is that sufficient?
- Should this be integrated into alerting?
The GPU memory dump will happen on every restart. An offline log-processing tool can parse the log to look for GPU memory changes over time. There is no need to log the GPU memory before the start of the first cycle.
An alert can be added to notify the user about a potential GPU memory leak in the InJob restart if the offline tool sees memory-stat changes over time.
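For illustration, here is a minimal sketch of such an offline log-processing tool. The log-line format, the growth threshold, and the `scan_log` helper are assumptions for the sketch, not the exact output this PR emits:

```python
import re
import sys

# Hypothetical log-line format; the real logger output may differ.
# Example line assumed:
#   "GPU memory after worker shutdown: rank=0 allocated=1234567 reserved=2345678"
LINE_RE = re.compile(
    r"GPU memory after worker shutdown: rank=(\d+) allocated=(\d+) reserved=(\d+)"
)


def scan_log(path, growth_threshold_bytes=64 * 1024 * 1024):
    """Flag ranks whose allocated GPU memory grows across restarts."""
    baseline = {}  # rank -> allocated bytes seen on the first dump
    with open(path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            rank, allocated = int(m.group(1)), int(m.group(2))
            if rank not in baseline:
                baseline[rank] = allocated
            elif allocated - baseline[rank] > growth_threshold_bytes:
                print(
                    f"Possible GPU memory leak on rank {rank}: "
                    f"{baseline[rank]} -> {allocated} bytes"
                )


if __name__ == "__main__":
    scan_log(sys.argv[1])
```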
> There is no need to log the GPU memory before the start of the first cycle.

I disagree. What will you compare against to know whether there was a leak? The first-cycle memory dump captures the vanilla state.
Please review the latest change, specifically _capture_baseline_gpu_memory(). There is a logger.debug message that prints the memory stats on the first cycle.
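As a rough illustration only (not the actual implementation in this PR), a baseline capture along these lines could use the standard torch.cuda memory queries; the captured fields and the single-device handling below are assumptions:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def _capture_baseline_gpu_memory(device: int = 0) -> dict:
    """Record GPU memory usage at the start of the first cycle.

    Sketch only; the helper added in this PR may capture different fields.
    """
    if not torch.cuda.is_available():
        return {}
    baseline = {
        "allocated": torch.cuda.memory_allocated(device),
        "reserved": torch.cuda.memory_reserved(device),
    }
    logger.debug("Baseline GPU memory on first cycle: %s", baseline)
    return baseline
```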
…kers. Wait for GPU memory to return to baseline levels before starting new workers after a restart. Controlled by `--ft-gpu-memory-reclaim-timeout` (default: 10 s). This helps prevent OOM errors caused by incomplete memory cleanup between restarts.
It can be enabled via `--ft-enable-gpu-memory-check=true`.
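To make the reclaim wait concrete, here is a hedged sketch of a polling loop that waits for allocated memory to fall back near the captured baseline. The flags `--ft-gpu-memory-reclaim-timeout` and `--ft-enable-gpu-memory-check` come from this PR; the `wait_for_gpu_memory_reclaim` helper, the tolerance, and the poll interval are hypothetical:

```python
import logging
import time

import torch

logger = logging.getLogger(__name__)


def wait_for_gpu_memory_reclaim(
    baseline_allocated: int,
    timeout_s: float = 10.0,          # mirrors --ft-gpu-memory-reclaim-timeout
    poll_interval_s: float = 0.5,
    tolerance_bytes: int = 32 * 1024 * 1024,
) -> bool:
    """Poll until allocated GPU memory returns close to the recorded baseline.

    Returns True if memory was reclaimed within the timeout, False otherwise.
    Hypothetical helper; the launcher's real implementation may differ.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = torch.cuda.memory_allocated()
        if current <= baseline_allocated + tolerance_bytes:
            return True
        time.sleep(poll_interval_s)
    logger.warning(
        "GPU memory did not return to baseline within %.1fs: %d bytes still allocated",
        timeout_s,
        torch.cuda.memory_allocated(),
    )
    return False
```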