Description
On a Slurm cluster, I used sbatch to launch ft_launcher across two nodes (16 GPUs in total) to run Megatron-LM's pretrain_gpt.py. During execution, I inspected the processes with the following command:
```
ps aux | grep pretrain_gpt.py | grep -v grep
```
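For reference, a slightly richer view of the same processes (parent PID, state, and full command line) can help separate the launcher from its children. This is only a sketch using standard procps/psmisc tools; the grep pattern is illustrative and pstree may not be installed on every node:

```
# Show PID, parent PID, process state, and full command line for the
# launcher and training processes on the current node.
ps -eo pid,ppid,stat,cmd | grep -E 'ft_launcher|pretrain_gpt' | grep -v grep

# Alternatively, print the process tree rooted at the first ft_launcher
# process, which makes the parent/child relationships explicit.
pstree -ap "$(pgrep -f ft_launcher | head -n 1)"
```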
The output showed:

- The srun ft_launcher process is visible only on the master node, with one instance in the S (sleeping) state.
- A total of 9 ft_launcher processes are present: 1 in the Sl state and 8 in the S state.
My questions are as follows:

1. The srun ft_launcher process on the master node
   - What role does this S-state srun ft_launcher process play in the overall training workflow?
2. The 8 S-state ft_launcher processes
   - Could these correspond to the rankMonitorServer processes?
   - My observation is that the number of these processes scales with the number of ranks (a per-node count check is sketched after this list).
   - However, I find this puzzling because:
     - When I manually kill these processes, training is unaffected and no restart is triggered.
     - According to the official documentation, the MonitorServer should be launched only once, not replicated per rank.
3. Contradiction in the monitoring logic
   - If these 8 S-state ft_launcher processes really are monitoring processes, then once they are killed they should no longer be able to detect the liveness of the worker processes.
   - In practice, however, even after killing them, when I subsequently kill a worker process, ft_launcher still triggers the worker's restart mechanism.
   - This suggests they may not actually be the monitoring processes.
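To double-check the scaling observation in question 2, a per-node count can be taken against the running allocation. This is a minimal sketch: <JOBID> is a placeholder, and the --overlap flag assumes a reasonably recent Slurm (21.08 or later):

```
# Count ft_launcher-related processes on every node of the running job.
# The [f]t_launcher pattern keeps pgrep from matching this bash process itself.
srun --jobid=<JOBID> --overlap --ntasks-per-node=1 \
    bash -c 'echo "$(hostname): $(pgrep -fc "[f]t_launcher")"'
```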
Therefore, my core question is:
👉 What is the actual role of these 8 S-state ft_launcher processes? If they are not rankMonitorServer, then which processes does ft_launcher rely on to implement its monitoring mechanism?
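Finally, one way to see what the 8 S-state processes are actually doing is to dump their Python stacks. This is only a sketch: py-spy is a third-party tool that must be installed separately, attaching may require ptrace permission on the node, and the [f]t_launcher pattern is the same illustrative filter as above:

```
# For each matching PID, print its full command line and then dump the
# Python stack of every thread, which can reveal whether the process is
# the launcher, a rank monitor, or something else.
for pid in $(pgrep -f '[f]t_launcher'); do
    echo "=== PID ${pid} ==="
    tr '\0' ' ' < "/proc/${pid}/cmdline"; echo
    py-spy dump --pid "${pid}" || true
done
```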