Description
On a Slurm cluster, I used sbatch to launch ft_launcher across two nodes (16 GPUs in total) to run Megatron-LM's pretrain_gpt.py. During execution, I inspected the processes with the following command:
```
ps aux | grep pretrain_gpt.py | grep -v grep
```
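For reference, a slightly richer view of the same processes (parent PID, state, and full command line) can help separate the launcher from its children. This is only a sketch using standard procps/psmisc tools; the grep pattern is illustrative and pstree may not be installed on every node:

```
# Show PID, parent PID, process state, and full command line for the
# launcher and training processes on the current node.
ps -eo pid,ppid,stat,cmd | grep -E 'ft_launcher|pretrain_gpt' | grep -v grep

# Alternatively, print the process tree rooted at the first ft_launcher
# process, which makes the parent/child relationships explicit.
pstree -ap "$(pgrep -f ft_launcher | head -n 1)"
```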
The output showed:

- The srun ft_launcher process is visible only on the master node, with one instance in the S (sleeping) state.
- A total of 9 ft_launcher processes are present: 1 in the Sl state and 8 in the S state.
My questions are as follows:

1. The srun ft_launcher process on the master node
   - What role does this S-state srun ft_launcher process play in the overall training workflow?
2. The 8 S-state ft_launcher processes
   - Could these correspond to the rankMonitorServer processes?
   - My observation is that the number of these processes scales with the number of ranks (a per-node count check is sketched after this list).
   - However, I find this puzzling because:
     - When I manually kill these processes, training is unaffected and no restart is triggered.
     - According to the official documentation, the MonitorServer should be launched only once, not replicated per rank.
3. Contradiction in the monitoring logic
   - If these 8 S-state ft_launcher processes really are monitoring processes, then once they are killed they should no longer be able to detect the liveness of the worker processes.
   - In practice, however, even after killing them, when I subsequently kill a worker process, ft_launcher still triggers the worker's restart mechanism.
   - This suggests they may not actually be the monitoring processes.
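To double-check the scaling observation in question 2, a per-node count can be taken against the running allocation. This is a minimal sketch: <JOBID> is a placeholder, and the --overlap flag assumes a reasonably recent Slurm (21.08 or later):

```
# Count ft_launcher-related processes on every node of the running job.
# The [f]t_launcher pattern keeps pgrep from matching this bash process itself.
srun --jobid=<JOBID> --overlap --ntasks-per-node=1 \
    bash -c 'echo "$(hostname): $(pgrep -fc "[f]t_launcher")"'
```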
Therefore, my core question is:
👉 What is the actual role of these 8 S-state ft_launcher processes? If they are not rankMonitorServer, then which processes does ft_launcher rely on to implement its monitoring mechanism?
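Finally, one way to see what the 8 S-state processes are actually doing is to dump their Python stacks. This is only a sketch: py-spy is a third-party tool that must be installed separately, attaching may require ptrace permission on the node, and the [f]t_launcher pattern is the same illustrative filter as above:

```
# For each matching PID, print its full command line and then dump the
# Python stack of every thread, which can reveal whether the process is
# the launcher, a rank monitor, or something else.
for pid in $(pgrep -f '[f]t_launcher'); do
    echo "=== PID ${pid} ==="
    tr '\0' ' ' < "/proc/${pid}/cmdline"; echo
    py-spy dump --pid "${pid}" || true
done
```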