
Questions about the ft_launcher process #182

@xiaotaozi121096


In a Slurm system, I used sbatch to launch ft_launcher across two nodes with a total of 16 GPUs, running Megatron-LM’s pretrain_gpt.py. During execution, I inspected the processes with the following command:

ps aux | grep pretrain_gpt.py | grep -v grep

The output showed:

[Image: screenshot of the ps output]

The srun ft_launcher process is only visible on the master node, with one instance in the S (sleeping) state.

A total of 9 ft_launcher processes are present: 1 in the Sl state, and 8 in the S state.
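
One way to see how these processes relate to each other is to list each one's parent PID and thread count next to its state. The following is a minimal sketch, assuming a Linux `ps` from procps that supports the `pid,ppid,stat,nlwp,args` output fields. The helper names `live_snapshot`, `parse_ps`, and `group_by_parent` are made up for illustration and are not part of ft_launcher. In `ps`, a STAT of `S` means interruptible sleep and a trailing `l` means the process is multi-threaded, so several single-threaded `S` processes sharing one `Sl` parent would suggest child processes started by the launcher. That alone does not say what those children do.

```python
import subprocess
from collections import defaultdict


def live_snapshot():
    """Return raw `ps` output for all processes (assumes procps on Linux)."""
    return subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,nlwp,args"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_ps(text):
    """Parse `ps -eo pid,ppid,stat,nlwp,args` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 4)  # args may contain spaces, keep them whole
        if len(parts) < 5:
            continue
        pid, ppid, stat, nlwp, args = parts
        rows.append({"pid": int(pid), "ppid": int(ppid), "stat": stat,
                     "threads": int(nlwp), "args": args})
    return rows


def group_by_parent(rows, needle="ft_launcher"):
    """Map parent PID -> [(pid, stat, threads)] for processes matching needle."""
    groups = defaultdict(list)
    for row in rows:
        if needle in row["args"]:
            groups[row["ppid"]].append((row["pid"], row["stat"], row["threads"]))
    return dict(groups)
```

Calling `group_by_parent(parse_ps(live_snapshot()))` on each node prints nothing by itself but returns a dictionary that can be inspected to see which PID each ft_launcher process was started from.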

My questions are as follows:

The srun ft_launcher process on the master node
What role does this srun ft_launcher process in the S state play in the overall training workflow?

The 8 S-state ft_launcher processes

Could these possibly correspond to the rankMonitorServer processes?

My observation is that the number of these processes scales with the number of ranks.

However, I find this puzzling because:

When I manually kill these processes, the training is unaffected and no restart is triggered.

According to the official documentation, the rankMonitorServer should be launched only once, not once per rank.

Contradiction in the monitoring logic

If these 8 S-state ft_launcher processes are indeed monitoring processes, then once they are killed, they should no longer be able to detect the liveness of worker processes.

However, in practice, even after killing these processes, killing a worker process still causes ft_launcher to trigger the worker restart mechanism.

This suggests that they may not actually be the monitoring processes.
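
The kill experiment above could be made more precise by recording, before each kill, the chain of parent processes for both a worker and one of the S-state processes. The following is a minimal sketch, assuming a Linux `/proc` filesystem. `read_ppid` and `ancestry` are hypothetical helper names and are not part of ft_launcher or Slurm.

```python
from pathlib import Path


def read_ppid(pid):
    """Return the parent PID of `pid` from /proc/<pid>/status (Linux only)."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("PPid:"):
            return int(line.split()[1])
    raise ValueError(f"no PPid field for pid {pid}")


def ancestry(pid, parent_of=read_ppid):
    """Return [pid, parent, grandparent, ...], stopping at PID 1 or 0."""
    chain = [pid]
    while chain[-1] > 1:
        chain.append(parent_of(chain[-1]))
    return chain
```

Comparing `ancestry(worker_pid)` with `ancestry(s_state_pid)` shows whether the worker descends from the `Sl` launcher directly or from one of the `S` processes, which bears on whether killing an `S` process could affect worker monitoring.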

Therefore, my core question is:
👉 What is the actual role of these 8 S-state ft_launcher processes? If they are not rankMonitorServer, then which processes does ft_launcher rely on to implement its monitoring mechanism?
