Support injecting Torch PET_* envs into trainer init containers #3416

@panpan0000

Description

What you would like to be added?

Trainer v2 injects the Torch PET_* envs only into the main trainer container.
May I propose also injecting the same PET_* envs into all trainer init containers in the AncestorTrainer PodSet?
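A minimal sketch of the proposed behavior (not the actual Trainer code): copy PET_* rendezvous envs from the main trainer container into every init container. Plain dicts stand in here for the corev1.Container objects the Trainer would actually mutate, and the function name `inject_pet_envs` is hypothetical.

```python
# Sketch only: dicts stand in for corev1.Container; the real change would
# live in Trainer v2's pod template mutation, not in user code.
PET_PREFIX = "PET_"
EXTRA_KEYS = {"MASTER_ADDR", "MASTER_PORT"}

def inject_pet_envs(main_container, init_containers):
    """Copy PET_* (and MASTER_ADDR / MASTER_PORT) env vars from the main
    trainer container, without overwriting values a user already set on
    an init container."""
    for key, value in main_container.get("env", {}).items():
        if key.startswith(PET_PREFIX) or key in EXTRA_KEYS:
            for init in init_containers:
                init.setdefault("env", {}).setdefault(key, value)
    return init_containers
```

Because existing env values are never overwritten, a user who explicitly sets one of these vars on an init container keeps their value, which matches the "no new user-facing field" scope below.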

Why is this needed?

Init containers can be used for distributed pre-flight checks (network/fabric/storage, etc.).
Problems with the current workarounds:

  • Running two separate jobs (a first job for the pre-flight check, a second for training/inference) can produce different scheduling results, so the check does not validate the nodes the training actually lands on.
  • Running the check in the main container fails when the master_addr Service endpoint is not yet ready (e.g. readiness gated on vLLM being up or the master training pod being ready).

Using an init container avoids both problems.

# torchrun picks these up from the environment:
#   --nnodes         <- $PET_NNODES
#   --nproc_per_node <- $PET_NPROC_PER_NODE
#   --node_rank      <- $PET_NODE_RANK
#   rendezvous       <- $MASTER_ADDR / $MASTER_PORT
torchrun "$NCCL_TEST"

Without the PET_* envs, init containers cannot perform rank-aware checks, so failures are discovered only after training starts.
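To illustrate what "rank-aware" means here, a hypothetical pre-flight script an init container could run once the PET_* envs are injected. Env var names mirror torchrun's; the reachability probe is a plain TCP dial standing in for a real NCCL/fabric test, and `preflight` / `check_master_reachable` are illustrative names, not Trainer APIs.

```python
# Hypothetical init-container pre-flight check, assuming the PET_* envs
# proposed in this issue are injected. A TCP dial approximates "can this
# rank reach the rendezvous endpoint"; a real check would run NCCL tests.
import os
import socket

def check_master_reachable(addr, port, timeout=5.0):
    """Return True if the rendezvous endpoint accepts TCP connections."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False

def preflight():
    node_rank = int(os.environ.get("PET_NODE_RANK", "0"))
    nnodes = int(os.environ.get("PET_NNODES", "1"))
    addr = os.environ.get("MASTER_ADDR", "localhost")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    if node_rank == 0:
        # Rank 0 hosts the rendezvous; nothing to dial yet.
        print(f"node 0/{nnodes}: will serve rendezvous on port {port}")
        return True
    ok = check_master_reachable(addr, port)
    print(f"node {node_rank}/{nnodes}: {addr}:{port} reachable={ok}")
    return ok

if __name__ == "__main__":
    preflight()
```

A non-zero exit from such a script would fail the init container and hold the pod before the training container ever starts, which is exactly the early-failure behavior the issue asks for.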

Scope

  • No CRD/API change
  • No scheduler behavior change
  • No new user-facing field

Will create a KEP soon.

Love this feature?

Give it a 👍 We prioritize the features with the most 👍
