What you would like to be added?
Trainer v2 currently injects the Torch PET_* envs only into the main trainer container.
May I propose also injecting the same PET_* envs into all init containers of the AncestorTrainer PodSet?
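For illustration, a rough sketch of the resulting pod spec, assuming the injection simply reuses the PET_* variables the Torch plugin already sets on the trainer container (container names and values below are placeholders, not actual Trainer output):

```yaml
apiVersion: v1
kind: Pod
spec:
  initContainers:
    - name: preflight                  # hypothetical user-supplied init container
      image: example.com/preflight:latest
      env:                             # same PET_* envs as the main trainer container
        - name: PET_NNODES
          value: "4"
        - name: PET_NPROC_PER_NODE
          value: "8"
        - name: PET_NODE_RANK
          value: "0"
        - name: MASTER_ADDR
          value: "<master node headless service DNS name>"
        - name: MASTER_PORT
          value: "<master port>"
  containers:
    - name: trainer                    # already receives these envs today
      image: example.com/trainer:latest
```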
Why is this needed?
Init containers can be used for distributed pre-flight checks (network, fabric, storage, etc.).
Problems with the existing workarounds:
- Running two separate jobs (one to do the pre-flight check, one to run training/inference) can yield different scheduling results, so the check does not validate the nodes the training job actually lands on.
- Doing the check in the main container can fail because the master_addr Service endpoint may not be ready yet (e.g. when readiness is gated on vLLM being up or on the master training pod being ready).
Using an init container avoids both problems. With the PET_* envs injected, an init container could run something like:
```bash
# torchrun picks these up from the injected environment:
#   PET_NNODES          -> --nnodes
#   PET_NPROC_PER_NODE  -> --nproc_per_node
#   PET_NODE_RANK       -> --node_rank
#   MASTER_ADDR / MASTER_PORT -> rendezvous endpoint
echo "$NCCL_TEST" > /tmp/nccl_test.py
torchrun /tmp/nccl_test.py
```
Without the PET_* envs, init containers cannot perform rank-aware checks, so these failures are only discovered after training has already started.
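For completeness, a minimal sketch of what `$NCCL_TEST` could contain; the exact check is up to the user, and this script (including the assertion) is purely illustrative:

```python
# Rank-aware NCCL pre-flight check (illustrative only).
# torchrun sets RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR/MASTER_PORT
# for each worker process it spawns.
import os

import torch
import torch.distributed as dist


def main():
    # Reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # A single all-reduce exercises NCCL across every rank and fails
    # fast if the fabric, topology, or rendezvous is broken.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    assert t.item() == dist.get_world_size(), "all_reduce result mismatch"

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If any node cannot reach the others, `init_process_group` or the all-reduce times out and the init container fails, so the pod never starts the main trainer container.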
Scope
- No CRD/API change
- No scheduler behavior change
- No new user-facing field
I will create a KEP soon.
Love this feature?
Give it a 👍 We prioritize the features with most 👍