What you would like to be added?
Trainer v2 currently injects the Torch PET_* envs only into the main trainer container.
May I propose also injecting the same PET_* envs into all init containers of the AncestorTrainer PodSet?
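For illustration, a rough sketch of the resulting pod spec, assuming the injection simply reuses the PET_* variables the Torch plugin already sets on the trainer container (container names and values below are placeholders, not actual Trainer output):

```yaml
apiVersion: v1
kind: Pod
spec:
  initContainers:
    - name: preflight                  # hypothetical user-supplied init container
      image: example.com/preflight:latest
      env:                             # same PET_* envs as the main trainer container
        - name: PET_NNODES
          value: "4"
        - name: PET_NPROC_PER_NODE
          value: "8"
        - name: PET_NODE_RANK
          value: "0"
        - name: MASTER_ADDR
          value: "<master node headless service DNS name>"
        - name: MASTER_PORT
          value: "<master port>"
  containers:
    - name: trainer                    # already receives these envs today
      image: example.com/trainer:latest
```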
Why is this needed?
Init containers can be used for distributed pre-flight checks (network, fabric, storage, etc.).
Problems with the existing workarounds:
- Running two separate jobs (one to do the pre-flight check, one to run training/inference) can yield different scheduling results, so the check does not validate the nodes the training job actually lands on.
- Doing the check in the main container can fail because the master_addr Service endpoint may not be ready yet (e.g. when readiness is gated on vLLM being up or on the master training pod being ready).
Using an init container avoids both problems. With the PET_* envs injected, an init container could run something like:
```bash
# torchrun picks these up from the injected environment:
#   PET_NNODES          -> --nnodes
#   PET_NPROC_PER_NODE  -> --nproc_per_node
#   PET_NODE_RANK       -> --node_rank
#   MASTER_ADDR / MASTER_PORT -> rendezvous endpoint
echo "$NCCL_TEST" > /tmp/nccl_test.py
torchrun /tmp/nccl_test.py
```
Without the PET_* envs, init containers cannot perform rank-aware checks, so these failures are only discovered after training has already started.
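For completeness, a minimal sketch of what `$NCCL_TEST` could contain; the exact check is up to the user, and this script (including the assertion) is purely illustrative:

```python
# Rank-aware NCCL pre-flight check (illustrative only).
# torchrun sets RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR/MASTER_PORT
# for each worker process it spawns.
import os

import torch
import torch.distributed as dist


def main():
    # Reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # A single all-reduce exercises NCCL across every rank and fails
    # fast if the fabric, topology, or rendezvous is broken.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    assert t.item() == dist.get_world_size(), "all_reduce result mismatch"

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If any node cannot reach the others, `init_process_group` or the all-reduce times out and the init container fails, so the pod never starts the main trainer container.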
Scope
- No CRD/API change
- No scheduler behavior change
- No new user-facing field
I will create a KEP soon.
Love this feature?
Give it a 👍 We prioritize the features with most 👍