fix(runtimes): propagate trainer environment variables to worker processes #3454

AviralKaushal wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has not yet been approved; the full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:
- Join the community.
- Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR fixes MPI TrainJob env var propagation so that variables defined in TrainJob.Spec.Trainer.Env reliably reach training processes on worker nodes (including across OpenMPI’s SSH launch boundary).
Changes:
- Propagate `Trainer.Env` into any JobSet replicated job named `node` (independent of `runLauncherAsNode`).
- For OpenMPI, derive an env-var name list from `Trainer.Env` and set `OMPI_MCA_mca_base_env_list` on the launcher container to export those variables to remote processes.
- Add a unit test for JobSet builder env propagation behavior and introduce a constant for the OpenMPI env-list key.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/runtime/framework/plugins/mpi/mpi.go | Adds OpenMPI mca_base_env_list calculation and env injection on the launcher container. |
| pkg/runtime/framework/plugins/jobset/builder.go | Removes runLauncherAsNode gating so node replicated jobs always receive trainer env/resources. |
| pkg/constants/constants.go | Defines OpenMPIEnvBaseEnvList constant. |
| pkg/runtime/framework/plugins/jobset/builder_test.go | Adds unit tests validating trainer env propagation into the node replicated job. |
…esses

Address issue kubeflow#3427 by:
1. Ensuring environment variables are injected into MPI worker pods regardless of the runLauncherAsNode setting.
2. Automatically populating OMPI_MCA_mca_base_env_list on the launcher to propagate variables across the SSH boundary.
3. Filtering out pod-specific environment variables (ValueFrom) during propagation.
4. Adding comprehensive unit tests for these scenarios.

Closes kubeflow#3427

Signed-off-by: AviralKaushal <aviralkaush@gmail.com>
Force-pushed from ed37db7 to b83a935.
```diff
-	if ancestor == constants.AncestorTrainer || b.isRunLauncherAsNode(info) && *rJob.Name == constants.Node {
+	if ancestor == constants.AncestorTrainer || *rJob.Name == constants.Node {
```
This change is broader than just MPI. The old condition was gated on isRunLauncherAsNode, so it only kicked in for MPI configs. Now any replicated job named node (PyTorch, DeepSpeed, etc.) will get env vars and resourcesPerNode injected unconditionally. Is that the intent here? If so, might be worth noting that in the PR description since it changes behavior for all runtimes, not just MPI.
```go
envList := ""
for i, name := range envNames {
	if i > 0 {
		envList += ";"
```
nit: you could use strings.Join(envNames, ";") here instead of the manual loop. Cleaner and does the same thing.
```go
if len(actualEnv) != len(tc.expectedEnv) {
	t.Fatalf("Expected %d environment variables, got %d", len(tc.expectedEnv), len(actualEnv))
```
The existing tests in jobset_test.go use go-cmp (cmp.Diff) for comparing results. This file does manual index-based comparison with t.Fatalf/t.Errorf. Would be good to keep them consistent so the test patterns stay uniform across the package.
Problem Overview
Issue: Environment variables defined in TrainJob.Spec.Trainer.Env were not reaching the training processes on worker nodes when using the MPI backend.

Impact: Users had to manually forward each variable with -x flags in their mpirun commands, which defeated the purpose of the TrainJob abstraction.
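For context, the manual workaround looked roughly like this (a hypothetical job command; OpenMPI's `-x` flag exports a named variable to the remote ranks):

```shell
# Workaround before this fix: forward every variable by hand.
mpirun -np 4 -x MY_CUSTOM_VAR -x NCCL_DEBUG python train.py
```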
Root Cause Analysis
I discovered that the problem existed at two distinct layers:
Layer 1: Operator Logic (Kubernetes Pod Spec)
I identified that the builder.go logic in the Trainer operator had a restrictive check. It only injected environment variables into worker pods if runLauncherAsNode was set to true. Since many MPI jobs run with the launcher as a separate entity, the worker pods were being created without the environment variables even appearing in their Kubernetes container specs.
Layer 2: SSH Handshake (Runtime Handover)
I also realized that even if Kubernetes injected the variables into the worker pod's environment (PID 1), OpenMPI connects to workers via SSH. By default, sshd strips nearly all environment variables for security, passing through only those whitelisted via AcceptEnv. This meant the training script, running as a child process of sshd, could not access the user's variables.
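As an illustration, a typical distro-default sshd_config only lets locale variables through, which is why a variable like MY_CUSTOM_VAR vanishes on the SSH hop (the exact defaults vary by base image):

```
# /etc/ssh/sshd_config (typical default)
# Only environment variables matching an AcceptEnv pattern survive the SSH
# handshake; anything else is dropped by sshd.
AcceptEnv LANG LC_*
```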
Phase 1: Fixing the Operator Logic
I modified trainer/pkg/runtime/framework/plugins/jobset/builder.go to unconditionally inject the Env block into any replicated job named node. This ensures that worker nodes always receive the environment variables regardless of the launcher configuration.
Phase 2: Automating the SSH Boundary
Instead of asking users to modify sshd_config or their commands, I utilized a native OpenMPI feature. I updated trainer/pkg/runtime/framework/plugins/mpi/mpi.go to automatically calculate the list of custom environment variables and inject them into the Launcher pod via the OMPI_MCA_mca_base_env_list parameter.
This tells OpenMPI to automatically export these specific variables whenever it starts a process on a remote worker.
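The calculation can be sketched roughly as follows. This is a simplified stand-alone sketch, not the actual mpi.go code; `EnvVar` and `EnvVarSource` here stand in for the Kubernetes `corev1` types, and `buildEnvList` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for corev1.EnvVar / corev1.EnvVarSource.
type EnvVarSource struct{}

type EnvVar struct {
	Name      string
	Value     string
	ValueFrom *EnvVarSource
}

// buildEnvList collects the names of plain-value variables, skipping
// pod-specific ValueFrom entries, and joins them into the semicolon-separated
// list that OMPI_MCA_mca_base_env_list expects.
func buildEnvList(env []EnvVar) string {
	names := make([]string, 0, len(env))
	for _, e := range env {
		if e.ValueFrom != nil {
			continue // resolved per pod; cannot be re-exported over SSH
		}
		names = append(names, e.Name)
	}
	return strings.Join(names, ";")
}

func main() {
	env := []EnvVar{
		{Name: "MY_CUSTOM_VAR", Value: "hello-from-trainjob"},
		{Name: "POD_IP", ValueFrom: &EnvVarSource{}},
	}
	fmt.Println("OMPI_MCA_mca_base_env_list=" + buildEnvList(env))
}
```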
I modified or created the following files to implement the fix:
builder.go
: Removed the restrictive logic check.
mpi.go
: Implemented the automatic OMPI_MCA variable calculation.
constants.go
: Defined the necessary MPI constant.
builder_test.go
: Created a comprehensive suite of unit tests to prevent regressions.
Challenges Faced & Overcome
Cluster Synchronization: During testing in the kind cluster, I hit ErrImagePull issues because the deployment's imagePullPolicy was set to Always. I fixed this by patching the deployment to IfNotPresent and reloading my local fix images into the kind nodes.
Verification Accuracy: I initially relied on kubectl logs, but I realized that shell interpolation can hide missing variables. I shifted to using kubectl get pod -o yaml as my absolute source of truth to confirm the container specs were correct.
I have successfully verified the fix with the following results:
- Worker Pod YAML: confirmed MY_CUSTOM_VAR is correctly injected by the operator.
- Launcher Pod YAML: confirmed OMPI_MCA_mca_base_env_list contains the correct variable names.
- Unit Tests: achieved a 100% pass rate for the Success, Empty, and Merge scenarios.
Verification Logs
From `kubectl get pod mpi-env-bug-test-launcher-0-0-gd7z2 -o yaml`:

```yaml
containers:
- command:
  env:
  - name: MY_CUSTOM_VAR
    value: hello-from-trainjob
```
From `kubectl logs mpi-env-bug-test-launcher-0-0-gd7z2`:

```
MY_CUSTOM_VAR_VALUE: [hello-from-trainjob]
```