fix(runtimes): propagate trainer environment variables to worker processes #3454

AviralKaushal wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has not yet been approved; the full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:
- Join the community.
- Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR fixes MPI TrainJob env var propagation so that variables defined in TrainJob.Spec.Trainer.Env reliably reach training processes on worker nodes (including across OpenMPI’s SSH launch boundary).
Changes:
- Propagate `Trainer.Env` into any JobSet replicated job named `node` (independent of `runLauncherAsNode`).
- For OpenMPI, derive an env-var name list from `Trainer.Env` and set `OMPI_MCA_mca_base_env_list` on the launcher container to export those variables to remote processes.
- Add a unit test for JobSet builder env propagation behavior and introduce a constant for the OpenMPI env-list key.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/runtime/framework/plugins/mpi/mpi.go | Adds OpenMPI mca_base_env_list calculation and env injection on the launcher container. |
| pkg/runtime/framework/plugins/jobset/builder.go | Removes runLauncherAsNode gating so node replicated jobs always receive trainer env/resources. |
| pkg/constants/constants.go | Defines OpenMPIEnvBaseEnvList constant. |
| pkg/runtime/framework/plugins/jobset/builder_test.go | Adds unit tests validating trainer env propagation into the node replicated job. |
…esses

Address issue kubeflow#3427 by:
1. Ensuring environment variables are injected into MPI worker pods regardless of the runLauncherAsNode setting.
2. Automatically populating OMPI_MCA_mca_base_env_list on the launcher to propagate variables across the SSH boundary.
3. Filtering out pod-specific environment variables (ValueFrom) during propagation.
4. Adding comprehensive unit tests for these scenarios.

Closes kubeflow#3427

Signed-off-by: AviralKaushal <aviralkaush@gmail.com>
Force-pushed from ed37db7 to b83a935.
```diff
-	if ancestor == constants.AncestorTrainer || b.isRunLauncherAsNode(info) && *rJob.Name == constants.Node {
+	if ancestor == constants.AncestorTrainer || *rJob.Name == constants.Node {
```
This change is broader than just MPI. The old condition was gated on isRunLauncherAsNode, so it only kicked in for MPI configs. Now any replicated job named node (PyTorch, DeepSpeed, etc.) will get env vars and resourcesPerNode injected unconditionally. Is that the intent here? If so, might be worth noting that in the PR description since it changes behavior for all runtimes, not just MPI.
```go
envList := ""
for i, name := range envNames {
	if i > 0 {
		envList += ";"
```
nit: you could use strings.Join(envNames, ";") here instead of the manual loop. Cleaner and does the same thing.
```go
if len(actualEnv) != len(tc.expectedEnv) {
	t.Fatalf("Expected %d environment variables, got %d", len(tc.expectedEnv), len(actualEnv))
```
The existing tests in jobset_test.go use go-cmp (cmp.Diff) for comparing results. This file does manual index-based comparison with t.Fatalf/t.Errorf. Would be good to keep them consistent so the test patterns stay uniform across the package.
Problem Overview
Issue: Environment variables defined in TrainJob.Spec.Trainer.Env were not reaching the training processes on worker nodes when using the MPI backend.

Impact: Users had to manually forward each variable with -x flags in their mpirun commands, which defeated the purpose of the TrainJob abstraction.
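For context, the manual workaround looked roughly like this (a hypothetical job command; OpenMPI's `-x` flag exports a named variable to the remote ranks):

```shell
# Workaround before this fix: forward every variable by hand.
mpirun -np 4 -x MY_CUSTOM_VAR -x NCCL_DEBUG python train.py
```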
Root Cause Analysis
I discovered that the problem existed at two distinct layers:
Layer 1: Operator Logic (Kubernetes Pod Spec)
I identified that the builder.go logic in the Trainer operator had a restrictive check. It only injected environment variables into worker pods if runLauncherAsNode was set to true. Since many MPI jobs run with the launcher as a separate entity, the worker pods were being created without the environment variables even appearing in their Kubernetes container specs.
Layer 2: SSH Handshake (Runtime Handover)
I also realized that even if Kubernetes injected the variables into the worker pod's environment (PID 1), OpenMPI connects to workers via SSH. By default, sshd strips nearly all environment variables for security, passing through only those whitelisted via AcceptEnv. This meant the training script, running as a child process of sshd, could not access the user's variables.
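As an illustration, a typical distro-default sshd_config only lets locale variables through, which is why a variable like MY_CUSTOM_VAR vanishes on the SSH hop (the exact defaults vary by base image):

```
# /etc/ssh/sshd_config (typical default)
# Only environment variables matching an AcceptEnv pattern survive the SSH
# handshake; anything else is dropped by sshd.
AcceptEnv LANG LC_*
```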
Phase 1: Fixing the Operator Logic
I modified trainer/pkg/runtime/framework/plugins/jobset/builder.go to unconditionally inject the Env block into any replicated job named node. This ensures that worker nodes always receive the environment variables regardless of the launcher configuration.
Phase 2: Automating the SSH Boundary
Instead of asking users to modify sshd_config or their commands, I utilized a native OpenMPI feature. I updated trainer/pkg/runtime/framework/plugins/mpi/mpi.go to automatically calculate the list of custom environment variables and inject them into the Launcher pod via the OMPI_MCA_mca_base_env_list parameter.
This tells OpenMPI to automatically export these specific variables whenever it starts a process on a remote worker.
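The calculation can be sketched roughly as follows. This is a simplified stand-alone sketch, not the actual mpi.go code; `EnvVar` and `EnvVarSource` here stand in for the Kubernetes `corev1` types, and `buildEnvList` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for corev1.EnvVar / corev1.EnvVarSource.
type EnvVarSource struct{}

type EnvVar struct {
	Name      string
	Value     string
	ValueFrom *EnvVarSource
}

// buildEnvList collects the names of plain-value variables, skipping
// pod-specific ValueFrom entries, and joins them into the semicolon-separated
// list that OMPI_MCA_mca_base_env_list expects.
func buildEnvList(env []EnvVar) string {
	names := make([]string, 0, len(env))
	for _, e := range env {
		if e.ValueFrom != nil {
			continue // resolved per pod; cannot be re-exported over SSH
		}
		names = append(names, e.Name)
	}
	return strings.Join(names, ";")
}

func main() {
	env := []EnvVar{
		{Name: "MY_CUSTOM_VAR", Value: "hello-from-trainjob"},
		{Name: "POD_IP", ValueFrom: &EnvVarSource{}},
	}
	fmt.Println("OMPI_MCA_mca_base_env_list=" + buildEnvList(env))
}
```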
I modified or created the following files to implement the fix:
builder.go
: Removed the restrictive logic check.
mpi.go
: Implemented the automatic OMPI_MCA variable calculation.
constants.go
: Defined the necessary MPI constant.
builder_test.go
: Created a comprehensive suite of unit tests to prevent regressions.
Challenges Faced & Overcome
Cluster Synchronization: During testing in the kind cluster, I hit ErrImagePull issues because the deployment's imagePullPolicy was set to Always. I fixed this by patching the deployment to IfNotPresent and reloading my local fix images into the kind nodes.
Verification Accuracy: I initially relied on kubectl logs, but I realized that shell interpolation can hide missing variables. I shifted to using kubectl get pod -o yaml as my absolute source of truth to confirm the container specs were correct.
I have successfully verified the fix with the following results:
- Worker Pod YAML: confirmed MY_CUSTOM_VAR is correctly injected by the operator.
- Launcher Pod YAML: confirmed OMPI_MCA_mca_base_env_list contains the correct variable names.
- Unit Tests: achieved a 100% pass rate for the Success, Empty, and Merge scenarios.
Verification Logs
From `kubectl get pod mpi-env-bug-test-launcher-0-0-gd7z2 -o yaml`:

```yaml
containers:
- command:
  env:
  - name: MY_CUSTOM_VAR
    value: hello-from-trainjob
```
From `kubectl logs mpi-env-bug-test-launcher-0-0-gd7z2`:

```
MY_CUSTOM_VAR_VALUE: [hello-from-trainjob]
```