
fix(runtimes): propagate trainer environment variables to worker processes#3454

Open
AviralKaushal wants to merge 1 commit into kubeflow:master from AviralKaushal:fix/mpi-env-propagation

Conversation

@AviralKaushal

  1. Problem Overview
    Issue: Environment variables defined in TrainJob.Spec.Trainer.Env were not reaching the training processes on worker nodes when using the MPI backend.
    Impact: Users had to manually inject variables with -x flags in their mpirun commands, which defeated the purpose of the TrainJob abstraction.

  2. Root Cause Analysis
    I discovered that the problem existed at two distinct layers:

Layer 1: Operator Logic (Kubernetes Pod Spec)
I identified that the builder.go logic in the Trainer operator had a restrictive check. It only injected environment variables into worker pods if runLauncherAsNode was set to true. Since many MPI jobs run with the launcher as a separate entity, the worker pods were being created without the environment variables even appearing in their Kubernetes container specs.

Layer 2: SSH Handshake (Runtime Handover)
I also realized that even if Kubernetes injected the variables into the worker pod's environment (PID 1), OpenMPI connects to workers via SSH. By default, sshd strips all environment variables for security. This meant the training script—running as a child process of sshd—could not access the user's variables.

  3. My Solution
    Phase 1: Fixing the Operator Logic
    I modified trainer/pkg/runtime/framework/plugins/jobset/builder.go to unconditionally inject the Env block into any replicated job named node. This ensures that worker nodes always receive the environment variables regardless of the launcher configuration.

Phase 2: Automating the SSH Boundary
Instead of asking users to modify sshd_config or their commands, I utilized a native OpenMPI feature. I updated trainer/pkg/runtime/framework/plugins/mpi/mpi.go to automatically calculate the list of custom environment variables and inject them into the Launcher pod via the OMPI_MCA_mca_base_env_list parameter.

This tells OpenMPI to automatically export these specific variables whenever it starts a process on a remote worker.

  4. Implementation Details
    I modified or created the following files to implement the fix:

builder.go: Removed the restrictive logic check.
mpi.go: Implemented the automatic OMPI_MCA variable calculation.
constants.go: Defined the necessary MPI constant.
builder_test.go: Added a comprehensive suite of unit tests to prevent regressions.
5. Challenges Faced & Overcome
Cluster Synchronization: During testing in the kind cluster, I hit ErrImagePull issues because the deployment's imagePullPolicy was set to Always. I fixed this by patching the deployment to IfNotPresent and reloading my local fix images into the kind nodes.
Verification Accuracy: I initially relied on kubectl logs, but I realized that shell interpolation can hide missing variables. I shifted to using kubectl get pod -o yaml as my absolute source of truth to confirm the container specs were correct.

  6. Final Verification
    I verified the fix with the following results:

Worker Pod YAML: Confirmed MY_CUSTOM_VAR is correctly injected by the operator.
Launcher Pod YAML: Confirmed OMPI_MCA_mca_base_env_list contains the correct variable names.
Unit Tests: Achieved a 100% pass rate for Success, Empty, and Merge scenarios.

  7. Verification Logs
    From kubectl get pod mpi-env-bug-test-launcher-0-0-gd7z2 -o yaml:

containers:
- name: node
  command:
  - mpirun
  - --allow-run-as-root
  - -np
  - "2"
  - /bin/sh
  - -c
  - 'echo MY_CUSTOM_VAR_VALUE: [${MY_CUSTOM_VAR}] && sleep 600'
  env:
  - name: OMPI_MCA_mca_base_env_list
    value: MY_CUSTOM_VAR
  - name: MY_CUSTOM_VAR
    value: hello-from-trainjob

    From kubectl logs mpi-env-bug-test-launcher-0-0-gd7z2:

MY_CUSTOM_VAR_VALUE: [hello-from-trainjob]

Copilot AI review requested due to automatic review settings April 25, 2026 11:10
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes MPI TrainJob env var propagation so that variables defined in TrainJob.Spec.Trainer.Env reliably reach training processes on worker nodes (including across OpenMPI’s SSH launch boundary).

Changes:

  • Propagate Trainer.Env into any JobSet replicated job named node (independent of runLauncherAsNode).
  • For OpenMPI, derive an env-var name list from Trainer.Env and set OMPI_MCA_mca_base_env_list on the launcher container to export those variables to remote processes.
  • Add a unit test for JobSet builder env propagation behavior and introduce a constant for the OpenMPI env-list key.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
pkg/runtime/framework/plugins/mpi/mpi.go Adds OpenMPI mca_base_env_list calculation and env injection on the launcher container.
pkg/runtime/framework/plugins/jobset/builder.go Removes runLauncherAsNode gating so node replicated jobs always receive trainer env/resources.
pkg/constants/constants.go Defines OpenMPIEnvBaseEnvList constant.
pkg/runtime/framework/plugins/jobset/builder_test.go Adds unit tests validating trainer env propagation into the node replicated job.

Comment thread pkg/runtime/framework/plugins/mpi/mpi.go
Comment thread pkg/runtime/framework/plugins/mpi/mpi.go
…esses

Address issue kubeflow#3427 by:
1. Ensuring environment variables are injected into MPI worker pods regardless of runLauncherAsNode setting.
2. Automatically populating OMPI_MCA_mca_base_env_list on the launcher to propagate variables across the SSH boundary.
3. Filtering out pod-specific environment variables (ValueFrom) during propagation.
4. Adding comprehensive unit tests for these scenarios.

Closes kubeflow#3427

Signed-off-by: AviralKaushal <aviralkaush@gmail.com>
@AviralKaushal AviralKaushal force-pushed the fix/mpi-env-propagation branch from ed37db7 to b83a935 Compare April 25, 2026 11:39
@AviralKaushal AviralKaushal changed the title fix(mpi): propagate trainer environment variables to worker processes fix(runtimes): propagate trainer environment variables to worker processes Apr 25, 2026
@AviralKaushal AviralKaushal requested a review from Copilot April 26, 2026 10:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread pkg/runtime/framework/plugins/jobset/builder.go
}
}
-if ancestor == constants.AncestorTrainer || b.isRunLauncherAsNode(info) && *rJob.Name == constants.Node {
+if ancestor == constants.AncestorTrainer || *rJob.Name == constants.Node {

This change is broader than just MPI. The old condition was gated on isRunLauncherAsNode, so it only kicked in for MPI configs. Now any replicated job named node (PyTorch, DeepSpeed, etc.) will get env vars and resourcesPerNode injected unconditionally. Is that the intent here? If so, might be worth noting that in the PR description since it changes behavior for all runtimes, not just MPI.

envList := ""
for i, name := range envNames {
if i > 0 {
envList += ";"

nit: you could use strings.Join(envNames, ";") here instead of the manual loop. Cleaner and does the same thing.

}

if len(actualEnv) != len(tc.expectedEnv) {
t.Fatalf("Expected %d environment variables, got %d", len(tc.expectedEnv), len(actualEnv))

The existing tests in jobset_test.go use go-cmp (cmp.Diff) for comparing results. This file does manual index-based comparison with t.Fatalf/t.Errorf. Would be good to keep them consistent so the test patterns stay uniform across the package.
