Skip to content

feat: add framework-level validation for conflicting env vars across …#3237

Open
krishdef7 wants to merge 1 commit intokubeflow:masterfrom
krishdef7:feature/framework-env-conflict-validation
Open

feat: add framework-level validation for conflicting env vars across …#3237
krishdef7 wants to merge 1 commit intokubeflow:masterfrom
krishdef7:feature/framework-env-conflict-validation

Conversation

@krishdef7
Copy link
Contributor

What this PR does / why we need it

Adds framework-level validation to detect conflicts between environment variables managed by different runtime plugins (MPI, Torch), and between plugin-reserved envs and user-provided envs in a TrainJob.

Changes

  • Add MPIRunReservedEnvNames constant for OpenMPI reserved env vars
  • Add user-env conflict validation to MPI plugin (mirrors existing Torch pattern)
  • Add EnvVarsReserverPlugin interface for plugins to expose reserved env var names
  • Implement ReservedEnvVarNames() on MPI and Torch plugins
  • Add cross-plugin env conflict detection in RunCustomValidationPlugins
  • Add unit tests for MPI env validation

Which issue(s) this PR fixes

Closes #3147

Copilot AI review requested due to automatic review settings February 23, 2026 13:16
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@krishdef7 krishdef7 force-pushed the feature/framework-env-conflict-validation branch from 1976685 to 9ce322c Compare February 23, 2026 13:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds framework-level validation to detect conflicts between environment variables managed by different runtime plugins (MPI, Torch) and between plugin-reserved envs and user-provided envs in a TrainJob.

Changes:

  • Introduces EnvVarsReserverPlugin interface for plugins to expose reserved environment variable names
  • Implements the interface on MPI and Torch plugins to return their respective reserved env var sets
  • Adds cross-plugin env conflict detection in the framework's validation logic
  • Adds user-env conflict validation to the MPI plugin (mirroring existing Torch pattern)
  • Adds comprehensive test coverage for MPI env validation

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/runtime/framework/interface.go Defines new EnvVarsReserverPlugin interface for exposing reserved env var names
pkg/runtime/framework/core/framework.go Implements cross-plugin env conflict detection in RunCustomValidationPlugins
pkg/runtime/framework/plugins/mpi/mpi.go Implements EnvVarsReserverPlugin and adds user-env validation
pkg/runtime/framework/plugins/torch/torch.go Implements EnvVarsReserverPlugin interface
pkg/constants/constants.go Defines MPIRunReservedEnvNames set with OpenMPI reserved env vars
pkg/runtime/framework/plugins/mpi/mpi_test.go Adds tests for MPI reserved env validation

@krishdef7
Copy link
Contributor Author

Hi @jinchihe @kuizhiqing,

Adding a quick concrete example to clarify the validation:

If a user configures:

  • MPI plugin (reserves OMPI_* envs)
  • Torch plugin
  • and manually sets OMPI_MCA_orte_default_hostfile in spec.trainer.env

This currently passes validation but can lead to runtime issues.

With this PR, validation fails early with a clear error indicating the conflicting env.

Happy to adjust the validation scope or placement if needed.

@krishdef7
Copy link
Contributor Author

Hi @jinchihe @kuizhiqing, happy to make any changes based on your feedback. Let me know if the approach or placement needs adjustment.

@krishdef7
Copy link
Contributor Author

@jinchihe @kuizhiqing

Addressed Copilot feedback and added concrete MPI/Torch conflict example above.

Let me know if any changes are needed, otherwise happy for /lgtm.

@krishdef7 krishdef7 force-pushed the feature/framework-env-conflict-validation branch 2 times, most recently from d211923 to d62bb77 Compare March 5, 2026 17:00
@krishdef7
Copy link
Contributor Author

krishdef7 commented Mar 5, 2026

@jinchihe @kuizhiqing

Rebased onto latest master and resolved merge conflicts.
Updated MPI validation logic to align with the new NumProcPerNode type.

CI checks are passing. Let me know if any adjustments are needed.

@krishdef7 krishdef7 force-pushed the feature/framework-env-conflict-validation branch from 3887381 to d232599 Compare March 9, 2026 17:01
@krishdef7
Copy link
Contributor Author

@jinchihe @kuizhiqing, Rebased to fix DCO sign-off and removed an accidentally included commit from another branch. The PR now contains only the env conflict validation changes across 6 files. DCO is green. Happy for review when you get a chance

@krishdef7 krishdef7 force-pushed the feature/framework-env-conflict-validation branch 3 times, most recently from 98165ed to 9fff59c Compare March 12, 2026 09:37
@krishdef7
Copy link
Contributor Author

@andreyvelich @akshaychitneni, this PR implements the env conflict validation from the TODO you left in #2308. Cleaned up a dead NumProcPerNode validation block in this latest push. All CI passing, happy for review when you get a chance.

…s plugins

Add cross-plugin environment variable conflict detection to prevent
runtime issues when plugins share reserved env names:

- Add MPIRunReservedEnvNames constant for OpenMPI reserved env vars
- Add EnvVarsReserverPlugin interface for plugins to expose reserved env names
- Implement ReservedEnvVarNames() on MPI and Torch plugins
- Add user-env conflict validation to MPI plugin (mirrors existing Torch pattern)
- Add cross-plugin conflict detection in RunCustomValidationPlugins
- Add unit tests for MPI env validation

Closes kubeflow#3147

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
@krishdef7 krishdef7 force-pushed the feature/framework-env-conflict-validation branch from 9fff59c to b10a47c Compare March 14, 2026 05:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Framework-level validation for conflicting environment variables across runtime plugins

2 participants