
Conversation

@vsoch vsoch commented Oct 31, 2025

What this PR does / why we need it:

This KEP proposes adding an hpcPolicy to support Flux Framework and (in the future) other workload managers that provide more traditional HPC features. Using an HPC workload manager like Flux to bootstrap MPI will empower users to run MPI-based and other distributed workloads with advanced scheduling, topology awareness, and a more robust bootstrapping mechanism than traditional SSH-based methods. The proposal introduces a new, extensible hpcPolicy in the TrainJob API, allowing users to select and configure an HPC workload manager, with Flux being the first implementation.

There is a WIP implementation of the design discussed here.

Authors

Myself and @milroy

Which issue(s) this PR fixes: This will fix #2841.

Ping @andreyvelich @astefanutti

Checklist:

  • Docs included if any changes are user facing

This KEP proposes adding an hpcPolicy to support Flux
Framework and (in the future) other workload managers
that provide more traditional HPC features.

Signed-off-by: vsoch <[email protected]>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@andreyvelich andreyvelich left a comment

Thanks for driving this great feature @vsoch, and sorry for the delay, I got swamped with KubeCon. I left my initial thoughts.

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team

// Manager specifies the workload manager to use.
// +kubebuilder:default="flux"
// +optional
Manager string `json:"manager,omitempty"`
Member

What other managers can we introduce in the future?

Author

Practically speaking, we'd want managers that the community is actively using (and asks for). Likely Slurm would fit in here, and then others would come as requested. I am not the one to add Slurm, but someone who works on Slinky (from SchedMD) or SUNK (CoreWeave) could!

Author

For a listing of those that I've seen in the wild (aside from those mentioned already), there are PBS, Torque, OAR, LSF, Cobalt, SGE, and HTCondor. Others, please feel free to add to this list!


While this approach offers the benefits of strong typing and explicit, self-documenting fields, there are several risks to be mitigated:

1. **Tight Coupling and API Brittleness:** This design tightly couples the Kubeflow Trainer API to the specific implementation details of Flux Framework. If Flux itself introduces a new, important configuration option, it would require a formal change to the Kubeflow Trainer API. Exposing additional Flux options through a free-form string field mitigates this brittleness; requiring each flag or option to be a hardened field would make future development harder and require a new release for every addition. This creates a high maintenance burden and slows down the adoption of new features.
Member

> it would require a formal change to the Kubeflow Trainer API.

I guess this problem can be addressed if we introduce a settings field to the FluxPolicy, which allows dynamically adding more settings supported by Flux.

Additionally, I would like to understand how frequently settings are changed/modified in the Flux framework, and what changes are required in the TrainJob controller to support them?

Author

We tend to maintain a very hardened interface, so (ideally) there would not be many changes. My suggested implementation (the PR would come in after this) has a flexible map[string]string for Settings (I forget atm if that is what I called it) that would allow any manager to arbitrarily define and use settings without changing the interface here. There is just no way to define a common interface between managers beyond the name of the manager, and maybe a node count.
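For illustration, a minimal Go sketch of what that could look like; the type and field names below are assumptions for discussion, not the actual Trainer API:

```go
// Hypothetical sketch of a generic HPC policy; names are illustrative only.
type HPCPolicySource struct {
	// Manager selects the HPC workload manager backend (e.g. "flux").
	// +kubebuilder:default="flux"
	// +optional
	Manager string `json:"manager,omitempty"`

	// Settings holds manager-specific key/value options, so a backend can
	// add or consume new options without changing this API.
	// +optional
	Settings map[string]string `json:"settings,omitempty"`
}
```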

Author

> Additionally, I would like to understand how frequently settings are changed/modified in the Flux framework, and what changes are required in the TrainJob controller to support them?

For the Flux Operator, I have a strategy with two tiers:

  • Unlikely to change / commonly used: variables are added as part of the CRD definition (here, they would go under settings and would then require changes to the controller).
  • Could change / less commonly used: a catch-all "additional flags" type of variable.

I think we would likely do the same here, putting the commonly used (unlikely to change) variables in the settings. E.g., Flux is always going to have the basic flags for affinity, node topologies, and GPUs.

Member

> My suggested implementation (the PR would come in after this) has a flexible map[string]string for Settings

How are those settings applied to the underlying Job? Can we just use env variables for it?

> E.g., Flux is always going to have the basic flags for affinity, node topologies, and GPUs.

If Flux is always going to have such parameters, why don't we want to add them directly in the spec, like we did for MPI: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainingruntime_types.go#L243

Author

> If Flux is always going to have such parameters, why don't we want to add them directly in the spec, like we did for MPI

Because those are specific to Flux, and may not be shared by every HPC workload manager that would use the HPC policy. There is not a good way to standardize that beyond allowing the specific workload manager backend to look for what it requires in the map and raise an error if it is missing.
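As a rough sketch of that idea (assuming a generic Settings map; the helper below is hypothetical, not part of the Trainer codebase), each backend could validate the keys it requires:

```go
package flux

import (
	"fmt"
	"strings"
)

// requireSettings checks the generic settings map for the keys a given
// backend needs and returns an error naming any that are missing.
func requireSettings(settings map[string]string, keys ...string) error {
	var missing []string
	for _, key := range keys {
		if _, ok := settings[key]; !ok {
			missing = append(missing, key)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing required settings: %s", strings.Join(missing, ", "))
	}
	return nil
}
```

A Flux backend might, for example, call requireSettings(policy.Settings, "viewImage") and fail fast if the value is absent.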


1. **Tight Coupling and API Brittleness:** This design tightly couples the Kubeflow Trainer API to the specific implementation details of Flux Framework. If Flux itself introduces a new, important configuration option, it would require a formal change to the Kubeflow Trainer API. Exposing additional Flux options through a free-form string field mitigates this brittleness; requiring each flag or option to be a hardened field would make future development harder and require a new release for every addition. This creates a high maintenance burden and slows down the adoption of new features.

2. **Lack of Extensibility:** The primary risk is inflexibility. The `hpcPolicy` with a generic `manager` and `settings` map was chosen because it can support other HPC workload managers (like Slurm, LSF, PBS) in the future *without any changes to the API*. A hard-coded approach would necessitate adding a new `slurmPolicy`, `lsfPolicy`, etc., for each new backend, leading to significant API bloat and violating the Open/Closed Principle.
Member

Will supporting a new workload manager require us to update our extension framework with a new plugin in any case? If yes, we still need to make changes in the TrainJob controller.

Author

I don't think so? But we should talk about this more so I fully understand. It might not require any updates if we are using the same strategy to create objects, but there are plugins that wouldn't make sense to use with the Flux Operator; e.g., the mpi plugin wouldn't make sense. Can you give another example or examples to think about?

Member

@andreyvelich andreyvelich Nov 18, 2025

> Can you give another example or examples to think about?

IIUC, every manager might require custom orchestration to support that workload. For example, the Flux plugin adds an initContainer and a ConfigMap with the init scripts. The MPI plugin creates a Secret and a ConfigMap with the hostfile. LSF, Cobalt, or others might require different orchestration.

Which means it might be tricky to create a unified HPC plugin that covers all managers.

Author

Oh, to be clear, my thinking was that HPCPolicy == covering a set of managers (plugins). Does the mapping between policy and manager have to be 1:1? I was using the manager name as the determining variable for when a plugin should trigger. Technically, any/all of them can be called, but if the manager name doesn't match the plugin, we return early.
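A minimal sketch of that gating idea, with hypothetical types standing in for the real runtime framework interfaces:

```go
package hpc

// HPCPolicySource and FluxPlugin are illustrative stand-ins only.
type HPCPolicySource struct {
	Manager  string
	Settings map[string]string
}

type FluxPlugin struct{}

const fluxManager = "flux"

// Build acts only when the policy names this plugin's manager; otherwise it
// returns early so a different manager plugin can handle the TrainJob.
func (p *FluxPlugin) Build(policy *HPCPolicySource) error {
	if policy == nil || policy.Manager != fluxManager {
		return nil
	}
	// Flux-specific orchestration (initContainer, broker ConfigMap, entrypoint)
	// would be applied to the JobSet here.
	return nil
}
```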

Member

> Does the mapping between policy and manager have to be 1:1?

I think so; for every manager we should create a dedicated policy in the Extension Framework. Currently, the plugin is activated if its API is configured in the MLPolicy: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/mpi/mpi.go#L108-L109

I still don't see a lot of benefit in creating a unified hpcPolicy to cover multiple managers. To me, the API looks cleaner if we have a dedicated FluxPolicy with its configuration:

  mlPolicy:
    numNodes: 1
    flux:
      numProcPerNode: 5
      viewImage: ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy

We can also introduce a settings API to give users more flexibility in terms of Flux settings:

flux:
  settings or env:
    - name: FLUX_KEY
      value: 123

Thoughts @kubeflow/kubeflow-trainer-team @vsoch ?
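For comparison, a Go sketch of what such a dedicated FluxPolicySource could look like, loosely modeled on the existing MPIPolicySource; the field names are assumptions, not a settled design:

```go
package v1alpha1

import corev1 "k8s.io/api/core/v1"

// FluxPolicySource is an illustrative sketch of a dedicated Flux API.
type FluxPolicySource struct {
	// NumProcPerNode keeps naming consistent with the MPI and Torch policies.
	// +optional
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

	// ViewImage is the pre-built "flux-view" image used to bootstrap Flux.
	// +optional
	ViewImage *string `json:"viewImage,omitempty"`

	// Settings gives users flexibility for additional Flux options.
	// +optional
	Settings []corev1.EnvVar `json:"settings,omitempty"`
}
```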

1. **API Trigger**: The user enables the feature by defining an `hpcPolicy` in their `TrainJob` runtime specification and setting the `manager` to `"flux"`.
2. **Plugin Activation**: The Kubeflow Trainer controller invokes the `Flux` plugin's `Build` method.
3. **JobSet Modification**: The plugin modifies the `JobSet` specification before it is created:
* An **`initContainer`** is added to the "trainer" replicated job. This container uses a pre-built "flux-view" image containing a Spack installation of Flux.
Member

Do you require launcher + node rJobs for Flux, or can we have a single rJob like in torch-distributed: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/torch_distributed.yaml

Author

We do not require a launcher. It would be one definition, akin to torch distributed. The lead broker (that would be akin to the launcher) also acts as a worker.

Member

@vsoch Which entrypoint will we use to trigger the task? Will it still be mpirun?

Author

@vsoch vsoch Nov 21, 2025

mpirun is an analogous MPI launcher. Flux bootstraps MPI, so the entrypoint is flux run or flux submit (from inside the MiniCluster). You would use one or the other but not both.

Comment on lines 91 to 93
// Tasks is the number of tasks to provide to the workload manager.
// +optional
Tasks *int32 `json:"tasks,omitempty"`
Member

Why do we need to pre-define the number of Tasks?

Author

Depending on the HPC application you are running, you don't always want the tasks to equal the number of cores on the node. You might halve the number if you have a multi-threaded instance, and disable it. You might just define 1 task with --cores-per-task set to most of the node. There is no way to know what the user wants. At least for the Flux Operator, you can also set this to 0 to not define any tasks.

But I think I know what you are asking: we could move Tasks into the generic settings, because it might not be relevant for all workload managers. I think that it might be, but there is no reason to enforce that.
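As a small illustration of that option (key names are hypothetical), tasks could travel through the generic settings map instead of a typed field:

```go
package main

import "fmt"

func main() {
	// Hypothetical settings payload: a Flux backend reads "tasks" and
	// "cores-per-task" if present, instead of a dedicated Tasks API field.
	settings := map[string]string{
		"tasks":          "4",  // fewer tasks than cores, e.g. for a multi-threaded app
		"cores-per-task": "16", // one task owning most of the node
	}
	fmt.Println(settings)
}
```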

Member

Is it similar to the numProcPerNode parameter we use for the MPI Plugin: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainingruntime_types.go#L249?

e.g., this value is equal to the number of slots in the MPI context.

Author

Yes that's exactly it.

Author

I removed it since we already have that defined, and if needed there can be a value in Settings.

Member

I would prefer to use the numProcPerNode field if we go towards the FluxPolicy API, so we will be consistent with MPI and Torch.

name: lammps-flux-interactive
spec:
# Reference the pre-defined runtime by name
runtime: hpc-flux-runtime
Member

Author

I appreciate this! I was having a really hard time figuring out what was right.

Comment on lines 151 to 155
template:
  spec:
    containers:
      - name: node
        image: "ghcr.io/converged-computing/metric-lammps:latest"
Member

You can just use .spec.trainer.image to set an image.


2. **Lack of Extensibility:** The primary risk is inflexibility. The `hpcPolicy` with a generic `manager` and `settings` map was chosen because it can support other HPC workload managers (like Slurm, LSF, PBS) in the future *without any changes to the API*. A hard-coded approach would necessitate adding a new `slurmPolicy`, `lsfPolicy`, etc., for each new backend, leading to significant API bloat and violating the Open/Closed Principle.

3. **Challenges of a Common API:** The number of potential configuration options across different HPC resource managers is vast and highly heterogeneous. Attempting to create a "universal" `hpcPolicy` with hard-coded fields that abstracts concepts from Slurm, Flux, and LSF simultaneously would be a significant undertaking. Slurm's `sbatch` flags, Flux's broker options, and LSF's `bsub` commands do not map cleanly to a single, unified API structure. The `manager` and `settings` map approach gracefully sidesteps this problem by delegating manager-specific configuration to a flexible key-value store, which is the only practical way to support a diverse ecosystem of backends.
Member

I would imagine folks might be interested in integrating Slurm into KF Trainer via a dedicated runtime (e.g. SlurmRuntime): #2249, which is not JobSet-based.
But I am happy to discuss potential integrations.

cc @catblade

Author

Definitely! The Flux Operator is also not JobSet-based (it uses an Indexed Job), but the design is simple enough that we can plug it into a JobSet (and actually I had an early branch that did just that). The reason I haven't completely ported it is that I don't like how it lengthens the DNS names to account for the extra layers.

@coveralls

coveralls commented Nov 18, 2025

Pull Request Test Coverage Report for Build 19655755448

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.477%

Totals Coverage Status
Change from base Build 19655140149: 0.0%
Covered Lines: 1237
Relevant Lines: 2403

💛 - Coveralls

@andreyvelich
Member

/ok-to-test

Changed crd examples to reflect documentation
removed tasks from definition - can go in settings
removed mentions of minicluster out of context
specified train image instead of custom logic
added user stories

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the kep-2841-add-flux-hpc branch from 8354e2b to 70533d6 Compare November 25, 2025 02:04
@vsoch vsoch changed the title KEP 2841: HPC Policy to support Flux Framework feat: KEP 2841 HPC Policy to support Flux Framework Nov 25, 2025
@vsoch vsoch requested a review from andreyvelich November 25, 2025 03:02
@vsoch
Author

vsoch commented Nov 25, 2025

I think the error in CI is a flaky test? Note that I'm currently pushing for a more generic HPCPolicy that can support multiple plugin backends with a flexible Settings field. This means not using any hard-coded variables (akin to those in the current MPI plugin).

Development

Successfully merging this pull request may close these issues.

feat: Flux Framework as a plugin for HPC and MPI bootstrap
