
Conversation

@vsoch vsoch commented Oct 31, 2025

What this PR does / why we need it:

This KEP proposes adding an hpcPolicy to support Flux Framework and (in the future) other workload managers that provide more traditional HPC features. Using an HPC workload manager like Flux to bootstrap MPI will empower users to run MPI-based and other distributed workloads with advanced scheduling, topology awareness, and a more robust bootstrapping mechanism than traditional SSH-based methods. The proposal introduces a new, extensible hpcPolicy in the TrainJob API, allowing users to select and configure an HPC workload manager, with Flux being the first implementation.

There is a WIP implementation of the design discussed here.

Authors

Myself and @milroy

Which issue(s) this PR fixes: This will fix #2841.

Ping @andreyvelich @astefanutti

Checklist:

  • Docs included if any changes are user facing

This KEP proposes adding an hpcPolicy to support Flux
Framework and (in the future) other workload managers
that provide more traditional HPC features.

Signed-off-by: vsoch <[email protected]>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@andreyvelich andreyvelich left a comment

Thanks for driving this great feature @vsoch, and sorry for the delay, I got swamped with KubeCon. I left my initial thoughts.

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team

// Manager specifies the workload manager to use.
// +kubebuilder:default="flux"
// +optional
Manager string `json:"manager,omitempty"`
Member

What other managers can we introduce in the future?

Author

Practically speaking, we'd want managers that the community is actively using (and asks for). Likely Slurm would fit in here, and then others would come as requested. I am not the one to add Slurm, but someone who works on Slinky (from SchedMD) or SUNK (CoreWeave) could!

Author

For a listing of those that I've seen in the wild (aside from those mentioned already), there are PBS, Torque, OAR, LSF, Cobalt, SGE, and HTCondor. Others, please feel free to add to this list!


While this approach offers the benefits of strong typing and explicit, self-documenting fields, there are several risks to be mitigated:

1. **Tight Coupling and API Brittleness:** This design tightly couples the Kubeflow Trainer API to the specific implementation details of Flux Framework. If Flux itself introduces a new, important configuration option, it would require a formal change to the Kubeflow Trainer API. Exposing additional Flux options through a free-form string field mitigates this brittleness; requiring each flag or option to be a hardened field would make future development harder and require a new release for every addition. This creates a high maintenance burden and slows down the adoption of new features.
Member

> it would require a formal change to the Kubeflow Trainer API.

I guess this problem can be addressed if we introduce a settings field to the FluxPolicy, which allows dynamically adding more settings supported by Flux.

Additionally, I would like to understand how frequently settings are changed/modified in the Flux framework, and what changes are required in the TrainJob controller to support them?

Author

We tend to maintain a very hardened interface, so (ideally) there would not be many changes. My suggested implementation (the PR would come in after this) has a flexible map[string]string for Settings (I forget atm if that is what I called it) that would allow any manager to arbitrarily define and use settings without changing the interface here. There is just no way to define a common interface between managers beyond the name of the manager, and maybe a node count.
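For illustration, a minimal Go sketch of what that could look like; the type and field names below are assumptions for discussion, not the actual Trainer API:

```go
// Hypothetical sketch of a generic HPC policy; names are illustrative only.
type HPCPolicySource struct {
	// Manager selects the HPC workload manager backend (e.g. "flux").
	// +kubebuilder:default="flux"
	// +optional
	Manager string `json:"manager,omitempty"`

	// Settings holds manager-specific key/value options, so a backend can
	// add or consume new options without changing this API.
	// +optional
	Settings map[string]string `json:"settings,omitempty"`
}
```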

Author

> Additionally, I would like to understand how frequently settings are changed/modified in the Flux framework, and what changes are required in the TrainJob controller to support them?

For the Flux Operator, I have a strategy with two tiers:

  • Unlikely to change / commonly used: variables are added as part of the CRD definition (here, they would go under settings and would then require changes to the controller).
  • Could change / less commonly used: a catch-all "additional flags" type of variable.

I think we would likely do the same here, putting the commonly used (unlikely to change) variables in the settings. E.g., Flux is always going to have the basic flags for affinity, node topologies, and GPUs.

Member

> My suggested implementation (the PR would come in after this) has a flexible map[string]string for Settings

How are those settings applied to the underlying Job? Can we just use env variables for it?

> E.g., Flux is always going to have the basic flags for affinity, node topologies, and GPUs.

If Flux is always going to have such parameters, why don't we want to add them directly in the spec, like we did for MPI: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainingruntime_types.go#L243

Author

> If Flux is always going to have such parameters, why don't we want to add them directly in the spec, like we did for MPI

Because those are specific to Flux, and may not be shared by every HPC workload manager that would use the HPC policy. There is not a good way to standardize that beyond allowing the specific workload manager backend to look for what it requires in the map and raise an error if it is missing.
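As a rough sketch of that idea (assuming a generic Settings map; the helper below is hypothetical, not part of the Trainer codebase), each backend could validate the keys it requires:

```go
package flux

import (
	"fmt"
	"strings"
)

// requireSettings checks the generic settings map for the keys a given
// backend needs and returns an error naming any that are missing.
func requireSettings(settings map[string]string, keys ...string) error {
	var missing []string
	for _, key := range keys {
		if _, ok := settings[key]; !ok {
			missing = append(missing, key)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing required settings: %s", strings.Join(missing, ", "))
	}
	return nil
}
```

A Flux backend might, for example, call requireSettings(policy.Settings, "viewImage") and fail fast if the value is absent.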


1. **Tight Coupling and API Brittleness:** This design tightly couples the Kubeflow Trainer API to the specific implementation details of Flux Framework. If Flux itself introduces a new, important configuration option, it would require a formal change to the Kubeflow Trainer API. Exposing additional Flux options through a free-form string field mitigates this brittleness; requiring each flag or option to be a hardened field would make future development harder and require a new release for every addition. This creates a high maintenance burden and slows down the adoption of new features.

2. **Lack of Extensibility:** The primary risk is inflexibility. The `hpcPolicy` with a generic `manager` and `settings` map was chosen because it can support other HPC workload managers (like Slurm, LSF, PBS) in the future *without any changes to the API*. A hard-coded approach would necessitate adding a new `slurmPolicy`, `lsfPolicy`, etc., for each new backend, leading to significant API bloat and violating the Open/Closed Principle.
Member

Will supporting a new workload manager require us to update our extension framework with a new plugin in any case? If yes, we still need to make changes in the TrainJob controller.

Author

I don't think so? But we should talk about this more so I fully understand. It might not require any updates if we are using the same strategy to create objects, but there are plugins that wouldn't make sense to use with the Flux Operator; e.g., the mpi plugin wouldn't make sense. Can you give another example or examples to think about?

Member

@andreyvelich andreyvelich Nov 18, 2025

> Can you give another example or examples to think about?

IIUC, every manager might require custom orchestration to support that workload. For example, the Flux plugin adds an initContainer and a ConfigMap with the init scripts. The MPI plugin creates a Secret and a ConfigMap with the hostfile. LSF, Cobalt, or others might require different orchestration.

Which means it might be tricky to create a unified HPC plugin that covers all managers.

Author

Oh, to be clear, my thinking was that HPCPolicy == covering a set of managers (plugins). Does the mapping between policy and manager have to be 1:1? I was using the manager name as the determining variable for when a plugin should trigger. Technically, any/all of them can be called, but if the manager name doesn't match the plugin, we return early.
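A minimal sketch of that gating idea, with hypothetical types standing in for the real runtime framework interfaces:

```go
package hpc

// HPCPolicySource and FluxPlugin are illustrative stand-ins only.
type HPCPolicySource struct {
	Manager  string
	Settings map[string]string
}

type FluxPlugin struct{}

const fluxManager = "flux"

// Build acts only when the policy names this plugin's manager; otherwise it
// returns early so a different manager plugin can handle the TrainJob.
func (p *FluxPlugin) Build(policy *HPCPolicySource) error {
	if policy == nil || policy.Manager != fluxManager {
		return nil
	}
	// Flux-specific orchestration (initContainer, broker ConfigMap, entrypoint)
	// would be applied to the JobSet here.
	return nil
}
```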

Member

> Does the mapping between policy and manager have to be 1:1?

I think so; for every manager we should create a dedicated policy in the Extension Framework. Currently, the plugin is activated if its API is configured in the MLPolicy: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/mpi/mpi.go#L108-L109

I still don't see a lot of benefit in creating a unified hpcPolicy to cover multiple managers. To me, the API looks cleaner if we have a dedicated FluxPolicy with its configuration:

  mlPolicy:
    numNodes: 1
    flux:
      numProcPerNode: 5
      viewImage: ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy

We can also introduce a settings API to give users more flexibility in terms of Flux settings:

flux:
  settings or env:
    - name: FLUX_KEY
      value: 123

Thoughts @kubeflow/kubeflow-trainer-team @vsoch ?
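For comparison, a Go sketch of what such a dedicated FluxPolicySource could look like, loosely modeled on the existing MPIPolicySource; the field names are assumptions, not a settled design:

```go
package v1alpha1

import corev1 "k8s.io/api/core/v1"

// FluxPolicySource is an illustrative sketch of a dedicated Flux API.
type FluxPolicySource struct {
	// NumProcPerNode keeps naming consistent with the MPI and Torch policies.
	// +optional
	NumProcPerNode *int32 `json:"numProcPerNode,omitempty"`

	// ViewImage is the pre-built "flux-view" image used to bootstrap Flux.
	// +optional
	ViewImage *string `json:"viewImage,omitempty"`

	// Settings gives users flexibility for additional Flux options.
	// +optional
	Settings []corev1.EnvVar `json:"settings,omitempty"`
}
```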

1. **API Trigger**: The user enables the feature by defining an `hpcPolicy` in their `TrainJob` runtime specification and setting the `manager` to `"flux"`.
2. **Plugin Activation**: The Kubeflow Trainer controller invokes the `Flux` plugin's `Build` method.
3. **JobSet Modification**: The plugin modifies the `JobSet` specification before it is created:
* An **`initContainer`** is added to the "trainer" replicated job. This container uses a pre-built "flux-view" image containing a Spack installation of Flux.
Member

Do you require launcher + node rJobs for Flux, or can we have a single rJob like in torch-distributed: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/torch_distributed.yaml

Author

We do not require a launcher. It would be one definition, akin to torch distributed. The lead broker (that would be akin to the launcher) also acts as a worker.

Member

@vsoch Which entrypoint will we use to trigger the task? Will it still be mpirun?

Author

@vsoch vsoch Nov 21, 2025

mpirun is an analogous MPI launcher. Flux bootstraps MPI, so the entrypoint is flux run or flux submit (from inside the MiniCluster). You would use one or the other but not both.

Comment on lines 91 to 93
// Tasks is the number of tasks to provide to the workload manager.
// +optional
Tasks *int32 `json:"tasks,omitempty"`
Member

Why do we need to pre-define the number of Tasks?

Author

Depending on the HPC application you are running, you don't always want the tasks to equal the number of cores on the node. You might halve the number if you have a multi-threaded instance, and disable it. You might just define 1 task with --cores-per-task set to most of the node. There is no way to know what the user wants. At least for the Flux Operator, you can also set this to 0 to not define any tasks.

But I think I know what you are asking: we could move Tasks into the generic settings, because it might not be relevant for all workload managers. I think that it might be, but there is no reason to enforce that.
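As a small illustration of that option (key names are hypothetical), tasks could travel through the generic settings map instead of a typed field:

```go
package main

import "fmt"

func main() {
	// Hypothetical settings payload: a Flux backend reads "tasks" and
	// "cores-per-task" if present, instead of a dedicated Tasks API field.
	settings := map[string]string{
		"tasks":          "4",  // fewer tasks than cores, e.g. for a multi-threaded app
		"cores-per-task": "16", // one task owning most of the node
	}
	fmt.Println(settings)
}
```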

Member

Is it similar to the numProcPerNode parameter we use for the MPI Plugin: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainingruntime_types.go#L249?

e.g., this value is equal to the number of slots in the MPI context.

Author

Yes that's exactly it.

Author

I removed it since we already have that defined, and if needed there can be a value in Settings.

Member

I would prefer to use the numProcPerNode field if we go towards the FluxPolicy API, so we will be consistent with MPI and Torch.

name: lammps-flux-interactive
spec:
# Reference the pre-defined runtime by name
runtime: hpc-flux-runtime
Member

Author

I appreciate this! I was having a really hard time figuring out what was right.

Comment on lines 151 to 155
template:
  spec:
    containers:
      - name: node
        image: "ghcr.io/converged-computing/metric-lammps:latest"
Member

You can just use .spec.trainer.image to set an image.


2. **Lack of Extensibility:** The primary risk is inflexibility. The `hpcPolicy` with a generic `manager` and `settings` map was chosen because it can support other HPC workload managers (like Slurm, LSF, PBS) in the future *without any changes to the API*. A hard-coded approach would necessitate adding a new `slurmPolicy`, `lsfPolicy`, etc., for each new backend, leading to significant API bloat and violating the Open/Closed Principle.

3. **Challenges of a Common API:** The number of potential configuration options across different HPC resource managers is vast and highly heterogeneous. Attempting to create a "universal" `hpcPolicy` with hard-coded fields that abstracts concepts from Slurm, Flux, and LSF simultaneously would be a significant undertaking. Slurm's `sbatch` flags, Flux's broker options, and LSF's `bsub` commands do not map cleanly to a single, unified API structure. The `manager` and `settings` map approach gracefully sidesteps this problem by delegating manager-specific configuration to a flexible key-value store, which is the only practical way to support a diverse ecosystem of backends.
Member

I would imagine folks might be interested in integrating Slurm into KF Trainer via a dedicated runtime (e.g. SlurmRuntime): #2249, which is not JobSet-based.
But I am happy to discuss potential integrations.

cc @catblade

Author

Definitely! The Flux Operator is also not JobSet-based (it uses an Indexed Job), but the design is simple enough that we can plug it into a JobSet (and actually I had an early branch that did just that). The reason I haven't completely ported it is that I don't like how it lengthens the DNS names to account for the extra layers.

@coveralls

coveralls commented Nov 18, 2025

Pull Request Test Coverage Report for Build 19655755448

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.477%

Totals Coverage Status
Change from base Build 19655140149: 0.0%
Covered Lines: 1237
Relevant Lines: 2403

💛 - Coveralls

@andreyvelich
Member

/ok-to-test

Changed crd examples to reflect documentation
removed tasks from definition - can go in settings
removed mentions of minicluster out of context
specified train image instead of custom logic
added user stories

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the kep-2841-add-flux-hpc branch from 8354e2b to 70533d6 Compare November 25, 2025 02:04
@vsoch vsoch changed the title KEP 2841: HPC Policy to support Flux Framework feat: KEP 2841 HPC Policy to support Flux Framework Nov 25, 2025
@vsoch vsoch requested a review from andreyvelich November 25, 2025 03:02
@vsoch
Author

vsoch commented Nov 25, 2025

I think the error in CI is a flaky test? Note that I'm currently pushing for a more generic HPCPolicy that can support multiple plugin backends with a flexible Settings field. This means not using any hard-coded variables (akin to those in the current MPI plugin).

Development

Successfully merging this pull request may close these issues.

feat: Flux Framework as a plugin for HPC and MPI bootstrap
