feat: KEP 2841 HPC Policy to support Flux Framework #2909
Conversation
This KEP proposes adding an hpcPolicy to support Flux Framework and (in the future) other workload managers that provide more traditional HPC features.
Signed-off-by: vsoch <[email protected]>
andreyvelich
left a comment
Thanks for driving this great feature @vsoch, and sorry for the delay; I got swamped with KubeCon. I left my initial thoughts.
cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team
```go
// Manager specifies the workload manager to use.
// +kubebuilder:default="flux"
// +optional
Manager string `json:"manager,omitempty"`
```
What other managers can we introduce in the future?
Practically speaking, we'd want managers that the community is actively using (and asks for). Likely Slurm would fit in here, and then others would come as requested. I am not the one to add Slurm, but someone who works on Slinky (the company SchedMD) or SUNK (CoreWeave) could!
For a listing of those that I've seen in the wild (aside from those mentioned already), there are PBS, Torque, OAR, LSF, Cobalt, SGE, and HTCondor. Others, please feel free to add to this list!
> While this approach offers the benefits of strong typing and explicit, self-documenting fields, there are several risks to be mitigated:
> 1. **Tight Coupling and API Brittleness:** This design tightly couples the Kubeflow Trainer API to the specific implementation details of Flux Framework. If Flux itself introduces a new, important configuration option, it would require a formal change to the Kubeflow Trainer API. E.g., if we make additional options to Flux a string field, that serves as a mitigation for API brittleness. Requiring each flag or option to be a hardened field would make future development harder and require a new release. This creates a high maintenance burden and slows down the adoption of new features.
> it would require a formal change to the Kubeflow Trainer API.
I guess this problem can be addressed if we introduce a settings field to the FluxPolicy, which allows dynamically adding more settings supported by Flux.
Additionally, I would like to understand how frequently settings are changed/modified in the Flux framework, and what changes are required in the TrainJob controller to support them?
We tend to maintain a very hardened interface, so (ideally) there would not be many changes. My suggested implementation (the PR would come in after this) has a flexible map[string]string for Settings (I forget at the moment if that is what I called it) that would allow any manager to arbitrarily define and use settings without changing the interface here. There is just no way to define a common interface between managers beyond the name of the manager, and maybe a node count.
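For illustration, a minimal sketch of how such a policy could look in a runtime spec, assuming the proposed `manager` and `settings` field names; the individual setting keys below are purely hypothetical:

```yaml
hpcPolicy:
  manager: flux
  settings:                # free-form key/value map, interpreted by the flux plugin
    tasks: "4"             # hypothetical key: number of tasks
    cores-per-task: "2"    # hypothetical key: cores assigned to each task
```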
> Additionally, I would like to understand how frequently settings are changed/modified in the Flux framework, and what changes are required in the TrainJob controller to support them?
For the Flux Operator, I have a strategy with two tiers:
- Unlikely to change / commonly used: variables are added as part of the CRD definition (here they would go under settings, and then require changes to the controller).
- Could change / less commonly used: a catch-all "additional flags" type of variable.
I think we would likely do the same here, putting the commonly used (unlikely to change) variables in the settings. E.g., Flux is always going to have the basic flags for affinity, node topologies, and GPUs.
> My suggested implementation (the PR would come in after this) has a flexible map[string]string for Settings

How are those settings applied to the underlying Job? Can we just use env variables for it?

> E.g., Flux is always going to have the basic flags for affinity, node topologies, and GPUs.

If Flux is always going to have such parameters, why don't we add them directly in the spec, like we did for MPI: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainingruntime_types.go#L243
> If Flux is always going to have such parameters, why don't we add them directly in the spec, like we did for MPI

Because those are specific to Flux, and may not be shared by every HPC workload manager that would use the HPC policy. There is not a good way to standardize that beyond allowing the specific workload manager backend to look for what it requires in the map and raise an error if it is missing.
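To make that concrete, a hypothetical sketch of how two managers could share the same API shape while each backend interprets its own keys (the key names and values below are illustrative, not part of the proposal):

```yaml
# The flux backend reads Flux-specific keys...
hpcPolicy:
  manager: flux
  settings:
    broker-options: "-Stbon.fanout=256"   # illustrative key/value
---
# ...while a future slurm backend would read its own keys,
# with no change to the hpcPolicy API itself.
hpcPolicy:
  manager: slurm
  settings:
    partition: "batch"                    # illustrative key/value
```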
> 2. **Lack of Extensibility:** The primary risk is inflexibility. The `hpcPolicy` with a generic `manager` and `settings` map was chosen because it can support other HPC workload managers (like Slurm, LSF, PBS) in the future *without any changes to the API*. A hard-coded approach would necessitate adding a new `slurmPolicy`, `lsfPolicy`, etc., for each new backend, leading to significant API bloat and violating the Open/Closed Principle.
Will supporting a new workload manager require us to update our extension framework with a new plugin in any case? If yes, we still need to make changes in the TrainJob controller.
I don't think so? But we should talk about this more so I fully understand. It might not require any updates if we are using the same strategy to create objects, but there are things (plugins) to use with the Flux Operator that wouldn't make sense; e.g., the MPI plugin wouldn't make sense. Can you give another example or examples to think about?
> Can you give another example or examples to think about?
IIUC, every manager might require custom orchestration to support that workload. For example, the Flux plugin adds an initContainer and a ConfigMap with the init scripts. The MPI plugin creates a Secret and a ConfigMap with the hostfile. LSF, Cobalt, or others might require different orchestration.
Which means it might be tricky to create a unified HPC plugin that covers all managers.
Oh, to be clear, my thinking was that HPCPolicy == covering a set of managers (plugins). Does the mapping between policy and manager have to be 1:1? I was using the manager name as the determining variable for when a plugin should trigger. Technically, any/all of them can be called, but then if the manager name doesn't match the plugin we return early.
> Does the mapping between policy and manager have to be 1:1?

I think so; for every manager we should create a dedicated policy in the Extension Framework. Currently, a plugin is activated if its API is configured in the MLPolicy: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/framework/plugins/mpi/mpi.go#L108-L109
I still don't see a lot of benefit in creating a unified hpcPolicy to cover multiple managers. For me, the API looks cleaner if we have a dedicated FluxPolicy with its configuration:

```yaml
mlPolicy:
  numNodes: 1
  flux:
    numProcPerNode: 5
    viewImage: "ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy"
```

We can also introduce a settings API to give users more flexibility in terms of Flux settings:

```yaml
flux:
  settings: # or env
    - name: FLUX_KEY
      value: "123"
```

Thoughts @kubeflow/kubeflow-trainer-team @vsoch ?
> 1. **API Trigger**: The user enables the feature by defining an `hpcPolicy` in their `TrainJob` runtime specification and setting the `manager` to `"flux"`.
> 2. **Plugin Activation**: The Kubeflow Trainer controller invokes the `Flux` plugin's `Build` method.
> 3. **JobSet Modification**: The plugin modifies the `JobSet` specification before it is created:
>    * An **`initContainer`** is added to the "trainer" replicated job. This container uses a pre-built "flux-view" image containing a Spack installation of Flux.
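A rough sketch of the kind of JobSet patch described above, assuming the flux-view image mentioned elsewhere in this thread; the container, volume, and path names are illustrative only:

```yaml
spec:
  replicatedJobs:
    - name: trainer
      template:
        spec:
          template:
            spec:
              initContainers:
                - name: flux-view              # illustrative name
                  image: ghcr.io/converged-computing/flux-view-ubuntu:tag-jammy
                  # copy the Spack-built Flux install into a shared volume (illustrative)
                  command: ["cp", "-R", "/opt/view", "/mnt/flux/view"]
                  volumeMounts:
                    - name: flux-install
                      mountPath: /mnt/flux
              volumes:
                - name: flux-install
                  emptyDir: {}
```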
Do you require launcher + node rJobs for Flux, or can we have a single rJob like in torch-distributed: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/torch_distributed.yaml
We do not require a launcher. It would be one definition, akin to torch distributed. The lead broker (that would be akin to the launcher) also acts as a worker.
@vsoch Which entrypoint will we use to trigger the task? Will it still be mpirun?
mpirun is an analogous MPI launcher. Flux bootstraps MPI, so the entrypoint is flux run or flux submit (from inside the MiniCluster). You would use one or the other but not both.
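As a hedged illustration only (the command and flags below are examples, not from the KEP), the trainer entrypoint could wrap the user command with flux run (blocking, streams output) or flux submit (returns immediately with a job id), assuming the TrainJob trainer command field is used:

```yaml
trainer:
  command:
    - flux
    - run               # or "submit" for a non-blocking submission
    - --nodes=4         # illustrative
    - --ntasks=8        # illustrative
    - lmp               # the user's application, e.g. LAMMPS
    - -in
    - in.snap.test      # illustrative input file
```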
```go
// Tasks is the number of tasks to provide to the workload manager.
// +optional
Tasks *int32 `json:"tasks,omitempty"`
```
Why do we need to pre-define the number of Tasks?
Depending on the HPC application you are running, you don't always want the number of tasks to equal the number of cores on the node. You might halve the number if you have a multi-threaded instance and disable it. You might just define 1 task with --cores-per-task set to most of the node. There is no way to know what the user wants. At least for the Flux Operator, you can also set this to 0 to not define any tasks.
But I think I know what you are asking: we could move Tasks into the generic settings, because it might not be relevant for all workload managers. I think it might be, but there is no reason to enforce that.
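For example (a sketch with illustrative values, assuming task geometry lives in the generic settings map), the same node could be expressed either way:

```yaml
# Many single-core tasks:
settings:
  tasks: "64"
---
# One multi-threaded task that owns most of the node:
settings:
  tasks: "1"
  cores-per-task: "62"
```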
Is it similar to the numProcPerNode parameter we use for the MPI Plugin: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainingruntime_types.go#L249?
E.g., this value is equal to the number of slots in the MPI context.
Yes that's exactly it.
I removed it since we already have that defined, and if needed there can be a value in Settings.
I would prefer to use the numProcPerNode field if we go towards the FluxPolicy API, so we stay consistent with MPI and Torch.
```yaml
name: lammps-flux-interactive
spec:
  # Reference the pre-defined runtime by name
  runtime: hpc-flux-runtime
```
This should have the correct spec; check the example here: https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime/#example-of-clustertrainingruntime
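For reference, a minimal sketch of the shape the corrected example might take, assuming the current TrainJob API where runtimes are referenced via runtimeRef (values are illustrative):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: lammps-flux-interactive
spec:
  runtimeRef:
    name: hpc-flux-runtime   # reference the pre-defined runtime by name
  trainer:
    image: ghcr.io/converged-computing/metric-lammps:latest
    numNodes: 4              # illustrative
```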
I appreciate this! I was having a really hard time figuring out what was right.
```yaml
template:
  spec:
    containers:
      - name: node
        image: "ghcr.io/converged-computing/metric-lammps:latest"
```
You can just use .spec.trainer.image to set an image.
> 3. **Challenges of a Common API:** The number of potential configuration options across different HPC resource managers is vast and highly heterogeneous. Attempting to create a "universal" `hpcPolicy` with hard-coded fields that abstracts concepts from Slurm, Flux, and LSF simultaneously would be a significant undertaking. Slurm's `sbatch` flags, Flux's broker options, and LSF's `bsub` commands do not map cleanly to a single, unified API structure. The `manager` and `settings` map approach gracefully sidesteps this problem by delegating manager-specific configuration to a flexible key-value store, which is the only practical way to support a diverse ecosystem of backends.
Definitely! The Flux Operator is also not JobSet-based (it uses an Indexed Job), but the design is simple enough that we can plug it into a JobSet (and actually I had an early branch that did just that). The reason I haven't completely ported it is that I don't like how it lengthens the DNS names to account for the extra layers.
/ok-to-test
Changed CRD examples to reflect documentation
Removed tasks from definition - can go in settings
Removed mentions of MiniCluster out of context
Specified train image instead of custom logic
Added user stories
Signed-off-by: vsoch <[email protected]>
8354e2b to 70533d6 (force-push)
I think the error in CI is a flaky test? Note that I'm currently pushing for a more generic HPCPolicy that can support multiple plugin backends with a flexible Settings field. This means not using any hard-coded variables (akin to the current MPI plugin).
What this PR does / why we need it:
This KEP proposes adding an hpcPolicy to support Flux Framework and (in the future) other workload managers that provide more traditional HPC features. Using an HPC workload manager like Flux to bootstrap MPI will empower users to run MPI-based and other distributed workloads with advanced scheduling, topology awareness, and a more robust bootstrapping mechanism than traditional SSH-based methods. The proposal introduces a new, extensible `hpcPolicy` in the `TrainJob` API, allowing users to select and configure an HPC workload manager, with Flux being the first implementation. There is a WIP implementation for the design discussed.
Authors
Myself and @milroy
Which issue(s) this PR fixes: This will fix #2841.
Ping @andreyvelich @astefanutti
Checklist: