What you would like to be added?
Flux Framework is a graph-based workload manager that can handle advanced scheduling but, in its simplest form, can bootstrap MPI (any flavor) using ZeroMQ. This is more robust than standard SSH-based methods and does not require the launcher and worker node abstraction used by the MPI Operator. I am creating this issue to discuss two early ideas for integration into the Kubeflow Trainer:
- Flux as an MPI bootstrap mechanism
- Flux as a job execution backend (akin to JobSet)
I will discuss both below, and I'd like to follow up with pull requests for whichever implementation we agree on. Thank you in advance for feedback!
Flux as an MPI Bootstrap Mechanism
A logical addition to the Kubeflow Trainer would be support for a Flux-based MPI plugin, akin to the current MPI plugin, which appears to target contexts that require a bootstrap. The current plugin seems to require that the container come prepared with an MPI user and related setup; ideally, a Flux implementation would impose no hard requirement beyond an MPI-enabled application.
It looks like the current MPI plugin is primarily concerned with reusing existing secrets from the cluster and generating the hostfile for MPI. What we would want here is the minimal work to ensure that the Flux components are installed (note this is possible with conda/mamba and similar tools) and then configured. I think it can be done with a few extra install commands and a wrapper around the entrypoint (see the suggested PRs below and the sketch that follows).
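To make that concrete, here is a minimal sketch (in Go, since the Trainer plugins are written in Go) of the kind of command rewriting I have in mind. The conda-forge install and the bare `flux start` invocation are illustrative assumptions: a real multi-node bootstrap also needs a curve certificate and a broker configuration pointing at the peer hostnames, which I've elided here.

```go
package main

import (
	"fmt"
	"strings"
)

// wrapWithFlux sketches what a Flux bootstrap plugin could inject into a
// trainer container: install flux-core (here via mamba, one of several
// options), then relaunch the original entrypoint under a flux broker.
// Every command below is illustrative, not final.
func wrapWithFlux(original []string) []string {
	script := strings.Join([]string{
		// flux-core is packaged on conda-forge, so no custom base image is required.
		"mamba install -y -c conda-forge flux-core",
		// Run the user's command under a (single-node, for illustration) flux instance.
		"flux start " + strings.Join(original, " "),
	}, " && ")
	return []string{"/bin/sh", "-c", script}
}

func main() {
	// For example, a TrainJob container command rewritten by the plugin:
	fmt.Println(wrapWithFlux([]string{"python", "train.py"}))
}
```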
Flux as a Job Management Plugin
Akin to the JobSet integration here, we could easily create a Flux MiniCluster custom resource that wraps a user container and handles running an MPI application. If I'm reading the code correctly, the train job is mapped onto a JobSet via what is called a JobSet Builder, and we could do the same with a Flux MiniCluster. At a high level, the Flux MiniCluster enables training with an MPI bootstrap, a sophisticated job queue (if needed), and fine-grained topology mapping and scheduling for jobs that extend into HPC use cases.
This is likely the lowest-hanging fruit (read: easiest) in that it would plug right into most MPI application use cases, and we've done it before in different contexts. Instead of a JobSet, the plugin would create a Flux MiniCluster, which is essentially an Indexed Job plus a few ConfigMaps and an init container that together enable running an MPI application with Flux (a builder sketch follows). There are other scheduling features this enables, but I consider that extended discussion out of scope for this issue.
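As a rough illustration of the builder idea, a hypothetical MiniCluster builder might look like the sketch below. The field names follow my understanding of the Flux Operator's MiniCluster CRD, and the resource name and image are placeholders; treat the exact schema as an assumption to verify against the operator version in use.

```go
package main

import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// buildMiniCluster sketches the "MiniCluster Builder" idea: translate the
// essentials of a train job (image, command, replica count) into a Flux
// Operator MiniCluster instead of a JobSet.
func buildMiniCluster(name, namespace, image, command string, size int64) *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "flux-framework.org/v1alpha2",
		"kind":       "MiniCluster",
		"metadata": map[string]interface{}{
			"name":      name,
			"namespace": namespace,
		},
		"spec": map[string]interface{}{
			// size maps to the parallelism of the Indexed Job that the
			// Flux Operator creates under the hood.
			"size": size,
			"containers": []interface{}{
				map[string]interface{}{
					"image":   image,   // placeholder user image
					"command": command, // run under flux across all pods
				},
			},
		},
	}}
}

func main() {
	mc := buildMiniCluster("train-demo", "default", "ghcr.io/example/trainer:latest", "python train.py", 4)
	out, _ := json.MarshalIndent(mc.Object, "", "  ")
	fmt.Println(string(out))
}
```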
Questions I have
- Does Kubeflow have use cases that extend into the HPC space? It seems logical to start with an ML workload that can use MPI, but we'd be interested in the bucket of AI/ML-adjacent applications (think simulation that feeds into a model) that aren't well covered here. On our side, we have a workload called MuMMI with an ML job step that generates data from an initial model for subsequent simulation.
- Can you point me to a good JobSet example? I would first want to reproduce a simple, working JobSet example that I can then modify to run with Flux. Ideally it should not require a GPU, which is harder for me to access.
- What are the use cases for coscheduling? I'm looking at the plugins here and noticed there is one for coscheduling! From reading it, it looks like it uses this API to create a PodGroup for a train job (see the sketch after this list), and I suspect the same logic from the original scheduler-plugins repository then ensures the pods are scheduled as a group (sorted by name and namespace). I'm wondering whether a similar plugin that uses the Flux scheduler would be amenable to trying here. What are the hard cases for scheduling Trainer jobs: just getting them running as a group? Topology? Something else?
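For reference, here is a rough sketch of the PodGroup object I believe the coscheduling plugin creates, based on my reading of the scheduler-plugins API; the minMember gang size is the key field, and all names here are illustrative. A Flux-based plugin could presumably emit an analogous group object for the Flux scheduler.

```go
package main

import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// podGroupFor sketches a scheduler-plugins PodGroup sized to a train job's
// pod count, so the scheduler only admits the pods once all of them can run.
func podGroupFor(trainJobName, namespace string, numPods int64) *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "scheduling.x-k8s.io/v1alpha1",
		"kind":       "PodGroup",
		"metadata": map[string]interface{}{
			"name":      trainJobName,
			"namespace": namespace,
		},
		"spec": map[string]interface{}{
			// minMember is the gang size: schedule all pods or none.
			"minMember": numPods,
		},
	}}
}

func main() {
	pg := podGroupFor("train-demo", "default", 4)
	out, _ := json.MarshalIndent(pg.Object, "", "  ")
	fmt.Println(string(out))
}
```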
Suggested follow-up PRs
- Add the Flux Operator as a plugin. This would mean optionally adding it to the components installed here and/or implementing a plugin akin to the JobSet one.
- Add Flux as a bootstrap mechanism. This would be similar in spirit to the Flux Operator, but an attempt to do the entire bootstrap without adding an operator. @andreyvelich, this is something we talked about last year (or the year before?), and I implemented it via a basic library that does the entire install and configuration of the bootstrap in a few lines; that can be done with a JobSet or Job abstraction. I would want to discuss how best to integrate this into the framework as a plugin.
I'd be interested to gauge interest in the above ideas, as well as other ideas for Flux integration. The first is more straightforward for me to implement; for the second, I'd be curious about the best way to handle what comes down to taking an initial trainer job, adding extra commands to install a few dependencies, and then wrapping the execution command.
Pinging @milroy, who will likely work with me on these, to comment.
Why is this needed?
The Kubeflow Trainer ecosystem should have better (more flexible, more feature-rich) support for AI/ML workloads that have an MPI component and, more generally, for HPC-oriented use cases.
Love this feature?
Give it a 👍 We prioritize the features with most 👍