feat(docs): Kubeflow Trainer ROADMAP 2026 #3242
andreyvelich wants to merge 3 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has not yet been approved; the full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
This PR adds a comprehensive 2026 roadmap for Kubeflow Trainer, organizing planned features and enhancements into logical topic areas. The roadmap builds upon the 2025 roadmap already present in the file, providing a clear vision for the project's future direction.
Changes:
- Added a new "2026" section to ROADMAP.md with organized categories of planned features
- Included links to tracking issues for most roadmap items
- Grouped items by themes: distributed scheduling, MPI/HPC, observability, data cache, LLM fine-tuning, new runtimes, UI, and integrations
> - Enhanced Multi-Node NVLink Support
> - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
>   multi-cluster job dispatching, topology-aware scheduling, and other features.
> - MPI and HPC on Kubernetes
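On the Kueue item: Kueue's documented entry point for TrainJobs is a queue-name label on a suspended job, which Kueue then admits against a LocalQueue. A minimal sketch, where the queue name (`team-a-queue`) and runtime name are illustrative placeholders rather than values from this PR:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-dist
  labels:
    # Kueue admits the job against this LocalQueue (placeholder name).
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  runtimeRef:
    name: torch-distributed   # illustrative ClusterTrainingRuntime
  trainer:
    numNodes: 2
```

Multi-cluster dispatching and topology-aware scheduling would layer on top of this basic admission flow.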
This looks great! Flux supports Intel MPI and PMIx.
Most of the issues described in #2751 are not relevant to Flux.
> - Distributed Data Cache
>   - Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
>   - Integration with OptimizationJob
>   - Explore RDMA with AI Schedulers and Data Cache
RDMA will work nicely with MPI, although I suspect you are thinking of some of the Google products for GPU.
We've been chatting with @EkinKarabulut and @akshaychitneni about how we can leverage the Data Cache feature for RDMA. With the appropriate topology placement, data can be transferred directly to GPU nodes using zero-copy.
We believe that this could be highly beneficial for advanced AI data centres using GPUs like the GB200.
We can also explore how our MPI support might be helpful in this context.
> - Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
>   frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
> - Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
> - Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
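The plugin-registration item (issue #2750) implies a registry mapping framework names to plugin implementations. A hypothetical Go sketch of that shape — `FrameworkPlugin`, `Registry`, and the stub `torchPlugin` are illustrative assumptions, not the actual Trainer API:

```go
// Hypothetical sketch of a Pipeline Framework plugin registry
// (issue #2750). All names here are illustrative, not the real API.
package main

import (
	"fmt"
	"sort"
)

// FrameworkPlugin is an assumed interface a custom ML-framework
// plugin would implement to hook into the pipeline.
type FrameworkPlugin interface {
	Name() string
}

// torchPlugin is a stub plugin used only for illustration.
type torchPlugin struct{}

func (torchPlugin) Name() string { return "torch" }

// Registry maps plugin names to implementations.
type Registry struct {
	plugins map[string]FrameworkPlugin
}

func NewRegistry() *Registry {
	return &Registry{plugins: map[string]FrameworkPlugin{}}
}

// Register adds a plugin; duplicate names are rejected so two
// frameworks cannot silently shadow each other.
func (r *Registry) Register(p FrameworkPlugin) error {
	if _, ok := r.plugins[p.Name()]; ok {
		return fmt.Errorf("plugin %q already registered", p.Name())
	}
	r.plugins[p.Name()] = p
	return nil
}

// Names returns the registered plugin names, sorted for stable output.
func (r *Registry) Names() []string {
	out := make([]string, 0, len(r.plugins))
	for n := range r.plugins {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	reg := NewRegistry()
	if err := reg.Register(torchPlugin{}); err != nil {
		panic(err)
	}
	fmt.Println(reg.Names())
}
```

The real mechanism would presumably hang richer hooks (pod spec mutation, validation) off the interface, but the registration/lookup core would look similar.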
This one is specifically interesting to me! We are working on agentic, state machine orchestration, and I already ran a study to deploy Flux MiniClusters in Kubernetes for MPI applications. It would be really easy to extend this to a Kubeflow Trainer spec!
ROADMAP.md (Outdated)
> - Scheduling & Scalability
>   - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
>   - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
>   - Enhanced Multi-Node NVLink Support
What exactly does this mean? Expanding Trainer into a workload scheduler?
Not exactly. We've had several discussions with @Ronkahn21 about how to integrate TrainJob with the NVIDIA DRA Driver (e.g., ComputeDomain) to improve topology-aware placement for multi-node NVLink on GPUs like the GB200.
@Ronkahn21 will open an issue soon to track this work.
cc @klueska.
ROADMAP.md (Outdated)
> - MPI and HPC on Kubernetes
>   - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
>   - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
>   - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
Please open a dedicated trainer issue.
The mpi-operator and trainer should have separate mechanisms because they are not compatible with each other.
Good point, I will create it soon!
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Force-pushed from f65276d to d9f5cf6.
I updated the ROADMAP 2026 for Kubeflow Trainer and grouped the items into several topics.
Please let me know what you think, and whether we should add more items 🚀
cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @jaiakash @akshaychitneni @robert-bell @vsoch @Ronkahn21 @EkinKarabulut @omer-dayan @kaisoz @kannon92 @mimowo @Fiona-Waters @abhijeet-dhumal
@bigsur0 @shravan-achar @Krishna-kg732 @XploY04 @aniket2405 @johnugeorge @kuizhiqing @franciscojavierarceo @eero-t @kwohlfahrt @stivanov-intercom @nqvuong1998 @trivialfis
/hold