
feat(docs): Kubeflow Trainer ROADMAP 2026 #3242

Open
andreyvelich wants to merge 3 commits into kubeflow:master from andreyvelich:roadmap-2026

Conversation

@andreyvelich
Member

I updated the ROADMAP 2026 for Kubeflow Trainer and tried to group the items into several topics.

Please let me know what you think, and whether we should add more items 🚀

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @jaiakash @akshaychitneni @robert-bell @vsoch @Ronkahn21 @EkinKarabulut @omer-dayan @kaisoz @kannon92 @mimowo @Fiona-Waters @abhijeet-dhumal
@bigsur0 @shravan-achar @Krishna-kg732 @XploY04 @aniket2405 @johnugeorge @kuizhiqing @franciscojavierarceo @eero-t @kwohlfahrt @stivanov-intercom @nqvuong1998 @trivialfis

/hold

Copilot AI review requested due to automatic review settings February 24, 2026 00:00
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

This PR adds a comprehensive 2026 roadmap for Kubeflow Trainer, organizing planned features and enhancements into logical topic areas. The roadmap builds upon the 2025 roadmap already present in the file, providing a clear vision for the project's future direction.

Changes:

  • Added a new "2026" section to ROADMAP.md with organized categories of planned features
  • Included links to tracking issues for most roadmap items
  • Grouped items by themes: distributed scheduling, MPI/HPC, observability, data cache, LLM fine-tuning, new runtimes, UI, and integrations

- Enhanced Multi-Node NVLink Support
- First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
multi-cluster job dispatching, topology-aware scheduling, and other features.
- MPI and HPC on Kubernetes
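As a sketch of the Kueue integration item above: per the linked Kueue TrainJob docs, a TrainJob is submitted to a queue by labeling it with a LocalQueue name. The runtime and queue names below are placeholders, not values from this PR.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-example
  labels:
    # Kueue admits the job once the LocalQueue has quota.
    kueue.x-k8s.io/queue-name: user-queue  # placeholder queue name
spec:
  runtimeRef:
    name: torch-distributed  # placeholder runtime name
```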
Contributor


This looks great! Flux supports Intel MPI and PMIx.

Contributor


Most of the issues described in #2751 are not relevant to Flux.

- Distributed Data Cache
- Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
- Integration with OptimizationJob
- Explore RDMA with AI Schedulers and Data Cache
Contributor


RDMA will work nicely with MPI, although I suspect you are thinking of some of the Google products for GPU.

Member Author

@andreyvelich andreyvelich Feb 24, 2026


We've been chatting with @EkinKarabulut and @akshaychitneni about how we can leverage the Data Cache feature for RDMA. With the appropriate topology placement, data can be transferred directly to GPU nodes using zero-copy.

We believe that this could be highly beneficial for advanced AI data centres using GPUs like the GB200.

We can also explore how our MPI support might be helpful in this context.

- Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
- Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
- Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
Contributor


This one is specifically interesting to me! We are working on agentic, state machine orchestration, and I already ran a study to deploy Flux MiniClusters in Kubernetes for MPI applications. It would be really easy to extend this to a Kubeflow Trainer spec!

ROADMAP.md Outdated
- Scheduling & Scalability
- Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
- KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
- Enhanced Multi-Node NVLink Support
Member


What exactly does this mean? Expanding Trainer into a workload scheduler?

Member Author


Not exactly. We've had several discussions with @Ronkahn21 about how to integrate TrainJob with the NVIDIA DRA Driver (e.g., ComputeDomain) to improve topology-aware placement for multi-node NVLink on GPUs like the GB200.
@Ronkahn21 will open an issue soon to track this work.

cc @klueska.

ROADMAP.md Outdated
- MPI and HPC on Kubernetes
- Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
- IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
- PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
Member


Please open a dedicated trainer issue.
The mpi-operator and trainer should have separate mechanisms because they are not compatible.

Member Author


Good point, I will create it soon!

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
