Commit dc254b0

feat(docs): Kubeflow Trainer ROADMAP 2026 (#3242)
* feat(docs): Kubeflow Trainer ROADMAP 2026

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add Scalability feature

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add issue for Multi-Node NVLink

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update ROADMAP.md

Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update ROADMAP.md

Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add item for Runtime lifecycle management

Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>

* Add Observability items

Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>
Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>
Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
1 parent d4546f4 commit dc254b0

1 file changed

Lines changed: 39 additions & 0 deletions

ROADMAP.md

@@ -1,5 +1,44 @@
 # Kubeflow Trainer ROADMAP

+## 2026
+
+- Scheduling & Scalability
+  - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
+  - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
+  - Support Multi-Node NVLink (MNNVL) for TrainJob: https://github.com/kubeflow/trainer/issues/3264
+  - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
+    multi-cluster job dispatching, topology-aware scheduling, and other features.
+  - Enhanced Scalability for Massively Distributed TrainJobs: https://github.com/kubeflow/trainer/issues/2318
+- MPI and HPC on Kubernetes
+  - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
+  - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
+  - PMIx Investigation with Flux or Slurm plugins
+  - Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
+- Observability & Reliability
+  - TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779
+  - Transparent Checkpoint/Restore for GPU-Accelerated TrainJobs: https://github.com/kubeflow/trainer/issues/2245
+  - TTLs and ActiveDeadlineSeconds for TrainJobs: https://github.com/kubeflow/trainer/issues/2899
+  - Elastic TrainJobs: https://github.com/kubeflow/trainer/issues/2903
+  - Add controller-level Prometheus metrics and ServiceMonitor: https://github.com/kubeflow/trainer/issues/3429
+  - Default Grafana dashboard for Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/3430
+- Distributed Data Cache
+  - Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
+  - Integration with OptimizationJob
+  - Explore RDMA with AI Schedulers and Data Cache
+- LLM Fine-Tuning Enhancements
+  - Automatic configuration of GPU requests for TrainJobs: https://github.com/kubeflow/trainer/issues/3328
+  - Build Dynamic BuiltinTrainers and LLM Fine-Tuning Blueprints: https://github.com/kubeflow/trainer/issues/2839
+- New Kubeflow Trainer Runtimes
+  - Distributed JAX: https://github.com/kubeflow/trainer/issues/2442
+  - Distributed XGBoost: https://github.com/kubeflow/trainer/issues/2598
+  - Tensor Parallelism with Megatron-LM: https://github.com/kubeflow/trainer/issues/3178
+  - Slurm Runtime Integration: https://github.com/kubeflow/trainer/issues/2249
+- Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
+  frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
+- Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
+- Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
+- Enhance lifecycle management and mutability of Runtimes: https://github.com/kubeflow/trainer/pull/3428
+
 ## 2025

 - Kubeflow Trainer v2 general availability: https://github.com/kubeflow/trainer/issues/2170
