Commit da0f3d5

Add Scalability feature
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
1 parent 091a738 commit da0f3d5

1 file changed

Lines changed: 3 additions & 2 deletions

ROADMAP.md

@@ -2,18 +2,19 @@
 
 ## 2026
 
-- Distributed AI Scheduling Enhancements
+- Scheduling & Scalability
   - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
   - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
   - Enhanced Multi-Node NVLink Support
   - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
     multi-cluster job dispatching, topology-aware scheduling, and other features.
+  - Enhanced Scalability for Massively Distributed TrainJobs: https://github.com/kubeflow/trainer/issues/2318
 - MPI and HPC on Kubernetes
   - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
   - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
   - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
   - Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
-- Observability and Reliability
+- Observability & Reliability
   - TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779
   - Transparent Checkpoint/Restore for GPU-Accelerated TrainJobs: https://github.com/kubeflow/trainer/issues/2245
   - TTLs and ActiveDeadlineSeconds for TrainJobs: https://github.com/kubeflow/trainer/issues/2899
