  ## 2026

- - Distributed AI Scheduling Enhancements
+ - Scheduling & Scalability
    - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
    - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
    - Enhanced Multi-Node NVLink Support
    - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
      multi-cluster job dispatching, topology-aware scheduling, and other features.
+   - Enhanced Scalability for Massively Distributed TrainJobs: https://github.com/kubeflow/trainer/issues/2318
  - MPI and HPC on Kubernetes
    - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
    - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
    - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
    - Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
- - Observability and Reliability
+ - Observability & Reliability
    - TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779
    - Transparent Checkpoint/Restore for GPU-Accelerated TrainJobs: https://github.com/kubeflow/trainer/issues/2245
    - TTLs and ActiveDeadlineSeconds for TrainJobs: https://github.com/kubeflow/trainer/issues/2899