 - Scheduling & Scalability
   - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
   - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
-  - Enhanced Multi-Node NVLink Support
+  - Support Multi-Node NVLink (MNNVL) for TrainJob: https://github.com/kubeflow/trainer/issues/3264
   - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
     multi-cluster job dispatching, topology-aware scheduling, and other features.
   - Enhanced Scalability for Massively Distributed TrainJobs: https://github.com/kubeflow/trainer/issues/2318
 - MPI and HPC on Kubernetes
   - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
   - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
-  - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
+  - PMIx Investigation with Flux or Slurm plugins
   - Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
 - Observability & Reliability
   - TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779