Labels
enhancement (New feature or request)
Description
Prerequisites
- I searched existing issues
Feature Summary
Kubernetes 1.35 introduced support for gang scheduling (https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/). As part of this support, the pod spec now includes a reference to the gang that the pod belongs to:
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
Using workloadRef, NVSentinel should be able to inject an init container that runs a variety of checks, including but not limited to:
- DCGM diag
- NCCL loopback
- NCCL all-reduce across a gang
- Pluggable third party checks
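To make the injection idea concrete, here is a minimal sketch of how a mutating admission webhook could build a JSON Patch that prepends a preflight init container to any pod carrying a workloadRef. This is only an illustration, not a proposed design: the image name, container name, environment variable names, and the `build_preflight_patch` helper are all hypothetical, and nothing here reflects existing NVSentinel code.

```python
import json

# Hypothetical values for illustration only; not part of NVSentinel.
PREFLIGHT_IMAGE = "example.com/preflight-checks:latest"
PREFLIGHT_NAME = "preflight-checks"


def build_preflight_patch(pod, checks):
    """Return JSON Patch ops that add a preflight init container.

    Returns an empty list when the pod has no workloadRef or when the
    preflight container is already present.
    """
    spec = pod.get("spec", {})
    ref = spec.get("workloadRef")
    if not ref:
        return []  # not part of a gang, nothing to inject

    existing = spec.get("initContainers") or []
    if any(c.get("name") == PREFLIGHT_NAME for c in existing):
        return []  # already injected, keep the mutation idempotent

    container = {
        "name": PREFLIGHT_NAME,
        "image": PREFLIGHT_IMAGE,
        "env": [
            # Hypothetical variable names passing gang identity to the checks.
            {"name": "WORKLOAD_NAME", "value": ref.get("name", "")},
            {"name": "POD_GROUP", "value": ref.get("podGroup", "")},
            {"name": "PREFLIGHT_CHECKS", "value": ",".join(checks)},
        ],
    }

    if existing:
        # "add" at index 0 inserts before the current first init container.
        return [{"op": "add", "path": "/spec/initContainers/0", "value": container}]
    # No init containers yet: create the list with the preflight container.
    return [{"op": "add", "path": "/spec/initContainers", "value": [container]}]


if __name__ == "__main__":
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "worker-0", "namespace": "some-ns"},
        "spec": {
            "workloadRef": {"name": "training-job-workload", "podGroup": "workers"}
        },
    }
    print(json.dumps(build_preflight_patch(pod, ["dcgm-diag", "nccl-loopback"]), indent=2))
```

Passing the workload name and pod group to the container is one possible way for gang-wide checks such as an NCCL all-reduce to discover their peers; how the checks would actually coordinate is left open here.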
Problem/Use Case
As an operator, I want to run preflight checks before a pod starts so that active health checks can give user jobs a higher chance of success.
Proposed Solution
TBD
Component
Health Monitor