
[Feature]: Support for preflight/prolog checks #658

@lalitadithya


Prerequisites

  • I searched existing issues

Feature Summary

Kubernetes 1.35 introduced support for gang scheduling (see https://kubernetes.io/blog/2025/12/29/kubernetes-v1-35-introducing-workload-aware-scheduling/ ). As part of this support, the pod spec now has a reference to the gang that the pod belongs to:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers

Using workloadRef, NVSentinel should be able to inject an init container that runs a variety of checks, including but not limited to:

  • DCGM diag
  • NCCL loopback
  • NCCL all-reduce across a gang
  • Pluggable third party checks
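
One possible way to perform the injection is a mutating admission webhook that returns a JSON Patch (RFC 6902) adding the init container. The sketch below only builds the patch and is not a proposed design. The container name, the image `example.com/preflight:latest`, the check identifiers, and the environment variable names are all hypothetical placeholders.

```python
import json
from typing import List

# Hypothetical check identifiers mirroring the list above; not an NVSentinel API.
DEFAULT_CHECKS = ["dcgm-diag", "nccl-loopback", "nccl-allreduce"]


def preflight_patch(pod: dict, image: str, checks: List[str]) -> List[dict]:
    """Build JSON Patch operations that prepend a preflight init container.

    Returns an empty patch when the pod has no workloadRef, so pods outside
    a gang are left untouched.
    """
    spec = pod.get("spec", {})
    ref = spec.get("workloadRef")
    if not ref:
        return []

    container = {
        "name": "preflight",  # hypothetical container name
        "image": image,
        "env": [
            # Hypothetical variables telling the container what to run
            # and which gang it belongs to.
            {"name": "PREFLIGHT_CHECKS", "value": ",".join(checks)},
            {"name": "GANG_WORKLOAD", "value": ref.get("name", "")},
            {"name": "GANG_POD_GROUP", "value": ref.get("podGroup", "")},
        ],
    }

    if spec.get("initContainers"):
        # Init containers run in order, so insert at index 0 to run first.
        return [{"op": "add", "path": "/spec/initContainers/0", "value": container}]
    # No existing list: create it with the preflight container as its only entry.
    return [{"op": "add", "path": "/spec/initContainers", "value": [container]}]


gang_pod = {
    "metadata": {"name": "worker-0", "namespace": "some-ns"},
    "spec": {
        "workloadRef": {"name": "training-job-workload", "podGroup": "workers"},
        "containers": [{"name": "trainer", "image": "example.com/trainer:latest"}],
    },
}

print(json.dumps(preflight_patch(gang_pod, "example.com/preflight:latest", DEFAULT_CHECKS), indent=2))
```

Passing the check list as data keeps third-party checks pluggable in this sketch: adding a check means extending the list rather than changing the patch logic.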

Problem/Use Case

As an operator, I want to run active health checks as preflight checks before a pod starts, so that user jobs have a higher chance of success.

Proposed Solution

TBD

Component

Health Monitor


Labels

enhancement (New feature or request)
