
Add GPU resource enforcement to prevent container usage violations #423

@eveningcafe

Description

What would you like to be added?

Implement a two-part enforcement system:

  1. Monitor: Track actual GPU resource usage (memory, SM utilization) at the container/process level
  2. Policy Enforcement: Automatically terminate containers that violate their resource reservations (see the sketch after this list)
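
For illustration only, here is a minimal sketch of what the enforcement half could look like, assuming per-pod usage numbers already come from some monitor. The `usage_by_pod` input, the annotation key `example.com/gpu-memory-limit-bytes`, and the eviction-based policy are all hypothetical, not part of KAI-Scheduler:

```python
# Hypothetical enforcement sketch: evict pods whose measured GPU memory usage
# exceeds the reservation recorded in a (hypothetical) pod annotation.
from kubernetes import client, config

LIMIT_ANNOTATION = "example.com/gpu-memory-limit-bytes"  # hypothetical key

def enforce(usage_by_pod: dict[tuple[str, str], int]) -> None:
    """usage_by_pod maps (namespace, pod name) -> measured GPU memory in bytes."""
    config.load_incluster_config()  # assumes this runs as an in-cluster controller
    v1 = client.CoreV1Api()
    for (namespace, name), used_bytes in usage_by_pod.items():
        pod = v1.read_namespaced_pod(name, namespace)
        limit = (pod.metadata.annotations or {}).get(LIMIT_ANNOTATION)
        if limit is not None and used_bytes > int(limit):
            # Evict rather than delete so PodDisruptionBudgets are honored.
            eviction = client.V1Eviction(
                metadata=client.V1ObjectMeta(name=name, namespace=namespace)
            )
            v1.create_namespaced_pod_eviction(name, namespace, eviction)
```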

Technical Challenges:

The policy-enforcement part seems achievable, but for the monitoring part I have not found any project that handles it well (short of adding instrumentation to end-user code).
The DCGM exporter cannot see per-process metrics.
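
One possible building block: NVML does expose per-process GPU memory (nvmlDeviceGetComputeRunningProcesses) and per-process SM utilization (nvmlDeviceGetProcessUtilization), and PIDs can be mapped back to containers via /proc/<pid>/cgroup. Below is a rough, node-agent-style sketch using the nvidia-ml-py bindings; the cgroup-based container mapping is an assumption and depends on the container runtime:

```python
# Rough monitoring sketch: per-process GPU memory via NVML, aggregated by the
# owning cgroup. Not production code; parsing container identity out of
# /proc/<pid>/cgroup is runtime-dependent and is only an assumption here.
import pynvml

def gpu_memory_by_cgroup(device_index: int = 0) -> dict[str, int]:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        usage: dict[str, int] = {}
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            try:
                with open(f"/proc/{proc.pid}/cgroup") as f:
                    cgroup = f.readline().strip()  # crude: first cgroup line
            except FileNotFoundError:
                continue  # process exited between the NVML query and /proc read
            # usedGpuMemory can be None on some platforms; treat that as 0.
            usage[cgroup] = usage.get(cgroup, 0) + (proc.usedGpuMemory or 0)
        return usage
    finally:
        pynvml.nvmlShutdown()
```

A DaemonSet-style agent could export these per-cgroup numbers and feed them into an enforcement loop like the one sketched above, though whether this is robust enough to replace a dedicated NVIDIA tool is exactly the open question.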

As NVIDIA, do you have any recommendations for:

  1. Existing tools or APIs that can provide per-container GPU usage metrics?
  2. Plans for enhancing DCGM or other NVIDIA tools to support this use case?
  3. Alternative approaches to achieve runtime GPU resource enforcement?

Why is this needed?

The current GPU sharing implementation in KAI-Scheduler faces a classic "Tragedy of the Commons" problem. While the scheduler successfully reserves GPU resources for pods during scheduling, there is no runtime enforcement to prevent containers from exceeding their allocated limits. This creates several risks:

  1. Resource abuse: Containers can consume more GPU memory or compute than allocated, starving other workloads
  2. Unpredictable performance: Shared GPU workloads may experience degraded performance due to resource contention
  3. System instability: GPU memory exhaustion can cause CUDA out-of-memory errors affecting all containers on the GPU

Labels: enhancement (New feature or request)