
Add GPU resource enforcement to prevent container usage violations #423

@eveningcafe

Description

What would you like to be added?

Implement a two-part enforcement system:

  1. Monitor: Track actual GPU resource usage (memory, SM utilization) at the container/process level
  2. Policy Enforcement: Automatically terminate containers that violate their resource reservations (see the sketch after this list)
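
For illustration only, here is a minimal sketch of what the enforcement half could look like, assuming per-pod usage numbers already come from some monitor. The `usage_by_pod` input, the annotation key `example.com/gpu-memory-limit-bytes`, and the eviction-based policy are all hypothetical, not part of KAI-Scheduler:

```python
# Hypothetical enforcement sketch: evict pods whose measured GPU memory usage
# exceeds the reservation recorded in a (hypothetical) pod annotation.
from kubernetes import client, config

LIMIT_ANNOTATION = "example.com/gpu-memory-limit-bytes"  # hypothetical key

def enforce(usage_by_pod: dict[tuple[str, str], int]) -> None:
    """usage_by_pod maps (namespace, pod name) -> measured GPU memory in bytes."""
    config.load_incluster_config()  # assumes this runs as an in-cluster controller
    v1 = client.CoreV1Api()
    for (namespace, name), used_bytes in usage_by_pod.items():
        pod = v1.read_namespaced_pod(name, namespace)
        limit = (pod.metadata.annotations or {}).get(LIMIT_ANNOTATION)
        if limit is not None and used_bytes > int(limit):
            # Evict rather than delete so PodDisruptionBudgets are honored.
            eviction = client.V1Eviction(
                metadata=client.V1ObjectMeta(name=name, namespace=namespace)
            )
            v1.create_namespaced_pod_eviction(name, namespace, eviction)
```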

Technical Challenges:

The policy-enforcement part seems achievable, but for the monitoring part I have not found any project that handles it well (short of adding instrumentation to end-user code).
The DCGM exporter cannot see per-process metrics.
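
One possible building block: NVML does expose per-process GPU memory (nvmlDeviceGetComputeRunningProcesses) and per-process SM utilization (nvmlDeviceGetProcessUtilization), and PIDs can be mapped back to containers via /proc/<pid>/cgroup. Below is a rough, node-agent-style sketch using the nvidia-ml-py bindings; the cgroup-based container mapping is an assumption and depends on the container runtime:

```python
# Rough monitoring sketch: per-process GPU memory via NVML, aggregated by the
# owning cgroup. Not production code; parsing container identity out of
# /proc/<pid>/cgroup is runtime-dependent and is only an assumption here.
import pynvml

def gpu_memory_by_cgroup(device_index: int = 0) -> dict[str, int]:
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        usage: dict[str, int] = {}
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            try:
                with open(f"/proc/{proc.pid}/cgroup") as f:
                    cgroup = f.readline().strip()  # crude: first cgroup line
            except FileNotFoundError:
                continue  # process exited between the NVML query and /proc read
            # usedGpuMemory can be None on some platforms; treat that as 0.
            usage[cgroup] = usage.get(cgroup, 0) + (proc.usedGpuMemory or 0)
        return usage
    finally:
        pynvml.nvmlShutdown()
```

A DaemonSet-style agent could export these per-cgroup numbers and feed them into an enforcement loop like the one sketched above, though whether this is robust enough to replace a dedicated NVIDIA tool is exactly the open question.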

As NVIDIA, do you have any recommendations for:

  1. Existing tools or APIs that can provide per-container GPU usage metrics?
  2. Plans for enhancing DCGM or other NVIDIA tools to support this use case?
  3. Alternative approaches to achieve runtime GPU resource enforcement?

Why is this needed?

The current GPU sharing implementation in KAI-Scheduler faces a classic "Tragedy of the Commons" problem. While the scheduler successfully reserves GPU resources for pods during scheduling, there is no runtime enforcement to prevent containers from exceeding their allocated limits. This creates several risks:

  1. Resource abuse: Containers can consume more GPU memory or compute than allocated, starving other workloads
  2. Unpredictable performance: Shared GPU workloads may experience degraded performance due to resource contention
  3. System instability: GPU memory exhaustion can cause CUDA out-of-memory errors affecting all containers on the GPU

Labels: enhancement (New feature or request)