What would you like to be added?
Implement a two-part enforcement system:
- Monitor: Track actual GPU resource usage (memory, SM utilization) at the container/process level (see the sampling sketch below)
- Policy Enforcement: Automatically terminate containers that violate their resource reservations
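For context, here is a minimal sketch of the kind of per-process sampling the monitor would need, assuming NVML queries via go-nvml are acceptable on the node. This is illustrative, not a proposal: it only samples device index 0, and mapping host PIDs back to containers (e.g. via cgroups) is left out entirely.

```go
// Illustrative only: per-process GPU memory and SM utilization sampling
// through go-nvml. Assumes the node has the NVIDIA driver and that host
// PIDs can later be mapped back to containers (not shown here).
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("nvml init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("get device 0 failed: %v", nvml.ErrorString(ret))
	}

	// GPU memory currently allocated by each compute process on this device.
	procs, ret := device.GetComputeRunningProcesses()
	if ret != nvml.SUCCESS {
		log.Fatalf("get running processes failed: %v", nvml.ErrorString(ret))
	}
	for _, p := range procs {
		fmt.Printf("pid=%d usedGpuMemory=%d bytes\n", p.Pid, p.UsedGpuMemory)
	}

	// Per-process SM/memory utilization samples newer than the given
	// timestamp (0 = whatever the driver still buffers). The driver only
	// keeps a short window of samples, so this must be polled regularly.
	samples, ret := device.GetProcessUtilization(0)
	if ret == nvml.SUCCESS {
		for _, s := range samples {
			fmt.Printf("pid=%d smUtil=%d%% memUtil=%d%%\n", s.Pid, s.SmUtil, s.MemUtil)
		}
	}
}
```

Even with this, attributing the per-PID numbers to the right pod/container reliably is the part I have not seen solved well in any existing project.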
Technical Challenges:
The policy enforcement part seems achievable, but for the monitoring part I have not found any project that does it well (unless we add instrumentation to end-user code).
The DCGM exporter cannot see per-process metrics.
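To show why the enforcement half seems tractable, here is a rough sketch of terminating an offending pod through client-go. The namespace, pod name, and the decision that a pod exceeded its reservation are hypothetical; in practice eviction or a softer remediation might be preferable.

```go
// Illustrative only: terminating a pod flagged as exceeding its GPU
// reservation, via a plain pod deletion through client-go.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("in-cluster config: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("clientset: %v", err)
	}

	// Hypothetical offender identified by the (still missing) monitor.
	ns, pod := "team-a", "training-job-0"
	if err := clientset.CoreV1().Pods(ns).Delete(context.TODO(), pod, metav1.DeleteOptions{}); err != nil {
		log.Fatalf("delete pod %s/%s: %v", ns, pod, err)
	}
	log.Printf("terminated %s/%s for exceeding its GPU reservation", ns, pod)
}
```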
Since you are NVIDIA, do you have any recommendations for the following:
- Existing tools or APIs that can provide per-container GPU usage metrics?
- Plans for enhancing DCGM or other NVIDIA tools to support this use case?
- Alternative approaches to achieve runtime GPU resource enforcement?
Why is this needed?
The current GPU sharing implementation in KAI-Scheduler faces a classic "Tragedy of the Commons" problem. While the scheduler successfully reserves GPU resources for pods during scheduling, there is no runtime enforcement to prevent containers from exceeding their allocated limits. This creates several risks:
- Resource abuse: Containers can consume more GPU memory or compute than allocated, starving other workloads
- Unpredictable performance: Shared GPU workloads may experience degraded performance due to resource contention
- System instability: GPU memory exhaustion can cause CUDA out-of-memory errors affecting all containers on the GPU