Project-HAMi/KAI-resource-isolator

kai-resource-isolator

kai-resource-isolator works alongside KAI-Scheduler to enforce GPU memory isolation for GPU-sharing workloads. It leverages HAMi-core to intercept CUDA calls inside the container and apply a hard memory limit, so each container only sees the GPU memory it was allocated.

For architecture details see the Design section.

Quick Start

1. Deploy KAI-Scheduler with GPU sharing enabled

Follow the KAI-Scheduler deployment guide and enable gpushare and hamicore:

helm install kai-scheduler oci://ghcr.io/nvidia/kai-scheduler \
  --set scheduler.gpuSharing.enabled=true \
  --set scheduler.gpuSharing.hamicoreEnabled=true \
  --namespace kai-scheduler --create-namespace

2. Deploy kai-resource-isolator

Install directly from the OCI registry:

helm install kai-resource-isolator oci://docker.io/projecthami/kai-resource-isolator \
  --namespace kai-resource-isolator --create-namespace \
  --version 1.0.0-chart

Note: Chart versions carry a -chart suffix (e.g. 1.0.0-chart). Available versions are listed at projecthami/kai-resource-isolator on Docker Hub.

Build

The build context must be the kai-resource-isolator repository root (the directory that contains go.mod, libvgpu/, and cmd/).

git submodule update --init --recursive
docker build -f docker/Dockerfile -t <registry>/<project>/kai-resource-isolator:<tag> .

Customization

Adjust paths.containerVgpuMount and webhook.gpuShareResources to match your environment and the HAMi extended resource names used in your cluster.

Because this chart installs a MutatingWebhookConfiguration, the webhook server requires a valid TLS certificate. The chart ships with two modes:

| Mode | Values | Requires |
| --- | --- | --- |
| Helm hook (default) | tls.patch.enabled: true | Nothing: a Job auto-generates a self-signed cert and patches the webhook CA bundle |
| cert-manager | tls.certManager.enabled: true + tls.patch.enabled: false | cert-manager installed in the cluster |
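As an illustration of switching to cert-manager mode, the two values listed above can be collected in a small override file. This is a minimal sketch: the file name values-certmanager.yaml is hypothetical, and the nesting is inferred from the dotted value names rather than taken from the chart itself.

```shell
# Write a values override that selects cert-manager mode.
# The file name is an arbitrary choice for this example; the YAML nesting
# is inferred from the dotted keys tls.patch.enabled and tls.certManager.enabled.
cat > values-certmanager.yaml <<'EOF'
tls:
  patch:
    enabled: false   # turn off the default Helm hook Job
  certManager:
    enabled: true    # let cert-manager issue the webhook certificate
EOF
```

Passing this file with helm's -f flag to the install command from step 2 should select cert-manager mode, assuming cert-manager is already installed in the cluster.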

Design

GPU sharing in KAI-Scheduler allows a Pod to request a fraction of a GPU (e.g. 0.5) or a specific amount of GPU memory. Without memory isolation, however, a container can still access the full GPU memory at the CUDA level, regardless of what it requested.

kai-resource-isolator closes this gap by combining two components:

| Component | Role |
| --- | --- |
| DaemonSet (libsync) | Copies libvgpu.so (HAMi-core) to /usr/local/vgpu on every GPU node |
| Mutating webhook | Injects the libvgpu hostPath volume and ld.so.preload into Pods that request GPU-sharing resources |

The full flow when a GPU-sharing Pod is submitted:

  1. KAI-Scheduler selects a node and injects the CUDA_DEVICE_MEMORY_LIMIT environment variable into the Pod, set to the allocated memory amount.
  2. The kai-resource-isolator webhook injects a hostPath volume mount (/usr/local/vgpu) and patches /etc/ld.so.preload so that libvgpu.so is loaded by the container at runtime.
  3. The container starts; libvgpu.so intercepts CUDA memory allocation calls and enforces the limit set by CUDA_DEVICE_MEMORY_LIMIT.
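
The flow above can be spot-checked from inside a running GPU-sharing container. The function below is a rough diagnostic sketch, not part of the project: check_isolation is a hypothetical helper, and it assumes the library is mounted at /usr/local/vgpu inside the container (the actual path follows paths.containerVgpuMount) and that /etc/ld.so.preload names libvgpu.so.

```shell
#!/bin/sh
# check_isolation [PRELOAD_FILE] [VGPU_DIR]
# Hypothetical helper: prints one line per check and returns non-zero
# if any of the expected injection results is missing.
check_isolation() {
    preload_file="${1:-/etc/ld.so.preload}"
    vgpu_dir="${2:-/usr/local/vgpu}"   # assumed mount path inside the container
    rc=0

    # Step 1: the scheduler is expected to have set the memory limit.
    if [ -n "${CUDA_DEVICE_MEMORY_LIMIT:-}" ]; then
        echo "ok: CUDA_DEVICE_MEMORY_LIMIT=${CUDA_DEVICE_MEMORY_LIMIT}"
    else
        echo "missing: CUDA_DEVICE_MEMORY_LIMIT is not set"
        rc=1
    fi

    # Step 2a: the library should be present under the mounted directory.
    if [ -f "${vgpu_dir}/libvgpu.so" ]; then
        echo "ok: ${vgpu_dir}/libvgpu.so exists"
    else
        echo "missing: ${vgpu_dir}/libvgpu.so"
        rc=1
    fi

    # Step 2b: the preload file should reference libvgpu.so.
    if grep -q 'libvgpu\.so' "$preload_file" 2>/dev/null; then
        echo "ok: ${preload_file} references libvgpu.so"
    else
        echo "missing: libvgpu.so entry in ${preload_file}"
        rc=1
    fi

    return "$rc"
}
```

Calling check_isolation with no arguments uses the default paths. It only confirms that the injection happened; it does not prove that the limit is enforced, which would require a CUDA workload that tries to allocate more than its share.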

[Figure: Architecture]
