kai-resource-isolator works alongside KAI-Scheduler to enforce GPU memory isolation for GPU-sharing workloads. It leverages HAMi-core to intercept CUDA calls inside the container and apply a hard memory limit, so each container only sees the GPU memory it was allocated.
For architecture details see the Design section.
Follow the KAI-Scheduler deployment guide and enable gpushare and hamicore:

    helm install kai-scheduler oci://ghcr.io/nvidia/kai-scheduler \
      --set scheduler.gpuSharing.enabled=true \
      --set scheduler.gpuSharing.hamicoreEnabled=true \
      --namespace kai-scheduler --create-namespace

Install directly from the OCI registry:

    helm install kai-resource-isolator oci://docker.io/projecthami/kai-resource-isolator \
      --namespace kai-resource-isolator --create-namespace \
      --version 1.0.0-chart

Note: Chart versions carry a -chart suffix (e.g. 1.0.0-chart). Available versions are listed at projecthami/kai-resource-isolator on Docker Hub.
When building the image yourself, the build context must be the kai-resource-isolator repository root (the directory that contains go.mod, libvgpu/, and cmd/).

    git submodule update --init --recursive
    docker build -f docker/Dockerfile -t <registry>/<project>/kai-resource-isolator:<tag> .

Tune paths.containerVgpuMount and webhook.gpuShareResources for your environment and HAMi extended resource names.
Because this chart installs a MutatingWebhookConfiguration, the webhook server requires a valid TLS certificate. The chart ships with two modes:
| Mode | Values | Requires |
|---|---|---|
| Helm hook (default) | tls.patch.enabled: true | Nothing (a Job auto-generates a self-signed cert and patches the webhook CA bundle) |
| cert-manager | tls.certManager.enabled: true + tls.patch.enabled: false | cert-manager installed in the cluster |
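As an illustration of the cert-manager mode, the two values from the table can be kept in a small override file. This is a minimal sketch: the file name tls-cert-manager-values.yaml is hypothetical, and the nesting is simply the usual Helm mapping of the dotted keys tls.certManager.enabled and tls.patch.enabled; check the chart's own values.yaml before relying on it.

```shell
# Write a values override that switches the chart to cert-manager mode.
# The file name is an arbitrary choice for this example.
cat > tls-cert-manager-values.yaml <<'EOF'
tls:
  certManager:
    enabled: true    # certificate is issued by cert-manager
  patch:
    enabled: false   # turn off the default Helm hook Job
EOF
```

The file could then be passed to the kai-resource-isolator install command with Helm's -f (--values) flag. cert-manager must already be installed in the cluster, as the table notes.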
GPU sharing in KAI-Scheduler allows a Pod to request a fraction of a GPU (e.g. 0.5) or a specific amount of GPU memory. Without memory isolation, however, containers could still access the full GPU memory at the CUDA level.
kai-resource-isolator closes this gap by combining two components:
| Component | Role |
|---|---|
| DaemonSet (libsync) | Copies libvgpu.so (HAMi-core) to /usr/local/vgpu on every GPU node |
| Mutating webhook | Injects the libvgpu hostPath volume and ld.so.preload into Pods that request GPU-sharing resources |
The full flow when a GPU-sharing Pod is submitted:
- KAI-Scheduler selects a node and injects the CUDA_DEVICE_MEMORY_LIMIT environment variable into the Pod, set to the allocated memory amount.
- kai-resource-isolator webhook injects a hostPath volume mount (/usr/local/vgpu) and patches /etc/ld.so.preload so that libvgpu.so is loaded by the container at runtime.
- The container starts; libvgpu.so intercepts CUDA memory allocation calls and enforces the limit set by CUDA_DEVICE_MEMORY_LIMIT.
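One way to confirm that the flow above took effect is to inspect a running GPU-sharing container. The function below is a hedged sketch, not part of the project: check_isolation, VGPU_DIR, and PRELOAD_FILE are hypothetical names introduced here, and the defaults assume the container mount path is /usr/local/vgpu (it may differ if paths.containerVgpuMount was changed).

```shell
# Run inside a GPU-sharing container. Checks the three things the flow
# relies on: the env var, the mounted library, and the preload entry.
# VGPU_DIR and PRELOAD_FILE are hypothetical overrides for this sketch.
check_isolation() {
  vgpu_dir="${VGPU_DIR:-/usr/local/vgpu}"
  preload="${PRELOAD_FILE:-/etc/ld.so.preload}"
  status=0

  # Step 1: the scheduler should have injected the memory limit.
  if [ -z "${CUDA_DEVICE_MEMORY_LIMIT:-}" ]; then
    echo "CUDA_DEVICE_MEMORY_LIMIT is not set"
    status=1
  else
    echo "memory limit: $CUDA_DEVICE_MEMORY_LIMIT"
  fi

  # Step 2a: the webhook should have mounted the library directory.
  if [ ! -f "$vgpu_dir/libvgpu.so" ]; then
    echo "libvgpu.so not found in $vgpu_dir"
    status=1
  fi

  # Step 2b: the preload file should reference libvgpu.so.
  if ! grep -q 'libvgpu.so' "$preload" 2>/dev/null; then
    echo "libvgpu.so not listed in $preload"
    status=1
  fi

  return "$status"
}
```

The function returns 0 only when all three checks pass. It could be pasted into a shell opened in the container with kubectl exec and then called as check_isolation. It does not verify that the limit is actually enforced at the CUDA level, only that the pieces are in place.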
