kai-resource-isolator

kai-resource-isolator works alongside KAI-Scheduler to enforce GPU memory isolation for GPU-sharing workloads. It leverages HAMi-core to intercept CUDA calls inside the container and apply a hard memory limit, so each container only sees the GPU memory it was allocated.

For architecture details, see the Design section.

Quick Start

1. Deploy KAI-Scheduler with GPU sharing enabled

Follow the KAI-Scheduler deployment guide and enable the GPU sharing and HAMi-core options:

helm install kai-scheduler oci://ghcr.io/nvidia/kai-scheduler \
  --set scheduler.gpuSharing.enabled=true \
  --set scheduler.gpuSharing.hamicoreEnabled=true \
  --namespace kai-scheduler --create-namespace

2. Deploy kai-resource-isolator

Install directly from the OCI registry:

helm install kai-resource-isolator oci://docker.io/projecthami/kai-resource-isolator \
  --namespace kai-resource-isolator --create-namespace \
  --version 1.0.0-chart

Note: Chart versions carry a -chart suffix (e.g. 1.0.0-chart). Available versions are listed at projecthami/kai-resource-isolator on Docker Hub.
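
After installing, it can help to confirm that the DaemonSet, its Pods, and the webhook registration exist. The following is a minimal sketch, assuming `kubectl` is configured for the target cluster. The function name `verify_isolator_install` is hypothetical, and because the exact resource names depend on the chart, it lists resources rather than naming them.

```shell
# List the workloads and the webhook registration created by the chart.
# Arg 1: namespace (defaults to the one used in the helm install command above).
verify_isolator_install() {
  ns="${1:-kai-resource-isolator}"
  kubectl get daemonsets,pods --namespace "$ns" &&
    kubectl get mutatingwebhookconfigurations
}
```

Run `verify_isolator_install` after the Helm release reports as deployed.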

Build

The build context must be the kai-resource-isolator repository root (the directory that contains go.mod, libvgpu/, and cmd/).

git submodule update --init --recursive
docker build -f docker/Dockerfile -t <registry>/<project>/kai-resource-isolator:<tag> .
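
A quick pre-build check can catch a wrong build context or an uninitialised submodule before `docker build` fails partway through. This is a sketch only: the helper name `check_build_context` is hypothetical, and the paths it checks are the ones named above plus `docker/Dockerfile` from the build command.

```shell
# Check that DIR looks like the repository root expected as the build context.
# Arg 1: directory to check (defaults to the current directory).
check_build_context() {
  dir="${1:-.}"
  for path in go.mod cmd libvgpu docker/Dockerfile; do
    if [ ! -e "$dir/$path" ]; then
      echo "missing: $dir/$path" >&2
      return 1
    fi
  done
  # An empty libvgpu/ usually means the submodule was never initialised.
  if [ -z "$(ls -A "$dir/libvgpu")" ]; then
    echo "libvgpu/ is empty; run: git submodule update --init --recursive" >&2
    return 1
  fi
  echo "build context OK: $dir"
}
```

For example, `check_build_context . && docker build -f docker/Dockerfile -t <registry>/<project>/kai-resource-isolator:<tag> .` only starts the build when the check passes.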

Customization

Set paths.containerVgpuMount and webhook.gpuShareResources to match your environment and the HAMi extended resource names used in your cluster.

Because this chart installs a MutatingWebhookConfiguration, the webhook server requires a valid TLS certificate. The chart ships with two modes:

| Mode | Values | Requires |
| --- | --- | --- |
| Helm hook (default) | `tls.patch.enabled: true` | Nothing: a Job auto-generates a self-signed certificate and patches the webhook CA bundle |
| cert-manager | `tls.certManager.enabled: true` + `tls.patch.enabled: false` | cert-manager installed in the cluster |
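
To switch to cert-manager mode, the two values from the table can be passed with `--set` at install time. The sketch below reuses the chart location and version from the Quick Start and assumes cert-manager is already running in the cluster. The wrapper function name `install_isolator_cert_manager` is hypothetical and only there to keep the command reusable.

```shell
# Install the chart with cert-manager issuing the webhook certificate
# instead of the default Helm hook Job.
install_isolator_cert_manager() {
  helm install kai-resource-isolator oci://docker.io/projecthami/kai-resource-isolator \
    --namespace kai-resource-isolator --create-namespace \
    --version 1.0.0-chart \
    --set tls.certManager.enabled=true \
    --set tls.patch.enabled=false
}
```

Both values are set together because the table lists them as a pair: enabling cert-manager while leaving the patch Job on would configure two certificate sources for the same webhook.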

Design

GPU sharing in KAI-Scheduler allows a Pod to request a fraction of a GPU (e.g. 0.5) or a specific amount of GPU memory. Without memory isolation, however, containers could still access the full GPU memory at the CUDA level.

kai-resource-isolator closes this gap by combining two components:

| Component | Role |
| --- | --- |
| DaemonSet (libsync) | Copies `libvgpu.so` (HAMi-core) to `/usr/local/vgpu` on every GPU node |
| Mutating webhook | Injects the `libvgpu` hostPath volume and `ld.so.preload` into Pods that request GPU-sharing resources |

The full flow when a GPU-sharing Pod is submitted:

1. KAI-Scheduler selects a node and injects the CUDA_DEVICE_MEMORY_LIMIT environment variable into the Pod, set to the allocated memory amount.
2. The kai-resource-isolator webhook injects a hostPath volume mount (/usr/local/vgpu) and patches /etc/ld.so.preload so that libvgpu.so is loaded by the container at runtime.
3. The container starts; libvgpu.so intercepts CUDA memory allocation calls and enforces the limit set by CUDA_DEVICE_MEMORY_LIMIT.
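
The result of this flow can be checked from inside a running GPU-sharing container. The following is a minimal sketch, not part of the project: the function name `check_isolation` is hypothetical, and it only confirms that the environment variable and the preload entry described above are present, not that the limit is actually enforced.

```shell
# Report whether the isolation hooks described above are visible.
# Arg 1: path to the preload file (defaults to /etc/ld.so.preload).
check_isolation() {
  preload_file="${1:-/etc/ld.so.preload}"
  if [ -z "${CUDA_DEVICE_MEMORY_LIMIT:-}" ]; then
    echo "CUDA_DEVICE_MEMORY_LIMIT is not set" >&2
    return 1
  fi
  if ! grep -q 'libvgpu\.so' "$preload_file" 2>/dev/null; then
    echo "libvgpu.so is not listed in $preload_file" >&2
    return 1
  fi
  echo "isolation configured: limit=$CUDA_DEVICE_MEMORY_LIMIT"
}
```

A non-zero exit status indicates which of the two injection steps did not take effect for that container.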
