22 changes: 21 additions & 1 deletion AGENTS.md
@@ -150,6 +150,22 @@ This section defines the architectural agents within the project for SDD.
* Provides `NewMigSpecTupleFromCanonicalName` for parsing canonical device names (e.g., `gpu-1-mig-2g47gb-14-0`) back into `MigSpecTuple`.
* **Constraints:** Used exclusively within the `DynamicMIG` feature gate code paths.

### 9. HAMi Core Monitor (Metrics & QoS Agent)
**Source:** upstream `vGPUmonitor` binary (from `projecthami/hami:${HAMI_VGPUMONITOR_IMAGE}`)
* **Role:** Exports Prometheus GPU metrics and performs soft-QoS feedback for HAMi-Core virtualized workloads.
* **Responsibilities:**
* Scans `<hostHookPath>/vgpu/containers/<podUID>_<containerName>/` for `.cache` files created by `libvgpu.so`.
* Auto-detects v0 (`1197897`-byte) and v1 (`majorVersion == 1`) cache formats for backward compatibility.
* Emits per-container vGPU metrics with pod-aware labels.
* Emits host-level GPU metrics (`hami_host_gpu_memory_used_bytes`, `hami_host_gpu_utilization_ratio`) via NVML.
* Applies soft-QoS feedback (`recentKernel`/`utilizationSwitch`) by reading and writing the mmapped shared region.
* Serves metrics on `:9394/metrics`.
* **Constraints (DRA Mode):**
* Runs as a DaemonSet sidecar in the kubelet plugin pod.
* Activated by `DRA_MODE=true`; in this mode MIG metrics collection and stale-cache self-cleanup are disabled (the DRA driver owns lifecycle cleanup).
* Requires `HOOK_PATH` and `NODE_NAME` environment variables.
* Requires `host-vgpu` and `host-tmp` volume mounts for cache access.
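
The cache layout that the monitor scans can be sketched as a small path helper. This is a minimal illustration, assuming the `<hostHookPath>/vgpu/containers/<podUID>_<containerName>/<claimUID>.cache` layout described in this change; the function name `cacheFilePath` and the identifiers in `main` are hypothetical placeholders.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// cacheFilePath builds the per-container cache file location:
// <hostHookPath>/vgpu/containers/<podUID>_<containerName>/<claimUID>.cache
// The function name is hypothetical; only the layout comes from this document.
func cacheFilePath(hostHookPath, podUID, containerName, claimUID string) string {
	dir := filepath.Join(hostHookPath, "vgpu", "containers", podUID+"_"+containerName)
	return filepath.Join(dir, claimUID+".cache")
}

func main() {
	// Placeholder identifiers, not real UIDs.
	fmt.Println(cacheFilePath("/usr/local", "pod-1234", "trainer", "claim-5678"))
}
```

Keying the directory on pod UID and container name is what lets the monitor attach pod-aware labels without querying the claim itself.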

---

## Part 3: Feature Gate Registry
@@ -225,6 +241,7 @@ The project produces a single distroless-based container image that bundles all
| Path in Image | Source Stage | Purpose |
|---|---|---|
| `/usr/bin/hami-kubelet-plugin` | `build` | Main Driver Agent binary. |
| `/usr/bin/vGPUmonitor` | upstream HAMi image (`projecthami/hami:*`) | GPU monitor and Prometheus metrics exporter for HAMi-Core. |
| `/usr/local/lib/hami/libvgpu.so` | `hami-core-build` | Enforcement library injected into containers. |
| `/usr/local/lib/hami/ld.so.preload` | `hami-core-build` | Preload config that activates `libvgpu.so` in containers. |
| `/usr/bin/vgpu-init.sh` | `hami-core-build` | Node-level initialization script for vGPU. |
@@ -246,11 +263,13 @@ helm install hami-dra-driver ./chart/hami-dra-driver \
```

Key templates:
- `daemonset.yaml` — Deploys the kubelet plugin DaemonSet.
- `daemonset.yaml` — Deploys the kubelet plugin DaemonSet; conditionally injects the `vGPUmonitor` sidecar when `monitor.enabled=true`.
- `rbac-kubeletplugin.yaml.yaml` — RBAC including granular DRA status authorization rules.
- `deviceclass-hami-gpu.yaml` — The `DeviceClass` for `hami-core-gpu.project-hami.io`.
- `validation.yaml` — Helm validation hooks.

The `monitor.enabled` value (default `false`) controls whether the metrics sidecar is rendered. When enabled, the sidecar mounts `host-vgpu` and `host-tmp` and exposes port `9394`. The kubelet plugin itself does not expose metrics; the monitor container serves all of them for Prometheus to scrape.
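
One way to check that the sidecar serves the host-level metrics named in this document is to filter a scrape payload by metric-name prefix. The sketch below assumes the standard Prometheus text exposition format. The helper name `filterMetrics` and the sample payload, including its label names and values, are invented for illustration and are not taken from `vGPUmonitor` output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// filterMetrics returns the sample lines of a Prometheus text-format payload
// whose metric name starts with prefix. Comment lines (# HELP, # TYPE) and
// blank lines are skipped.
func filterMetrics(payload, prefix string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if strings.HasPrefix(line, prefix) {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// Hypothetical payload: label names and values are invented for illustration.
	sample := "# TYPE hami_host_gpu_memory_used_bytes gauge\n" +
		"hami_host_gpu_memory_used_bytes{deviceidx=\"0\"} 1024\n" +
		"hami_host_gpu_utilization_ratio{deviceidx=\"0\"} 0.5\n" +
		"go_goroutines 12\n"
	for _, l := range filterMetrics(sample, "hami_host_gpu_") {
		fmt.Println(l)
	}
}
```

In a real cluster the payload would come from an HTTP GET against the sidecar's `:9394/metrics` endpoint.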

### 4. Build Commands
The build is orchestrated via `Makefile` (top-level) and `deploy/container/Makefile` (image builds).

@@ -291,3 +310,4 @@ make -f deploy/container/Makefile build BUILD_MULTI_ARCH_IMAGES=true PUSH_ON_BUI
| `0d0d90a` | feat: Support install with helm chart | Added `chart/hami-dra-driver/` for Helm-based cluster deployment. |
| `a2ad09e` | fix: inject failed for hami-gpu | Prepare logic bypasses overlap validation and partial-rollback when `HAMiCoreSupport` is enabled; completed claims are non-idempotent. |
| `6841f23` | fix: invalide featuregates | `pkg/flags/` package extracted for reusable CLI flags (`FeatureGateConfig`, `LoggingConfig`, `KubeClientConfig`); `ComputeDomainCliques` default changed to `false`. |
| `HEAD` | feat: add vGPUmonitor DRA support | Replaced `cmd/hami-core-monitor/` with upstream `vGPUmonitor`. DRA driver creates `<podUID>_<containerName>/<claimUID>.cache` layout. `HAMI_VGPUMONITOR_IMAGE` build-arg is configurable. |
32 changes: 32 additions & 0 deletions chart/hami-dra-driver/templates/daemonset.yaml
@@ -128,6 +128,8 @@ spec:
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: HOOK_PATH
value: {{ .Values.driver.hostHookPath | quote }}
- name: IMAGE_NAME
value: {{ include "hami-dra-driver.fullimage" . }}
{{- if .Values.nvidiaCDIHookPath }}
@@ -176,6 +178,36 @@ spec:
mountPath: /proc/
mountPropagation: Bidirectional
{{- end }}
{{- if .Values.monitor.enabled }}
- name: monitor
image: {{ include "hami-dra-driver.fullimage" . }}
imagePullPolicy: {{ .Values.image.pullPolicy }}
securityContext:
privileged: true
command: ["vGPUmonitor"]
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: HOOK_PATH
value: {{ .Values.driver.vgpuInitPath | quote }}
- name: DRA_MODE
value: "true"
{{- with .Values.monitor.resources }}
resources:
{{- toYaml . | nindent 10 }}
{{- end }}
ports:
- name: metrics
containerPort: 9394
protocol: TCP
volumeMounts:
- name: host-vgpu
mountPath: {{ .Values.driver.vgpuInitPath | quote }}
- name: host-tmp
mountPath: {{ .Values.driver.hostTmp | quote }}
{{- end }}
volumes:
- name: plugins-registry
hostPath:
5 changes: 5 additions & 0 deletions chart/hami-dra-driver/values.yaml
@@ -65,6 +65,7 @@ driver:
cdiRoot: /var/run/cdi
vgpuInitPath: /usr/local/vgpu
hostTmp: /tmp
hostHookPath: /usr/local

# Feature gates forwarded to the hami-kubelet-plugin binary as the
# FEATURE_GATES environment variable.
@@ -78,6 +79,10 @@ featureGates: {}
# 0 = errors/warnings/info only; higher numbers increase verbosity.
logVerbosity: "4"

monitor:
enabled: false
resources: {}

kubeletPlugin:
priorityClassName: "system-node-critical"
updateStrategy:
5 changes: 4 additions & 1 deletion cmd/hami-kubelet-plugin/device_state.go
@@ -125,7 +125,10 @@ func NewDeviceState(ctx context.Context, config *Config) (*DeviceState, error) {

var hamiCoreManager *HAMiCoreManager
if featuregates.Enabled(featuregates.HAMiCoreSupport) {
hamiCoreManager = NewHAMiCoreManager(nvdevlib)
hamiCoreManager = NewHAMiCoreManager(nvdevlib, config.flags.hostHookPath, config.clientsets.Core, config.flags.nodeName)
if !hamiCoreManager.WaitForPodCacheSync(ctx) {
klog.Warning("HAMiCoreManager Pod cache sync was cancelled or timed out; claim-to-pod resolution may be unavailable initially")
}
}

var tsManager *TimeSlicingManager
4 changes: 4 additions & 0 deletions cmd/hami-kubelet-plugin/driver.go
@@ -288,6 +288,10 @@ func (d *driver) Shutdown() error {

d.wg.Wait()

if d.state.hamiCoreManager != nil {
d.state.hamiCoreManager.Stop()
}

if err := d.state.checkpointCleanupManager.Stop(); err != nil {
return fmt.Errorf("error stopping CheckpointCleanupManager: %w", err)
}
170 changes: 149 additions & 21 deletions cmd/hami-kubelet-plugin/hami_core.go
@@ -17,17 +17,25 @@ limitations under the License.
package main

import (
"context"
"fmt"
"maps"
"os"
"path/filepath"
"slices"
"strconv"
"time"

"github.com/Masterminds/semver"
"github.com/google/uuid"

corev1 "k8s.io/api/core/v1"
resourceapi "k8s.io/api/resource/v1"
"k8s.io/apimachinery/pkg/api/resource"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
corelisters "k8s.io/client-go/listers/core/v1"
"k8s.io/client-go/tools/cache"
"k8s.io/dynamic-resource-allocation/kubeletplugin"
"k8s.io/klog/v2"
"k8s.io/utils/ptr"
@@ -172,13 +180,47 @@ func (g *PreparedDeviceGroup) HAMIGpuUUIDs() []string {
type HAMiCoreManager struct {
hostHookPath string
nvdevlib *deviceLib
nodeName string

podInformerFactory informers.SharedInformerFactory
podLister corelisters.PodLister
podListerSynced cache.InformerSynced
stopCh chan struct{}
}

func NewHAMiCoreManager(deviceLib *deviceLib) *HAMiCoreManager {
return &HAMiCoreManager{
func NewHAMiCoreManager(deviceLib *deviceLib, hostHookPath string, clientset kubernetes.Interface, nodeName string) *HAMiCoreManager {
m := &HAMiCoreManager{
nvdevlib: deviceLib,
hostHookPath: "/usr/local",
hostHookPath: hostHookPath,
nodeName: nodeName,
stopCh: make(chan struct{}),
}
if clientset != nil {
m.podInformerFactory = informers.NewSharedInformerFactoryWithOptions(
clientset,
30*time.Minute,
informers.WithTweakListOptions(func(lo *metav1.ListOptions) {
lo.FieldSelector = "spec.nodeName=" + nodeName
}),
)
podInformer := m.podInformerFactory.Core().V1().Pods()
m.podLister = podInformer.Lister()
m.podListerSynced = podInformer.Informer().HasSynced
m.podInformerFactory.Start(m.stopCh)
}
return m
}

// WaitForPodCacheSync blocks until the local Pod cache has synced for the first time.
func (m *HAMiCoreManager) WaitForPodCacheSync(ctx context.Context) bool {
if m.podListerSynced == nil {
return true
}
return cache.WaitForCacheSync(ctx.Done(), m.podListerSynced)
}

func (m *HAMiCoreManager) Stop() {
close(m.stopCh)
}
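
Review note: `Stop` closes `stopCh` unconditionally, and closing an already-closed channel panics in Go, so a second call would crash the plugin during shutdown. A minimal sketch of one way to make it idempotent with `sync.Once` follows. The `stopper` type is a hypothetical stand-in for the relevant part of `HAMiCoreManager`, not code from this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// stopper is a hypothetical stand-in for the stop-channel part of HAMiCoreManager.
type stopper struct {
	stopCh   chan struct{}
	stopOnce sync.Once
}

func newStopper() *stopper {
	return &stopper{stopCh: make(chan struct{})}
}

// Stop closes stopCh exactly once, so repeated calls do not panic.
func (s *stopper) Stop() {
	s.stopOnce.Do(func() { close(s.stopCh) })
}

// stopped reports whether Stop has been called.
func (s *stopper) stopped() bool {
	select {
	case <-s.stopCh:
		return true
	default:
		return false
	}
}

func main() {
	s := newStopper()
	s.Stop()
	s.Stop() // safe: the second call is a no-op
	fmt.Println(s.stopped())
}
```

Whether this guard is needed depends on whether `Shutdown` can run more than once; if it cannot, the single `close` is sufficient.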

func (m *HAMiCoreManager) getConsumableCapacityMap(claim *resourceapi.ResourceClaim) map[string]map[resourceapi.QualifiedName]resource.Quantity {
@@ -193,26 +235,91 @@ func (m *HAMiCoreManager) getConsumableCapacityMap(claim *resourceapi.ResourceCl
return resMap
}

func (m *HAMiCoreManager) GetCDIContainerEdits(claim *resourceapi.ResourceClaim, devs AllocatableDevices) *cdiapi.ContainerEdits {
cacheFileHostDirectory := fmt.Sprintf("%s/vgpu/claims/%s", m.hostHookPath, claim.UID)
// TODO: We should check the status of the claim, because two pods may share it
var err error
err = os.RemoveAll(cacheFileHostDirectory)
if err != nil {
klog.Warningf("Failed to remove host directory for cachefile %s: %s", cacheFileHostDirectory, err)
// resolveClaimToPod searches the local Pod informer cache for the Pod that
// reserved the given claim. HAMi DRA guarantees a 1:1 claim-to-container
// binding, so it also returns the exact container name.
func (m *HAMiCoreManager) resolveClaimToPod(claim *resourceapi.ResourceClaim) (*corev1.Pod, string, error) {
if m.podLister == nil {
return nil, "", fmt.Errorf("pod lister not initialized")
}
if len(claim.Status.ReservedFor) == 0 {
return nil, "", fmt.Errorf("claim %s has no ReservedFor entries", claim.UID)
}
err = os.MkdirAll(cacheFileHostDirectory, 0777)

// Find the Pod that reserved this claim.
consumer := claim.Status.ReservedFor[0]
if consumer.Resource != "pods" {
return nil, "", fmt.Errorf("claim %s reservedFor[0] is not a Pod", claim.UID)
}

pod, err := m.podLister.Pods(claim.Namespace).Get(consumer.Name)
if err != nil {
klog.Warningf("Failed to create host directory for cachefile %s: %s", cacheFileHostDirectory, err)
return nil, "", fmt.Errorf("pod %s/%s not found in local cache: %w", claim.Namespace, consumer.Name, err)
}

// HAMi DRA design guarantees one claim per container, but we defensively
// iterate over all containers and init containers.
var containerName string
for _, c := range pod.Spec.Containers {
for _, rc := range c.Resources.Claims {
if rc.Name == claim.Name {
containerName = c.Name
break
}
}
if containerName != "" {
break
}
}
if containerName == "" {
for _, c := range pod.Spec.InitContainers {
for _, rc := range c.Resources.Claims {
if rc.Name == claim.Name {
containerName = c.Name
break
}
}
if containerName != "" {
break
}
}
}
err = os.Chmod(cacheFileHostDirectory, 0777)
if containerName == "" {
return nil, "", fmt.Errorf("no container in pod %s/%s references claim %s", claim.Namespace, pod.Name, claim.Name)
}

return pod, containerName, nil
}
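
Review note: in the Kubernetes API, the `Name` in a container's `Resources.Claims` refers to an entry in `pod.spec.resourceClaims`, which is a pod-local name. It can differ from the ResourceClaim object's name, most visibly when the claim is generated from a template and the object name is recorded in `pod.status.resourceClaimStatuses`. If that case applies to this driver, comparing `rc.Name` with `claim.Name` directly would miss the container. The sketch below shows a two-step lookup under that assumption. All types are hypothetical, trimmed-down mirrors of the `corev1` fields, and the function names are illustrative.

```go
package main

import "fmt"

// Hypothetical, trimmed-down mirrors of the corev1 fields involved.
type podResourceClaim struct {
	Name              string  // pod-local name, referenced by containers
	ResourceClaimName *string // set when the pod names an existing ResourceClaim
}

type podResourceClaimStatus struct {
	Name              string  // pod-local name
	ResourceClaimName *string // name of the generated ResourceClaim object
}

type container struct {
	Name   string
	Claims []string // pod-local claim names (Resources.Claims[].Name)
}

type pod struct {
	ResourceClaims        []podResourceClaim
	ResourceClaimStatuses []podResourceClaimStatus
	Containers            []container
	InitContainers        []container
}

// podLocalClaimName maps a ResourceClaim object name to the pod-local entry name.
func podLocalClaimName(p pod, claimObjectName string) (string, bool) {
	for _, rc := range p.ResourceClaims {
		if rc.ResourceClaimName != nil && *rc.ResourceClaimName == claimObjectName {
			return rc.Name, true
		}
	}
	for _, st := range p.ResourceClaimStatuses {
		if st.ResourceClaimName != nil && *st.ResourceClaimName == claimObjectName {
			return st.Name, true
		}
	}
	return "", false
}

// containerForClaim returns the first container (regular containers first,
// then init containers) that references the claim by its pod-local name.
func containerForClaim(p pod, claimObjectName string) (string, bool) {
	local, ok := podLocalClaimName(p, claimObjectName)
	if !ok {
		return "", false
	}
	for _, group := range [][]container{p.Containers, p.InitContainers} {
		for _, c := range group {
			for _, name := range c.Claims {
				if name == local {
					return c.Name, true
				}
			}
		}
	}
	return "", false
}

func main() {
	generated := "pod-a-gpu-x7k2p" // placeholder generated claim name
	p := pod{
		ResourceClaims:        []podResourceClaim{{Name: "gpu"}},
		ResourceClaimStatuses: []podResourceClaimStatus{{Name: "gpu", ResourceClaimName: &generated}},
		Containers:            []container{{Name: "trainer", Claims: []string{"gpu"}}},
	}
	fmt.Println(containerForClaim(p, generated))
}
```

Iterating over a slice of the two container lists also removes the duplicated loop for init containers.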

func (m *HAMiCoreManager) GetCDIContainerEdits(claim *resourceapi.ResourceClaim, devs AllocatableDevices) *cdiapi.ContainerEdits {
pod, containerName, err := m.resolveClaimToPod(claim)
if err != nil {
klog.Warningf("Failed to change mod of host directory for cachefile %s: %s", cacheFileHostDirectory, err)
klog.Warningf("HAMiCoreManager: cannot resolve claim %s to pod/container: %v", claim.UID, err)
// Fallback to claim-scoped directory so that Prepare does not hard-fail.
// Metrics will be incomplete, but the workload can still run.
pod = &corev1.Pod{}
pod.UID = claim.UID
containerName = "unknown"
}

podUID := string(pod.UID)
cacheFileHostDirectory := filepath.Join(m.hostHookPath, "vgpu", "containers", podUID+"_"+containerName)
cacheFilePath := filepath.Join(cacheFileHostDirectory, string(claim.UID)+".cache")

// Clean up and recreate the directory for this pod+container.
if err := os.RemoveAll(cacheFileHostDirectory); err != nil {
klog.Warningf("Failed to remove host directory for cachefile %s: %v", cacheFileHostDirectory, err)
}
if err := os.MkdirAll(cacheFileHostDirectory, 0777); err != nil {
klog.Warningf("Failed to create host directory for cachefile %s: %v", cacheFileHostDirectory, err)
}
if err := os.Chmod(cacheFileHostDirectory, 0777); err != nil {
klog.Warningf("Failed to chmod host directory for cachefile %s: %v", cacheFileHostDirectory, err)
}

hamiEnvs := []string{}
// TODO: Get SM Limit from Claim's Annotation
hamiEnvs = append(hamiEnvs, fmt.Sprintf("CUDA_DEVICE_MEMORY_SHARED_CACHE=%s", fmt.Sprintf("%s/%v.cache", cacheFileHostDirectory, uuid.New().String())))
hamiEnvs = append(hamiEnvs, fmt.Sprintf("CUDA_DEVICE_MEMORY_SHARED_CACHE=%s", cacheFilePath))

devCapMap := m.getConsumableCapacityMap(claim)
idx := 0
@@ -255,14 +362,14 @@ func (m *HAMiCoreManager) GetCDIContainerEdits(claim *resourceapi.ResourceClaim,
Options: []string{"rw", "nosuid", "nodev", "bind"},
},
{
ContainerPath: m.hostHookPath + "/vgpu/libvgpu.so",
HostPath: m.hostHookPath + "/vgpu/libvgpu.so",
ContainerPath: filepath.Join(m.hostHookPath, "vgpu", "libvgpu.so"),
HostPath: filepath.Join(m.hostHookPath, "vgpu", "libvgpu.so"),
Options: []string{"ro", "nosuid", "nodev", "bind"},
},
// TODO: Check CUDA_DISABLE_CONTROL env before mount ld.so.preload
{
ContainerPath: "/etc/ld.so.preload",
HostPath: m.hostHookPath + "/vgpu/ld.so.preload",
HostPath: filepath.Join(m.hostHookPath, "vgpu", "ld.so.preload"),
Options: []string{"ro", "nosuid", "nodev", "bind"},
},
{
@@ -276,8 +383,29 @@ func (m *HAMiCoreManager) GetCDIContainerEdits(claim *resourceapi.ResourceClaim,
}

func (m *HAMiCoreManager) Unprepare(claimUID string, pl PreparedDeviceList) error {
path := fmt.Sprintf("%s/vgpu/claims/%s", m.hostHookPath, claimUID)
_ = os.RemoveAll(path)
containersPath := filepath.Join(m.hostHookPath, "vgpu", "containers")
entries, err := os.ReadDir(containersPath)
if err != nil {
if os.IsNotExist(err) {
return nil
}
return fmt.Errorf("failed to list containers path %s: %w", containersPath, err)
}
for _, entry := range entries {
if !entry.IsDir() {
continue
}
claimCache := filepath.Join(containersPath, entry.Name(), claimUID+".cache")
if _, err := os.Stat(claimCache); err == nil {
dirToRemove := filepath.Join(containersPath, entry.Name())
if err := os.RemoveAll(dirToRemove); err != nil {
return fmt.Errorf("failed to remove container cache directory %s: %w", dirToRemove, err)
}
klog.V(4).Infof("Unprepare: removed HAMi-Core cache directory %s for claim %s", dirToRemove, claimUID)
return nil
}
}
klog.V(4).Infof("Unprepare: no HAMi-Core cache directory found for claim %s", claimUID)
return nil
}

8 changes: 8 additions & 0 deletions cmd/hami-kubelet-plugin/main.go
@@ -53,6 +53,7 @@ type Flags struct {
cdiRoot string
containerDriverRoot string
hostDriverRoot string
hostHookPath string
nvidiaCDIHookPath string
imageName string
kubeletRegistrarDirectoryPath string
@@ -120,6 +121,13 @@ func newApp() *cli.App {
Destination: &flags.containerDriverRoot,
EnvVars: []string{"DRIVER_ROOT_CTR_PATH"},
},
&cli.StringFlag{
Name: "host-hook-path",
Value: "/usr/local",
Usage: "the host path where vGPU hooks and claim caches are rooted (the container must have this path mounted)",
Destination: &flags.hostHookPath,
EnvVars: []string{"HOOK_PATH"},
},
&cli.StringFlag{
Name: "nvidia-cdi-hook-path",
Usage: "Absolute path to the nvidia-cdi-hook executable in the host file system. Used in the generated CDI specification.",
Expand Down