Fix HAMi-core Unusable by archlitchi · Pull Request #39 · Project-HAMi/k8s-dra-driver

archlitchi · 2026-05-12T08:16:32Z

I tried to launch a task with HAMi-DRA + k8s-dra-driver, but it failed. Below is the analysis suggested by claude-code.

Bug 1 — GPU Devices Discovered but Never Stored (nvlib.go)
Location: GetPerGpuAllocatableDevices() in nvlib.go, lines 256–259

Root Cause:

 if featuregates.Enabled(featuregates.HAMiCoreSupport) {
-    thisGPUAllocatable[gpuInfo.CanonicalName()] = l.wrapHAMiCoreGpu(parentdev)
+    hamiDev := l.wrapHAMiCoreGpu(parentdev)
+    thisGPUAllocatable[hamiDev.CanonicalName()] = hamiDev
+    perGPUAllocatable[gpuInfo.minor] = thisGPUAllocatable   // ← this line was missing
     return nil
 }

The function iterates over physical GPUs via VisitDevices. For each GPU, it creates a local map thisGPUAllocatable, wraps the GPU as a hami-gpu virtual device, and stores it in that map. It then immediately returns nil without ever writing thisGPUAllocatable into perGPUAllocatable, which is the map that the function ultimately returns.

Consequence:

GetPerGpuAllocatableDevices() always returns an empty map. enumerateAllPossibleDevices() therefore also returns zero devices. The ResourceSlice published to the Kubernetes API has devices: null, so the scheduler sees no GPU resources available on any node. No GPU workload can be scheduled, and the Monitor component collects no metrics (empty /metrics response).
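The failure mode can be reproduced in isolation. The following is a minimal sketch, not the driver's code: device, deviceMap, visitGPUs, and collect are hypothetical stand-ins, and the storeResult flag exists only to toggle between the buggy and the fixed behavior.

```go
package main

import "fmt"

// device and deviceMap are hypothetical stand-ins for the driver's types.
type device struct{ name string }
type deviceMap map[string]device

// visitGPUs mimics a VisitDevices-style iterator: it calls visit once per GPU minor.
func visitGPUs(minors []int, visit func(minor int) error) error {
	for _, m := range minors {
		if err := visit(m); err != nil {
			return err
		}
	}
	return nil
}

// collect builds the per-GPU result map. With storeResult == false it
// reproduces Bug 1: the local map is filled and then dropped on return.
func collect(minors []int, storeResult bool) map[int]deviceMap {
	perGPU := make(map[int]deviceMap)
	_ = visitGPUs(minors, func(minor int) error {
		thisGPU := make(deviceMap)
		name := fmt.Sprintf("hami-gpu-%d", minor)
		thisGPU[name] = device{name: name}
		if storeResult {
			perGPU[minor] = thisGPU // the assignment that was missing
		}
		return nil
	})
	return perGPU
}

func main() {
	fmt.Println(len(collect([]int{0, 1}, false))) // prints 0
	fmt.Println(len(collect([]int{0, 1}, true)))  // prints 2
}
```

Because the callback returns nil in both variants, the iteration itself reports success, which is why the empty result only becomes visible downstream in the ResourceSlice.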

Bug 2 — Wrong Map Key for Device Lookup (nvlib.go)
Location: Same HAMiCoreSupport block in nvlib.go, line 257

-thisGPUAllocatable[gpuInfo.CanonicalName()] = l.wrapHAMiCoreGpu(parentdev)
+hamiDev := l.wrapHAMiCoreGpu(parentdev)
+thisGPUAllocatable[hamiDev.CanonicalName()] = hamiDev

gpuInfo is a *GpuInfo whose CanonicalName() returns a name with the "gpu-" prefix (e.g. "gpu-0"). After wrapping via wrapHAMiCoreGpu(), the device becomes a HAMiGpuInfo whose CanonicalName() returns a name with the "hami-gpu-" prefix (e.g. "hami-gpu-0"). The original code stores the device under the pre-wrap key "gpu-0", but the Kubernetes scheduler allocates the device by its post-wrap name "hami-gpu-0" in the ResourceClaim.

Consequence:

When kubelet calls NodePrepareResources to prepare a ResourceClaim, the driver looks up s.allocatable["hami-gpu-0"] and finds nothing, because the actual entry is stored under "gpu-0". The driver returns the error:

prepare devices failed: requested device is not allocatable: hami-gpu-0

The Pod stays in ContainerCreating indefinitely. Even if Bug 1 were fixed alone (devices do appear in the ResourceSlice and a workload gets scheduled), the Pod would still fail to start because NodePrepareResources cannot resolve the device name.
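The key mismatch can be sketched as follows. This is an illustration under assumptions, not the driver's implementation: gpuInfo and hamiGpuInfo are simplified hypothetical types, and buildAllocatable and prepare are invented helpers that only model the map insert and the lookup.

```go
package main

import "fmt"

// gpuInfo and hamiGpuInfo are simplified, hypothetical versions of the driver's types.
type gpuInfo struct{ minor int }

func (g gpuInfo) CanonicalName() string { return fmt.Sprintf("gpu-%d", g.minor) }

type hamiGpuInfo struct{ minor int }

func (h hamiGpuInfo) CanonicalName() string { return fmt.Sprintf("hami-gpu-%d", h.minor) }

// buildAllocatable stores the wrapped device either under the pre-wrap name
// (keyByWrapped == false, as in Bug 2) or under the post-wrap name (the fix).
func buildAllocatable(minor int, keyByWrapped bool) map[string]hamiGpuInfo {
	parent := gpuInfo{minor: minor}
	wrapped := hamiGpuInfo{minor: minor}
	key := parent.CanonicalName()
	if keyByWrapped {
		key = wrapped.CanonicalName()
	}
	return map[string]hamiGpuInfo{key: wrapped}
}

// prepare models the lookup done at claim preparation time.
func prepare(allocatable map[string]hamiGpuInfo, requested string) error {
	if _, ok := allocatable[requested]; !ok {
		return fmt.Errorf("requested device is not allocatable: %s", requested)
	}
	return nil
}

func main() {
	// Prints: requested device is not allocatable: hami-gpu-0
	fmt.Println(prepare(buildAllocatable(0, false), "hami-gpu-0"))
	// Prints: <nil>
	fmt.Println(prepare(buildAllocatable(0, true), "hami-gpu-0"))
}
```

Deriving the key from the stored value, rather than from the object it was built from, keeps the map key and the advertised device name from drifting apart.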

Bug 3 — HAMiGpu Excluded from CDI Spec Cache Warmup (device_state.go)
Location: NewDeviceState() in device_state.go, lines 115–122

Root Cause:

 for _, dev := range allocatable {
     if dev.Gpu != nil {
         fullGPUuuids = append(fullGPUuuids, dev.Gpu.UUID)
     } else if dev.HAMiGpu != nil {
         fullGPUuuids = append(fullGPUuuids, dev.HAMiGpu.UUID)
     }
 }

wrapHAMiCoreGpu() explicitly sets parentDev.Gpu = nil and moves the GPU info into parentDev.HAMiGpu. The original CDI cache warmup loop only checked dev.Gpu != nil, so it never collected UUIDs from HAMiGpu devices and the warmup ran with an empty UUID list. The else if branch shown above is the fix.

Consequence:

The CDI spec cache is not pre-populated for any HAMi virtual GPU. The log always shows:

Warming up CDI device spec cache for GPUs []

At device preparation time, the CDI spec cache must be consulted to generate the correct CDI device edits (device file mounts, LD_PRELOAD injection, etc.). An empty cache means the CDI edits for the physical GPU (e.g. management.nvidia.com-gpu.yaml) are not looked up ahead of time. In practice, the driver falls back to the HAMi-specific GetCDIContainerEdits path in hami_core.go for the libvgpu.so injection, so this bug does not cause an outright failure on its own. It does, however, add unnecessary latency to the first preparation of each device and removes a safety net that would catch CDI spec errors early at startup rather than at pod creation time.
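The effect on the UUID list can be checked with a small standalone sketch. The types below are hypothetical reductions of the driver's structures, the UUID strings are made up, and the includeHAMi flag is only there to compare the loop before and after the fix; the sort is added so the output is deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// gpu and allocatableDevice are hypothetical, reduced versions of the driver's types.
type gpu struct{ UUID string }

type allocatableDevice struct {
	Gpu     *gpu // nil once the device has been wrapped as a HAMi GPU
	HAMiGpu *gpu
}

// warmupUUIDs collects the UUIDs that would be passed to the cache warmup.
// With includeHAMi == false it behaves like the loop before the fix.
func warmupUUIDs(allocatable map[string]*allocatableDevice, includeHAMi bool) []string {
	var uuids []string
	for _, dev := range allocatable {
		if dev.Gpu != nil {
			uuids = append(uuids, dev.Gpu.UUID)
		} else if includeHAMi && dev.HAMiGpu != nil {
			uuids = append(uuids, dev.HAMiGpu.UUID)
		}
	}
	sort.Strings(uuids) // map iteration order is random; sort for stable output
	return uuids
}

func main() {
	// Placeholder UUIDs, for illustration only.
	allocatable := map[string]*allocatableDevice{
		"hami-gpu-0": {HAMiGpu: &gpu{UUID: "GPU-example-0"}},
		"hami-gpu-1": {HAMiGpu: &gpu{UUID: "GPU-example-1"}},
	}
	fmt.Println(warmupUUIDs(allocatable, false)) // prints []
	fmt.Println(warmupUUIDs(allocatable, true))  // prints [GPU-example-0 GPU-example-1]
}
```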

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Shouren

/lgtm

hami-robot · 2026-05-12T08:29:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, Shouren


Needs approval from an approver in each of these files:

~~OWNERS~~ [Shouren,archlitchi]


fix hami-core

e48ac70

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

hami-robot Bot added the dco-signoff: yes label May 12, 2026

hami-robot Bot requested a review from Shouren May 12, 2026 08:16

hami-robot Bot added the approved label May 12, 2026

github-actions Bot added the bug Something isn't working label May 12, 2026

hami-robot Bot added the size/XS label May 12, 2026

Shouren approved these changes May 12, 2026

hami-robot Bot assigned Shouren May 12, 2026

hami-robot Bot added the lgtm label May 12, 2026

hami-robot Bot merged commit ebf3cef into main May 12, 2026
4 checks passed

hami-robot[bot] merged 1 commit into main from update
