Fix HAMi-core Unusable #39

Merged: hami-robot[bot] merged 1 commit into main from update on May 12, 2026

Conversation

@archlitchi

I tried to launch a task with HAMi-DRA + k8s-dra-driver, but it failed. Here is the analysis suggested by Claude Code:

Bug 1 — GPU Devices Discovered but Never Stored (nvlib.go)
Location: GetPerGpuAllocatableDevices() in nvlib.go, lines 256–259

Root Cause:

 if featuregates.Enabled(featuregates.HAMiCoreSupport) {
-    thisGPUAllocatable[gpuInfo.CanonicalName()] = l.wrapHAMiCoreGpu(parentdev)
+    hamiDev := l.wrapHAMiCoreGpu(parentdev)
+    thisGPUAllocatable[hamiDev.CanonicalName()] = hamiDev
+    perGPUAllocatable[gpuInfo.minor] = thisGPUAllocatable   // ← this line was missing
     return nil
 }

The function iterates over physical GPUs via VisitDevices. For each GPU, it creates a local map thisGPUAllocatable, wraps the GPU as a hami-gpu virtual device, and stores it there. It then immediately returns nil without ever writing thisGPUAllocatable into perGPUAllocatable, the map that the function ultimately returns.

Consequence:

GetPerGpuAllocatableDevices() always returns an empty map. enumerateAllPossibleDevices() therefore also returns zero devices. The ResourceSlice published to the Kubernetes API has devices: null, so the scheduler sees no GPU resources available on any node. No GPU workload can be scheduled, and the Monitor component collects no metrics (empty /metrics response).

Bug 2 — Wrong Map Key for Device Lookup (nvlib.go)
Location: Same HAMiCoreSupport block in nvlib.go, line 257

-thisGPUAllocatable[gpuInfo.CanonicalName()] = l.wrapHAMiCoreGpu(parentdev)
+hamiDev := l.wrapHAMiCoreGpu(parentdev)
+thisGPUAllocatable[hamiDev.CanonicalName()] = hamiDev

gpuInfo is a *GpuInfo whose CanonicalName() returns a name with the "gpu-" prefix (e.g. "gpu-0"). After wrapping via wrapHAMiCoreGpu(), the device becomes a HAMiGpuInfo whose CanonicalName() returns a name with the "hami-gpu-" prefix (e.g. "hami-gpu-0"). The original code stored the device under the pre-wrap key "gpu-0", but the ResourceClaim allocated by the Kubernetes scheduler refers to the device by its post-wrap name "hami-gpu-0".

Consequence:

When kubelet calls NodePrepareResources to prepare a ResourceClaim, the driver looks up s.allocatable["hami-gpu-0"] and finds nothing, because the actual entry is stored under "gpu-0". The driver returns the error:

 prepare devices failed: requested device is not allocatable: hami-gpu-0

The Pod stays in ContainerCreating indefinitely. Even if Bug 1 alone were fixed (so that devices appear in the ResourceSlice and a workload gets scheduled), the Pod would still fail to start because NodePrepareResources cannot resolve the device name.

Bug 3 — HAMiGpu Excluded from CDI Spec Cache Warmup (device_state.go)
Location: NewDeviceState() in device_state.go, lines 115–122

Root Cause:

 for _, dev := range allocatable {
     if dev.Gpu != nil {
         fullGPUuuids = append(fullGPUuuids, dev.Gpu.UUID)
+    } else if dev.HAMiGpu != nil {
+        fullGPUuuids = append(fullGPUuuids, dev.HAMiGpu.UUID)
     }
 }

wrapHAMiCoreGpu() explicitly sets parentDev.Gpu = nil and moves the GPU info into parentDev.HAMiGpu. The original CDI cache warmup loop only checked dev.Gpu != nil, so it never collected UUIDs from HAMiGpu devices and the warmup ran with an empty UUID list. The else-if branch on dev.HAMiGpu shown above is the fix.

Consequence:

The CDI spec cache is not pre-populated for any HAMi virtual GPU. The log always shows:

 Warming up CDI device spec cache for GPUs []

At device preparation time, the CDI spec cache must be consulted to generate the correct CDI device edits (device file mounts, LD_PRELOAD injection, etc.). An empty cache means the CDI edits for the physical GPU (e.g. management.nvidia.com-gpu.yaml) are not looked up ahead of time. In practice, the driver falls back to the HAMi-specific GetCDIContainerEdits path in hami_core.go for the libvgpu.so injection, so this bug does not cause an outright failure on its own. It does, however, add unnecessary latency to the first preparation of each device and removes a safety net that would catch CDI spec errors early at startup rather than at pod creation time.

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
@hami-robot hami-robot Bot requested a review from Shouren May 12, 2026 08:16
@hami-robot hami-robot Bot added the approved label May 12, 2026
@github-actions github-actions Bot added the bug Something isn't working label May 12, 2026
@hami-robot hami-robot Bot added the size/XS label May 12, 2026

@Shouren Shouren left a comment

/lgtm

@hami-robot hami-robot Bot added the lgtm label May 12, 2026
@hami-robot

hami-robot Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, Shouren

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot Bot merged commit ebf3cef into main May 12, 2026
4 checks passed

2 participants