Fix HAMi-core Unusable #39
Merged
Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: archlitchi, Shouren
I tried to launch a task with HAMi-DRA + k8s-dra-driver, but it failed. Here is what claude-code suggested:
Bug 1 — GPU Devices Discovered but Never Stored (nvlib.go)
Location: GetPerGpuAllocatableDevices() in nvlib.go, lines 256–259
Root Cause:
The function iterates over physical GPUs via VisitDevices. For each GPU, it creates a local map thisGPUAllocatable, wraps the GPU as a hami-gpu virtual device, and stores the device in that local map. It then immediately returns nil without ever writing thisGPUAllocatable into perGPUAllocatable, which is the map that the function ultimately returns.
Consequence:
GetPerGpuAllocatableDevices() always returns an empty map. enumerateAllPossibleDevices() therefore also returns zero devices. The ResourceSlice published to the Kubernetes API has devices: null, so the scheduler sees no GPU resources available on any node. No GPU workload can be scheduled, and the Monitor component collects no metrics (empty /metrics response).
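The failure mode in Bug 1 can be reproduced in isolation. The following is a minimal sketch, not the driver's actual code: device, visitDevices, collectBuggy, and collectFixed are hypothetical stand-ins, and keying the outer map by an integer GPU index is an assumption. Only the shape of the bug (a per-GPU map that is filled but never copied into the returned map) is taken from the description above.

```go
package main

import "fmt"

// device is a hypothetical stand-in for the driver's allocatable device type.
type device struct{ name string }

// visitDevices is a hypothetical stand-in for VisitDevices:
// it calls fn once per GPU and stops at the first error.
func visitDevices(n int, fn func(i int) error) error {
	for i := 0; i < n; i++ {
		if err := fn(i); err != nil {
			return err
		}
	}
	return nil
}

// collectBuggy mirrors the reported bug: the per-GPU map is filled,
// but the callback returns before writing it into the result map.
func collectBuggy(n int) map[int]map[string]device {
	perGPU := make(map[int]map[string]device)
	_ = visitDevices(n, func(i int) error {
		thisGPU := make(map[string]device)
		name := fmt.Sprintf("hami-gpu-%d", i)
		thisGPU[name] = device{name: name}
		return nil // thisGPU is discarded here
	})
	return perGPU
}

// collectFixed stores the per-GPU map in the result before returning.
func collectFixed(n int) map[int]map[string]device {
	perGPU := make(map[int]map[string]device)
	_ = visitDevices(n, func(i int) error {
		thisGPU := make(map[string]device)
		name := fmt.Sprintf("hami-gpu-%d", i)
		thisGPU[name] = device{name: name}
		perGPU[i] = thisGPU // the missing write
		return nil
	})
	return perGPU
}

func main() {
	// With two GPUs, the buggy variant yields 0 entries and the fixed one yields 2.
	fmt.Println(len(collectBuggy(2)), len(collectFixed(2)))
}
```

In this sketch the only difference between the two variants is the single assignment into the outer map, which matches the description that the devices are discovered correctly and only the final write is missing.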
Bug 2 — Wrong Map Key for Device Lookup (nvlib.go)
Location: Same HAMiCoreSupport block in nvlib.go, line 257
Root Cause:
gpuInfo is a *GpuInfo whose CanonicalName() returns a name with the "gpu-" prefix (e.g. "gpu-0"). After wrapping via wrapHAMiCoreGpu(), the device becomes a HAMiGpuInfo whose CanonicalName() returns a name with the "hami-gpu-" prefix (e.g. "hami-gpu-0"). The original code stores the device under the pre-wrap key "gpu-0", but the Kubernetes scheduler allocates the device by its post-wrap name "hami-gpu-0" in the ResourceClaim.
Consequence:
When kubelet calls NodePrepareResources to prepare a ResourceClaim, the driver looks up s.allocatable["hami-gpu-0"] and finds nothing, because the actual entry is stored under "gpu-0". The driver returns the error:
prepare devices failed: requested device is not allocatable: hami-gpu-0
The Pod stays in ContainerCreating indefinitely. Even if Bug 1 were fixed alone (devices do appear in the ResourceSlice and a workload gets scheduled), the Pod would still fail to start because NodePrepareResources cannot resolve the device name.
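The key mismatch in Bug 2 can be illustrated with a small sketch. All types and functions here (gpuInfo, hamiGpuInfo, storeBuggy, storeFixed, prepare) are hypothetical simplifications rather than the driver's real definitions, and the numeric id field is an assumption; the sketch only shows why a lookup by the post-wrap name fails when the entry was stored under the pre-wrap name.

```go
package main

import "fmt"

// namer is a hypothetical interface for anything with a canonical name.
type namer interface{ CanonicalName() string }

// gpuInfo is a simplified stand-in for the physical GPU type.
type gpuInfo struct{ id int }

func (g gpuInfo) CanonicalName() string { return fmt.Sprintf("gpu-%d", g.id) }

// hamiGpuInfo is a simplified stand-in for the wrapped virtual device.
type hamiGpuInfo struct{ gpuInfo }

func (h hamiGpuInfo) CanonicalName() string { return fmt.Sprintf("hami-gpu-%d", h.id) }

// storeBuggy keys the wrapped device by the name of the unwrapped GPU.
func storeBuggy(g gpuInfo) map[string]namer {
	wrapped := hamiGpuInfo{gpuInfo: g}
	return map[string]namer{g.CanonicalName(): wrapped}
}

// storeFixed keys the wrapped device by its own canonical name.
func storeFixed(g gpuInfo) map[string]namer {
	wrapped := hamiGpuInfo{gpuInfo: g}
	return map[string]namer{wrapped.CanonicalName(): wrapped}
}

// prepare looks up a requested device the way the description says
// the driver does: by the name that appears in the ResourceClaim.
func prepare(allocatable map[string]namer, requested string) error {
	if _, ok := allocatable[requested]; !ok {
		return fmt.Errorf("requested device is not allocatable: %s", requested)
	}
	return nil
}

func main() {
	g := gpuInfo{id: 0}
	fmt.Println(prepare(storeBuggy(g), "hami-gpu-0")) // lookup misses: entry is under "gpu-0"
	fmt.Println(prepare(storeFixed(g), "hami-gpu-0")) // lookup succeeds
}
```

Deriving the key from the wrapped device, rather than hard-coding a prefix, keeps the stored name and the advertised name in step if either naming scheme changes.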
Bug 3 — HAMiGpu Excluded from CDI Spec Cache Warmup (device_state.go)
Location: NewDeviceState() in device_state.go, lines 115–122
Root Cause:
wrapHAMiCoreGpu() explicitly sets parentDev.Gpu = nil and moves the GPU info into parentDev.HAMiGpu. The CDI cache warmup loop only checks dev.Gpu != nil, so it never collects UUIDs from HAMiGpu devices and the warmup runs with an empty UUID list.
Consequence:
The CDI spec cache is not pre-populated for any HAMi virtual GPU. The log always shows:
Warming up CDI device spec cache for GPUs []
At device preparation time, the CDI spec cache must be consulted to generate the correct CDI device edits (device file mounts, LD_PRELOAD injection, etc.). An empty cache means the CDI edits for the physical GPU (e.g. management.nvidia.com-gpu.yaml) are not looked up ahead of time. In practice, the driver falls back to the HAMi-specific GetCDIContainerEdits path in hami_core.go for the libvgpu.so injection, so this bug does not cause an outright failure on its own — but it creates unnecessary latency on the first preparation of each device and removes a safety net that would catch CDI spec errors early at startup rather than at pod creation time.
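The warmup gap in Bug 3 can be sketched as follows. The struct layout, the UUID field, the example UUID value, and both helper functions are assumptions made for illustration; the source only states that the loop checks dev.Gpu != nil and that wrapped devices carry their GPU info in HAMiGpu instead. The second function shows one possible way to include HAMi devices, not necessarily the fix that was merged.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical, simplified device types.
type gpu struct{ UUID string }
type hamiGpu struct{ UUID string }

// allocatableDevice holds exactly one non-nil variant in this sketch.
type allocatableDevice struct {
	Gpu     *gpu
	HAMiGpu *hamiGpu
}

// uuidsBuggy mirrors the described warmup loop: only plain GPUs are considered.
func uuidsBuggy(devs map[string]*allocatableDevice) []string {
	var uuids []string
	for _, dev := range devs {
		if dev.Gpu != nil {
			uuids = append(uuids, dev.Gpu.UUID)
		}
	}
	sort.Strings(uuids) // map iteration order is random; sort for stable output
	return uuids
}

// uuidsFixed also collects UUIDs from HAMi-wrapped devices.
func uuidsFixed(devs map[string]*allocatableDevice) []string {
	var uuids []string
	for _, dev := range devs {
		switch {
		case dev.Gpu != nil:
			uuids = append(uuids, dev.Gpu.UUID)
		case dev.HAMiGpu != nil:
			uuids = append(uuids, dev.HAMiGpu.UUID)
		}
	}
	sort.Strings(uuids)
	return uuids
}

// exampleDevices returns one wrapped device, as produced after wrapping
// (Gpu is nil, HAMiGpu is set). The UUID is a made-up placeholder.
func exampleDevices() map[string]*allocatableDevice {
	return map[string]*allocatableDevice{
		"hami-gpu-0": {HAMiGpu: &hamiGpu{UUID: "GPU-example-0"}},
	}
}

func main() {
	devs := exampleDevices()
	fmt.Println(uuidsBuggy(devs)) // empty list, matching the "GPUs []" log line
	fmt.Println(uuidsFixed(devs))
}
```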