
fix: clear to-allocate annotations after successful device binding #1104

Open · Kevinz857 wants to merge 2 commits into master from fix/issue-987-clear-to-allocate-annotation

Conversation

Kevinz857

PR Description

Brief Description

Fix issue #987 where pods with successfully bound devices retain hami.io/vgpu-devices-to-allocate annotations, causing scheduler confusion and Kubernetes 1.20 compatibility issues.

Problem

  • Successfully allocated pods keep hami.io/vgpu-devices-to-allocate annotations after binding
  • GPU UUID mismatch between annotations and actual allocated devices
  • Scheduler repeatedly processes already bound pods (especially on K8s 1.20)
  • SchedulerError events: "pod xxx is in the cache, so can't be assumed"
  • Temporary workaround: restart hami-scheduler pod

Root Cause:
The hami.io/vgpu-devices-to-allocate annotations are set during scheduling but never cleared after successful binding, causing Kubernetes 1.20 scheduler to treat these pods as unscheduled.

Solution

  1. Clear to-allocate annotations on successful binding (see the sketch after this list)

    • Modified updatePodAnnotationsAndReleaseLock() in pkg/device/devices.go
    • Clear all util.InRequestDevices annotations when deviceBindPhase == util.DeviceBindSuccess
  2. Add scheduler protection

    • Enhanced onAddPod() in pkg/scheduler/scheduler.go
    • Skip processing pods with bind-phase: success to prevent redundant operations
  3. Update tests

    • Modified Test_PodAllocationTrySuccess to verify annotation clearing behavior
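
Condensed, the first two changes look roughly like the sketch below. This is a reading aid rather than the exact diff: it is pieced together from the excerpts quoted in the review comments further down, and the surrounding function bodies are omitted.

// In updatePodAnnotationsAndReleaseLock() (pkg/device/devices.go): clear every
// to-allocate annotation key once the bind phase reports success.
if deviceBindPhase == util.DeviceBindSuccess {
	klog.V(5).Infof("Clearing to-allocate annotations for successfully bound pod %s/%s", pod.Namespace, pod.Name)
	for _, toAllocateKey := range util.InRequestDevices {
		// Set to empty string to remove the annotation value
		newAnnos[toAllocateKey] = ""
	}
}

// In onAddPod() (pkg/scheduler/scheduler.go): register the already bound pod for
// resource tracking, but skip the rest of the scheduling path.
if bindPhase, exists := pod.Annotations[util.DeviceBindPhase]; exists && bindPhase == util.DeviceBindSuccess {
	klog.V(5).InfoS("Skipping successfully bound pod to prevent scheduler confusion", "pod", pod.Name, "namespace", pod.Namespace, "bindPhase", bindPhase)
	podDev, _ := util.DecodePodDevices(util.SupportDevices, pod.Annotations)
	s.addPod(pod, nodeID, podDev)
	return
}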

Testing

  • All existing unit tests pass: go test ./pkg/device/ -v
  • Added test verification for annotation clearing after successful binding (see the sketch after this list)
  • Verified backward compatibility with existing functionality
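
For illustration, the added check amounts to something like the following sketch; the variable names (annos, the surrounding test setup) are placeholders rather than the actual Test_PodAllocationTrySuccess code.

// After a successful bind, every to-allocate annotation key should be cleared.
for _, toAllocateKey := range util.InRequestDevices {
	if got := annos[toAllocateKey]; got != "" {
		t.Errorf("expected %s to be cleared after successful binding, got %q", toAllocateKey, got)
	}
}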

Expected behavior after fix:

  • Successfully bound pods only have hami.io/vgpu-devices-allocated and hami.io/bind-phase: success
  • No residual hami.io/vgpu-devices-to-allocate annotations
  • Scheduler stops repeatedly processing already bound pods

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (enhancement to existing functionality)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Files Changed

  • pkg/device/devices.go - Clear to-allocate annotations on successful binding
  • pkg/scheduler/scheduler.go - Skip processing successfully bound pods
  • pkg/device/devices_test.go - Update tests to verify new behavior

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective
  • New and existing unit tests pass locally with my changes

Related Issues

Fixes #987

Contributor

hami-robott bot commented Jun 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Kevinz857
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions bot added the kind/bug label on Jun 5, 2025
Contributor

hami-robott bot commented Jun 5, 2025

Welcome @Kevinz857! It looks like this is your first PR to Project-HAMi/HAMi 🎉

hami-robott bot added the size/M label on Jun 5, 2025

- Clear hami.io/vgpu-devices-to-allocate and other to-allocate annotations when device binding succeeds
- Add scheduler protection to skip processing successfully bound pods
- Fix issue Project-HAMi#987 where pods retained to-allocate annotations causing scheduler confusion
- Update tests to verify annotation clearing behavior

This prevents Kubernetes 1.20 scheduler from repeatedly processing already allocated pods, resolving UUID mismatches and SchedulerError events.

Signed-off-by: Kevinz857 <[email protected]>
Kevinz857 force-pushed the fix/issue-987-clear-to-allocate-annotation branch from b9cb958 to 9589595 on June 5, 2025 at 13:47

codecov bot commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines      Patch %   Lines
pkg/scheduler/scheduler.go    0.00%     5 Missing and 1 partial ⚠️

Flag        Coverage Δ
unittests   63.13% <62.50%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines      Coverage Δ
pkg/device/devices.go         74.46% <100.00%> (+0.81%) ⬆️
pkg/scheduler/scheduler.go    49.62% <0.00%> (-0.77%) ⬇️

@archlitchi
Member

I get it. Sometimes a v1.20 scheduler processes a pod several times (which is perhaps a bug in v1.20), but by erasing the 'to-allocate' annotation we make sure only one GPU is bound successfully.

@archlitchi
Member

CC @Shouren @chaunceyjiang

if bindPhase, exists := pod.Annotations[util.DeviceBindPhase]; exists && bindPhase == util.DeviceBindSuccess {
	klog.V(5).InfoS("Skipping successfully bound pod to prevent scheduler confusion", "pod", pod.Name, "namespace", pod.Namespace, "bindPhase", bindPhase)
	podDev, _ := util.DecodePodDevices(util.SupportDevices, pod.Annotations)
	s.addPod(pod, nodeID, podDev)
Collaborator

I am confused about why the addPod function is always called, regardless of whether the pod matches the condition.

Author

@Shouren Even for successfully bound Pods, we still need to call addPod to ensure that the Pod and its device usage are correctly tracked in the scheduler's internal state. This is important for resource accounting and subsequent Pod scheduling decisions.

The key difference is:

  1. By checking the DeviceBindPhase flag, we avoid duplicate scheduling processing
  2. But addPod is still needed to update the scheduler's internal state and resource tracking
  3. This ensures that resources are allocated correctly while avoiding duplicate processing

Collaborator

	podDev, _ := util.DecodePodDevices(util.SupportDevices, pod.Annotations)
	s.addPod(pod, nodeID, podDev)
	return

@Kevinz857 Since addPod still needs to be called, can we simplify it by removing those lines of code?

This commit adds unit tests for the skip-processing logic added in PR Project-HAMi#1104
to fix issue Project-HAMi#987, where pods with successfully bound devices were not
properly identified, causing the scheduler to reprocess them unnecessarily.

The test verifies that:
1. Pods marked with DeviceBindPhase=success are identified correctly
2. Both regular and successfully bound pods are added for resource tracking
3. The appropriate path is taken for bound pods to prevent duplicate processing
4. Both types of pods are properly registered in the pod manager

Signed-off-by: Kevin <[email protected]>
Signed-off-by: Kevinz857 <[email protected]>

klog.V(5).Infof("Clearing to-allocate annotations for successfully bound pod %s/%s", pod.Namespace, pod.Name)
for _, toAllocateKey := range util.InRequestDevices {
	// Set to empty string to remove the annotation
	newAnnos[toAllocateKey] = ""
Collaborator

The annotations of a Pod after a successful device allocation look like this in my local cluster:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    hami.io/bind-phase: success
    hami.io/bind-time: "1749202603"
    hami.io/vgpu-devices-allocated: GPU-cf25b1b9-0695-4853-b322-61f8dd89ba1b,NVIDIA,81920,0:;
    hami.io/vgpu-devices-to-allocate: ;

@Kevinz857 I am not sure whether setting hami.io/vgpu-devices-to-allocate to an empty string in the annotations will break the default behavior.
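
One way to check this concern locally is to compare how the decoder treats an empty value against the ";" placeholder shown above. A rough sketch, not part of this PR; the test name and the pkg/util import path are assumptions, and only the DecodePodDevices call shape comes from the excerpts in this thread:

package device_test

import (
	"reflect"
	"testing"

	"github.com/Project-HAMi/HAMi/pkg/util" // import path assumed
)

// TestToAllocateEmptyVsPlaceholder is a hypothetical check comparing the decode
// result for an empty to-allocate value with the ";" placeholder value.
func TestToAllocateEmptyVsPlaceholder(t *testing.T) {
	annosEmpty := map[string]string{"hami.io/vgpu-devices-to-allocate": ""}
	annosPlaceholder := map[string]string{"hami.io/vgpu-devices-to-allocate": ";"}

	devEmpty, errEmpty := util.DecodePodDevices(util.SupportDevices, annosEmpty)
	devPlaceholder, errPlaceholder := util.DecodePodDevices(util.SupportDevices, annosPlaceholder)

	if errEmpty != nil || errPlaceholder != nil {
		t.Fatalf("decode failed: %v / %v", errEmpty, errPlaceholder)
	}
	if !reflect.DeepEqual(devEmpty, devPlaceholder) {
		t.Errorf("empty to-allocate value decodes differently from the \";\" placeholder")
	}
}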

Development

Successfully merging this pull request may close these issues.

hami.io/vgpu-devices-to-allocate is not cleared from the annotations of already-allocated pods