
fix: clear to-allocate annotations after successful device binding #1104

Open · Kevinz857 wants to merge 2 commits into master from fix/issue-987-clear-to-allocate-annotation

Conversation

Kevinz857

PR Description

Brief Description

Fix issue #987 where pods with successfully bound devices retain hami.io/vgpu-devices-to-allocate annotations, causing scheduler confusion and Kubernetes 1.20 compatibility issues.

Problem

  • Successfully allocated pods keep hami.io/vgpu-devices-to-allocate annotations after binding
  • GPU UUID mismatch between annotations and actual allocated devices
  • Scheduler repeatedly processes already bound pods (especially on K8s 1.20)
  • SchedulerError events: "pod xxx is in the cache, so can't be assumed"
  • Temporary workaround: restart hami-scheduler pod

Root Cause:
The hami.io/vgpu-devices-to-allocate annotations are set during scheduling but never cleared after successful binding, causing Kubernetes 1.20 scheduler to treat these pods as unscheduled.

Solution

  1. Clear to-allocate annotations on successful binding (see the sketch after this list)

    • Modified updatePodAnnotationsAndReleaseLock() in pkg/device/devices.go
    • Clear all util.InRequestDevices annotations when deviceBindPhase == util.DeviceBindSuccess
  2. Add scheduler protection

    • Enhanced onAddPod() in pkg/scheduler/scheduler.go
    • Skip processing pods with bind-phase: success to prevent redundant operations
  3. Update tests

    • Modified Test_PodAllocationTrySuccess to verify annotation clearing behavior
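
Condensed, the first two changes look roughly like the sketch below. This is a reading aid rather than the exact diff: it is pieced together from the excerpts quoted in the review comments further down, and the surrounding function bodies are omitted.

// In updatePodAnnotationsAndReleaseLock() (pkg/device/devices.go): clear every
// to-allocate annotation key once the bind phase reports success.
if deviceBindPhase == util.DeviceBindSuccess {
	klog.V(5).Infof("Clearing to-allocate annotations for successfully bound pod %s/%s", pod.Namespace, pod.Name)
	for _, toAllocateKey := range util.InRequestDevices {
		// Set to empty string to remove the annotation value
		newAnnos[toAllocateKey] = ""
	}
}

// In onAddPod() (pkg/scheduler/scheduler.go): register the already bound pod for
// resource tracking, but skip the rest of the scheduling path.
if bindPhase, exists := pod.Annotations[util.DeviceBindPhase]; exists && bindPhase == util.DeviceBindSuccess {
	klog.V(5).InfoS("Skipping successfully bound pod to prevent scheduler confusion", "pod", pod.Name, "namespace", pod.Namespace, "bindPhase", bindPhase)
	podDev, _ := util.DecodePodDevices(util.SupportDevices, pod.Annotations)
	s.addPod(pod, nodeID, podDev)
	return
}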

Testing

  • All existing unit tests pass: go test ./pkg/device/ -v
  • Added test verification for annotation clearing after successful binding (see the sketch after this list)
  • Verified backward compatibility with existing functionality
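
For illustration, the added check amounts to something like the following sketch; the variable names (annos, the surrounding test setup) are placeholders rather than the actual Test_PodAllocationTrySuccess code.

// After a successful bind, every to-allocate annotation key should be cleared.
for _, toAllocateKey := range util.InRequestDevices {
	if got := annos[toAllocateKey]; got != "" {
		t.Errorf("expected %s to be cleared after successful binding, got %q", toAllocateKey, got)
	}
}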

Expected behavior after fix:

  • Successfully bound pods only have hami.io/vgpu-devices-allocated and hami.io/bind-phase: success
  • No residual hami.io/vgpu-devices-to-allocate annotations
  • Scheduler stops repeatedly processing already bound pods

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (enhancement to existing functionality)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Files Changed

  • pkg/device/devices.go - Clear to-allocate annotations on successful binding
  • pkg/scheduler/scheduler.go - Skip processing successfully bound pods
  • pkg/device/devices_test.go - Update tests to verify new behavior

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective
  • New and existing unit tests pass locally with my changes

Related Issues

Fixes #987

Contributor

hami-robott bot commented Jun 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Kevinz857
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions bot added the kind/bug label on Jun 5, 2025
Contributor

hami-robott bot commented Jun 5, 2025

Welcome @Kevinz857! It looks like this is your first PR to Project-HAMi/HAMi 🎉

hami-robott bot added the size/M label on Jun 5, 2025

- Clear hami.io/vgpu-devices-to-allocate and other to-allocate annotations when device binding succeeds
- Add scheduler protection to skip processing successfully bound pods
- Fix issue Project-HAMi#987 where pods retained to-allocate annotations causing scheduler confusion
- Update tests to verify annotation clearing behavior

This prevents Kubernetes 1.20 scheduler from repeatedly processing already allocated pods, resolving UUID mismatches and SchedulerError events.

Signed-off-by: Kevinz857 <[email protected]>
Kevinz857 force-pushed the fix/issue-987-clear-to-allocate-annotation branch from b9cb958 to 9589595 on June 5, 2025 at 13:47

codecov bot commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.

Files with missing lines      Patch %   Lines
pkg/scheduler/scheduler.go    0.00%     5 Missing and 1 partial ⚠️

Flag        Coverage Δ
unittests   63.13% <62.50%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines      Coverage Δ
pkg/device/devices.go         74.46% <100.00%> (+0.81%) ⬆️
pkg/scheduler/scheduler.go    49.62% <0.00%> (-0.77%) ⬇️

@archlitchi
Member

I get it. Sometimes a v1.20 scheduler processes a pod several times (which is perhaps a bug in v1.20), but by erasing the 'to-allocate' annotation we make sure only one GPU is bound successfully.

@archlitchi
Member

CC @Shouren @chaunceyjiang

if bindPhase, exists := pod.Annotations[util.DeviceBindPhase]; exists && bindPhase == util.DeviceBindSuccess {
	klog.V(5).InfoS("Skipping successfully bound pod to prevent scheduler confusion", "pod", pod.Name, "namespace", pod.Namespace, "bindPhase", bindPhase)
	podDev, _ := util.DecodePodDevices(util.SupportDevices, pod.Annotations)
	s.addPod(pod, nodeID, podDev)
Collaborator

I am confused about why the addPod function is always called, regardless of whether the pod matches the condition.

Author

@Shouren Even for successfully bound Pods, we still need to call addPod to ensure that the Pod and its device usage are correctly tracked in the scheduler's internal state. This is important for resource accounting and subsequent Pod scheduling decisions.

The key difference is:

  1. By checking the DeviceBindPhase flag, we avoid duplicate scheduling processing
  2. But addPod is still needed to update the scheduler's internal state and resource tracking
  3. This ensures that resources are allocated correctly while avoiding duplicate processing

Collaborator

	podDev, _ := util.DecodePodDevices(util.SupportDevices, pod.Annotations)
	s.addPod(pod, nodeID, podDev)
	return

@Kevinz857 Since addPod still needs to be called, can we simplify it by removing those lines of code?

This commit adds unit tests for the skip-processing logic added in PR Project-HAMi#1104
to fix issue Project-HAMi#987, where pods with successfully bound devices were not
properly identified, causing the scheduler to reprocess them unnecessarily.

The test verifies that:
1. Pods marked with DeviceBindPhase=success are identified correctly
2. Both regular and successfully bound pods are added for resource tracking
3. The appropriate path is taken for bound pods to prevent duplicate processing
4. Both types of pods are properly registered in the pod manager

Signed-off-by: Kevin <[email protected]>
Signed-off-by: Kevinz857 <[email protected]>

klog.V(5).Infof("Clearing to-allocate annotations for successfully bound pod %s/%s", pod.Namespace, pod.Name)
for _, toAllocateKey := range util.InRequestDevices {
	// Set to empty string to remove the annotation
	newAnnos[toAllocateKey] = ""
Collaborator

The annotations of a Pod after a successful device allocation look like this in my local cluster:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    hami.io/bind-phase: success
    hami.io/bind-time: "1749202603"
    hami.io/vgpu-devices-allocated: GPU-cf25b1b9-0695-4853-b322-61f8dd89ba1b,NVIDIA,81920,0:;
    hami.io/vgpu-devices-to-allocate: ;

@Kevinz857 I am not sure whether setting hami.io/vgpu-devices-to-allocate to an empty string in the annotations will break the default behavior.
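
One way to check this concern locally is to compare how the decoder treats an empty value against the ";" placeholder shown above. A rough sketch, not part of this PR; the test name and the pkg/util import path are assumptions, and only the DecodePodDevices call shape comes from the excerpts in this thread:

package device_test

import (
	"reflect"
	"testing"

	"github.com/Project-HAMi/HAMi/pkg/util" // import path assumed
)

// TestToAllocateEmptyVsPlaceholder is a hypothetical check comparing the decode
// result for an empty to-allocate value with the ";" placeholder value.
func TestToAllocateEmptyVsPlaceholder(t *testing.T) {
	annosEmpty := map[string]string{"hami.io/vgpu-devices-to-allocate": ""}
	annosPlaceholder := map[string]string{"hami.io/vgpu-devices-to-allocate": ";"}

	devEmpty, errEmpty := util.DecodePodDevices(util.SupportDevices, annosEmpty)
	devPlaceholder, errPlaceholder := util.DecodePodDevices(util.SupportDevices, annosPlaceholder)

	if errEmpty != nil || errPlaceholder != nil {
		t.Fatalf("decode failed: %v / %v", errEmpty, errPlaceholder)
	}
	if !reflect.DeepEqual(devEmpty, devPlaceholder) {
		t.Errorf("empty to-allocate value decodes differently from the \";\" placeholder")
	}
}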

Development

Successfully merging this pull request may close these issues.

hami.io/vgpu-devices-to-allocate is not cleared from the annotations of already-allocated pods