@cybergeek2077
Contributor

The device plugin allocates a device by selecting the oldest pod, but it did not filter pods by node. This causes a bug when pods on different nodes have the same assigned time, which can happen with a gang scheduler such as Volcano. The plugin may then pick a pod on another node and assign that node's GPU.

This PR fixes that bug by filtering pods by node name. It also tightens the oldest-pod selection logic so that only pods that are pending and carry the right Volcano annotations are considered.
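To make the failure mode concrete, here is a minimal sketch rather than the plugin's actual code. The `pod` type, the numeric `AssignedTime` field, and the helper names `oldest` and `onNode` are hypothetical stand-ins chosen for illustration. It shows how two pods with the same assigned time on different nodes can lead a plugin on one node to pick the other node's pod, and how a node filter avoids that.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for a Kubernetes pod.
// It is not the v1.Pod type the plugin works with.
type pod struct {
	Name         string
	NodeName     string
	AssignedTime int64 // scheduler-assigned timestamp (assumed numeric here)
}

// oldest returns the pod with the smallest AssignedTime, or nil for an
// empty slice. On a tie, the earlier entry in the slice wins.
func oldest(pods []pod) *pod {
	var best *pod
	for i := range pods {
		if best == nil || pods[i].AssignedTime < best.AssignedTime {
			best = &pods[i]
		}
	}
	return best
}

// onNode keeps only the pods scheduled to the given node.
func onNode(pods []pod, node string) []pod {
	var out []pod
	for _, p := range pods {
		if p.NodeName == node {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Two pods given the same timestamp but placed on different nodes.
	pods := []pod{
		{Name: "worker-0", NodeName: "node-a", AssignedTime: 100},
		{Name: "worker-1", NodeName: "node-b", AssignedTime: 100},
	}

	// Without a node filter, a plugin running on node-b can select
	// node-a's pod, because the tie is broken by list order.
	fmt.Println(oldest(pods).Name) // prints "worker-0"

	// With the node filter, the plugin on node-b only sees its own pod.
	fmt.Println(oldest(onNode(pods, "node-b")).Name) // prints "worker-1"
}
```

Which pod wins a tie in the unfiltered case depends on the order the pods are listed in, so the wrong selection would show up only intermittently.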

HAMi reference implementation: Project-HAMi/HAMi#340

@cybergeek2077
Contributor Author

@archlitchi The build failed with an error saying there's no space left on the device, but I can't find a way to restart it.

@archlitchi
Member

Hi, could you resolve this conflict? You can leave the rest to me.

@cybergeek2077
Contributor Author

I have resolved the conflict and also added a filter on AssignedNodeAnnotations in my PR.
Below is a brief summary of my PR.

This pull request changes pkg/plugin/vgpu/util/util.go to tighten how pending pods are filtered and selected. The two main changes are that the GetPendingPod function now lists only pods on a specific node, and the getOldestPod function now selects the oldest pending pod that carries specific annotations.

Improvements to pod filtering and selection:

  • pkg/plugin/vgpu/util/util.go: Modified the GetPendingPod function to filter pods for the specified node using a FieldSelector in ListOptions.
  • pkg/plugin/vgpu/util/util.go: Updated the getOldestPod function to select the oldest pending pod with specific annotations, ensuring that only pods in the v1.PodPending phase and with the correct DeviceBindPhase and AssignedNodeAnnotations are considered.
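The two filters above could look roughly like the sketch below, which is an illustration under stated assumptions and not the code in util.go. The constant names DeviceBindPhase and AssignedNodeAnnotations come from the PR description, but their string values here are placeholders. The `podInfo` type, the `bindAllocating` value, and the helpers `nodeFieldSelector` and `isCandidate` are likewise hypothetical. `spec.nodeName` is a standard pod field selector, and "Pending" is the string value of v1.PodPending.

```go
package main

import "fmt"

const (
	// Placeholder annotation keys: the names appear in the PR text,
	// the string values are made up for this sketch.
	DeviceBindPhase         = "example.com/bind-phase"
	AssignedNodeAnnotations = "example.com/assigned-node"

	bindAllocating = "allocating" // assumed "correct" bind-phase value
	phasePending   = "Pending"    // string value of v1.PodPending
)

// podInfo is a hypothetical, simplified stand-in for v1.Pod.
type podInfo struct {
	Name        string
	Phase       string
	Annotations map[string]string
}

// nodeFieldSelector builds the selector string that would be placed in the
// FieldSelector field of ListOptions so the API server returns only pods
// scheduled to the given node.
func nodeFieldSelector(node string) string {
	return fmt.Sprintf("spec.nodeName=%s", node)
}

// isCandidate reports whether a pod should be considered when looking for
// the oldest pod: it must be pending, be in the expected bind phase, and be
// assigned to this node according to its annotations.
func isCandidate(p podInfo, node string) bool {
	return p.Phase == phasePending &&
		p.Annotations[DeviceBindPhase] == bindAllocating &&
		p.Annotations[AssignedNodeAnnotations] == node
}

func main() {
	fmt.Println(nodeFieldSelector("node-b")) // prints "spec.nodeName=node-b"

	p := podInfo{
		Name:  "worker-1",
		Phase: phasePending,
		Annotations: map[string]string{
			DeviceBindPhase:         bindAllocating,
			AssignedNodeAnnotations: "node-b",
		},
	}
	fmt.Println(isCandidate(p, "node-b")) // prints "true"
	fmt.Println(isCandidate(p, "node-a")) // prints "false"
}
```

Filtering on the server side with a field selector keeps the list small, while the annotation check acts as a second guard on the client side in case a listed pod was assigned to a different node.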

2 participants