Fix invalid label key format in GPU group handling. #110
base: main
Conversation
Hello @shchennu ,
I would like to understand how you reached a case of a pod with an empty gpu group name.
Can you please attach the yaml of the pod/bindingRequest this error happened for?
Recently we have upgraded from 2.19 to 2.21.
While I agree this change makes sense, I'd like to better understand how the issue occurred. According to the error log, the binder failed to sync resource reservation pods during startup. Specifically, `SyncForGpuGroup` was called with an empty GPU group: `gpuGroup` was an empty string, yet still present as a key in `gpuGroupsMap`. This situation can only happen if a pod has an empty GPU group label value. Is that also your observation? Do you know why the GPU group was empty in the first place?
It was an observation.
Isn't a pod with an empty GPU group a big issue that one would want the system to fail on, or at least be loud about, instead of just logging it? It means this pod will be stuck there forever, attached to a GPU without actually reserving it.
I agree with @enoodle. We use the `gpuGroup` name extensively in our code base, and it would be better to fail if the name isn't set correctly.
Can you share more details about that? What are the pod-grouper logs? What do the pods look like? Are podGroups being created for these pods?
Here is the Nvidia Case #00882449 for you to take a look at.
I am not able to find the ticket.
Can you send the relevant data here or contact the appropriate support directly? 2.19 and 2.21 aren't KAI versions; I guess you are using the run:ai commercial product.
@shchennu - can you please update on your thoughts regarding the questions above?
Fix Invalid Label Key Format in GPU Group Handling
Issue Description
When starting the KAI Scheduler Binder, the resource reservation sync process fails with an invalid label key error.

This occurs when the scheduler tries to use a label selector whose key has an empty name segment after the slash (`runai-gpu-group/`), which violates Kubernetes naming rules for label keys. The specific problem happens when handling a GPU group with an empty name during the resource reservation sync process.

Root Cause
In `pkg/common/resources/gpu_sharing.go`, the `GetMultiFractionGpuGroupLabel` function concatenates the constant prefix `MultiGpuGroupLabelPrefix` (which is already "runai-gpu-group/") with a GPU group name. When the GPU group name is empty, this results in a label key with a trailing slash but no name segment, which Kubernetes rejects as invalid.

This function is called in `pkg/binder/binding/resourcereservation/resource_reservation.go` during the `Sync` process, and when the GPU group is empty, it tries to create an invalid label selector.
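As a minimal, self-contained sketch of the failure mode (identifiers are lowercased and simplified; this is not the exact KAI code), concatenating the prefix with an empty group name yields a key whose name segment is empty, which Kubernetes label validation rejects:

```go
package main

import "fmt"

// Mirrors the idea of MultiGpuGroupLabelPrefix; the value is taken from the PR description.
const multiGpuGroupLabelPrefix = "runai-gpu-group/"

// getMultiFractionGpuGroupLabel sketches the concatenation performed by
// GetMultiFractionGpuGroupLabel in pkg/common/resources/gpu_sharing.go.
func getMultiFractionGpuGroupLabel(gpuGroup string) string {
	return multiGpuGroupLabelPrefix + gpuGroup
}

func main() {
	fmt.Println(getMultiFractionGpuGroupLabel("group-a")) // "runai-gpu-group/group-a" - valid label key
	fmt.Println(getMultiFractionGpuGroupLabel(""))        // "runai-gpu-group/" - invalid: empty name segment
}
```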
This PR fixes the issue by properly handling the case of empty GPU group names in one of two ways:
Option 1: Skip processing empty GPU group names in the sync process.
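A sketch of Option 1, assuming a map keyed by GPU group name like the `gpuGroupsMap` mentioned in the review discussion (the loop structure and log message are illustrative, not the exact code in `resource_reservation.go`):

```go
package main

import (
	"fmt"
	"sort"
)

// syncGpuGroups walks the group map and skips, with a warning, any entry
// whose GPU group name is empty, so no invalid label selector is ever built.
func syncGpuGroups(gpuGroupsMap map[string][]string) []string {
	var synced []string
	for gpuGroup, pods := range gpuGroupsMap {
		if gpuGroup == "" {
			fmt.Printf("warning: skipping %d pod(s) with empty GPU group name\n", len(pods))
			continue
		}
		// The real binder would call SyncForGpuGroup(gpuGroup) here.
		synced = append(synced, gpuGroup)
	}
	sort.Strings(synced) // deterministic order for display
	return synced
}

func main() {
	groups := map[string][]string{
		"":        {"stuck-pod"},
		"group-a": {"pod-1", "pod-2"},
	}
	fmt.Println(syncGpuGroups(groups)) // [group-a]
}
```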
Option 2: Ensure `GetMultiFractionGpuGroupLabel` always returns a valid label key-value pair.

We've chosen to implement Option 1, as it's the more conservative change: it preserves the original behavior while preventing the error condition.
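For completeness, a hypothetical sketch of what Option 2 could look like (the placeholder name "unknown-gpu-group" is an illustration, not from the PR):

```go
package main

import "fmt"

const multiGpuGroupLabelPrefix = "runai-gpu-group/"

// getMultiFractionGpuGroupLabelSafe always returns a syntactically valid
// label key by substituting a placeholder when the group name is empty.
func getMultiFractionGpuGroupLabelSafe(gpuGroup string) string {
	if gpuGroup == "" {
		gpuGroup = "unknown-gpu-group" // hypothetical fallback, not in the actual PR
	}
	return multiGpuGroupLabelPrefix + gpuGroup
}

func main() {
	fmt.Println(getMultiFractionGpuGroupLabelSafe(""))        // runai-gpu-group/unknown-gpu-group
	fmt.Println(getMultiFractionGpuGroupLabelSafe("group-a")) // runai-gpu-group/group-a
}
```

A drawback of this variant, consistent with the reviewers' concern, is that masking an empty name could hide the underlying data problem rather than surface it, which is one reason the skip-and-log approach of Option 1 was preferred.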
Testing Done
Impact
This fix resolves the startup error in environments where the binder tries to process GPU groups with empty names. It prevents the error without changing the behavior of GPU fractional scheduling for valid GPU group names.