Conversation

@shchennu shchennu commented Apr 30, 2025

Fix Invalid Label Key Format in GPU Group Handling

Issue Description

When starting the KAI Scheduler Binder, the resource reservation sync process is failing with the following error:

ERROR setup unable to sync resource reservation {"error": "unable to parse requirement: <nil>: Invalid value: \"runai-gpu-group/\": name part must be non-empty; name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')"}

This occurs when the scheduler tries to build a label selector whose key has nothing after the slash (runai-gpu-group/), which violates Kubernetes naming rules for label keys: the name part following the prefix must be non-empty. The problem is triggered when a GPU group with an empty name is handled during the resource reservation sync process.
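For context, Kubernetes validates label keys of the form prefix/name and requires a non-empty name part. Below is a minimal sketch (using k8s.io/apimachinery, not code from this repository) that demonstrates the rule quoted in the error message:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/util/validation"
)

func main() {
    // A label key with an empty name part after the "/" fails validation,
    // while the same prefix with a non-empty group name passes.
    for _, key := range []string{"runai-gpu-group/", "runai-gpu-group/group-1"} {
        if errs := validation.IsQualifiedName(key); len(errs) > 0 {
            fmt.Printf("%q rejected: %v\n", key, errs)
        } else {
            fmt.Printf("%q accepted\n", key)
        }
    }
}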

Root Cause

In pkg/common/resources/gpu_sharing.go, the GetMultiFractionGpuGroupLabel function concatenates the constant prefix MultiGpuGroupLabelPrefix ("runai-gpu-group/", trailing slash included) with a GPU group name. When the GPU group name is empty, the result is a label key that ends in a slash with no name part, which Kubernetes rejects as invalid.

func GetMultiFractionGpuGroupLabel(gpuGroup string) (string, string) {
    return constants.MultiGpuGroupLabelPrefix + gpuGroup, gpuGroup
}

This function is called in pkg/binder/binding/resourcereservation/resource_reservation.go during the Sync process, and when the GPU group is empty, it tries to create an invalid label selector.
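To make the failure concrete, here is a rough, self-contained sketch (the prefix constant is inlined and the selector construction is assumed; it is not necessarily the binder's exact call path) showing how an empty group name turns into a label requirement that apimachinery rejects:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/apimachinery/pkg/selection"
)

// getMultiFractionGpuGroupLabel mirrors the function quoted above, with the
// MultiGpuGroupLabelPrefix constant inlined to keep the example self-contained.
func getMultiFractionGpuGroupLabel(gpuGroup string) (string, string) {
    return "runai-gpu-group/" + gpuGroup, gpuGroup
}

func main() {
    key, value := getMultiFractionGpuGroupLabel("") // empty GPU group name
    // Building a selector requirement from the key/value pair fails because
    // the key "runai-gpu-group/" has an empty name part after the slash.
    if _, err := labels.NewRequirement(key, selection.Equals, []string{value}); err != nil {
        fmt.Println("invalid label requirement:", err)
    }
}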

Changes

This PR fixes the issue by handling empty GPU group names explicitly. Two options were considered:

  1. Option 1: Skip processing empty GPU group names in the sync process:

    func (rsc *service) syncForGpuGroupWithLock(ctx context.Context, gpuGroup string) error {
        if gpuGroup == "" {
            // Skip empty GPU groups
            return nil
        }
        // Rest of the function...
    }
  2. Option 2: Ensure GetMultiFractionGpuGroupLabel always returns a valid label key-value pair:

    func GetMultiFractionGpuGroupLabel(gpuGroup string) (string, string) {
        if gpuGroup == "" {
            return constants.GPUGroup, "default"
        }
        return constants.MultiGpuGroupLabelPrefix + gpuGroup, gpuGroup
    }

We've chosen to implement Option 1 as it's a more conservative change that preserves the original behavior while preventing the error condition.
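As a rough illustration of the invariant Option 1 relies on, here is a minimal test sketch; the real service struct is not reproduced here, so the guarded method is stood in by a hypothetical helper that takes the per-group sync work as a callback:

package resourcereservation_test

import (
    "context"
    "testing"
)

// syncForGpuGroup is a hypothetical stand-in for the guarded method in
// resource_reservation.go: empty GPU group names are skipped, everything
// else is handed to the per-group sync work.
func syncForGpuGroup(ctx context.Context, gpuGroup string, doSync func(string) error) error {
    if gpuGroup == "" {
        // Option 1: skip empty GPU groups instead of building an invalid
        // "runai-gpu-group/" label selector.
        return nil
    }
    return doSync(gpuGroup)
}

func TestEmptyGpuGroupIsSkipped(t *testing.T) {
    var synced []string
    record := func(group string) error {
        synced = append(synced, group)
        return nil
    }

    for _, group := range []string{"", "group-1"} {
        if err := syncForGpuGroup(context.Background(), group, record); err != nil {
            t.Fatalf("unexpected error for group %q: %v", group, err)
        }
    }
    if len(synced) != 1 || synced[0] != "group-1" {
        t.Errorf("expected only \"group-1\" to be synced, got %v", synced)
    }
}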

Testing Done

  1. Verified that the binder starts successfully with the fix
  2. Confirmed that normal GPU fractional scheduling continues to work as expected
  3. Verified compatibility with existing deployments

Impact

This fix resolves the startup error in environments where the binder tries to process GPU groups with empty names. It prevents the error without changing the behavior of GPU fractional scheduling for valid GPU group names.

@shchennu shchennu force-pushed the fix_invalid_label_key branch from eae5182 to 367ec33 on May 1, 2025 04:39
@shchennu shchennu force-pushed the fix_invalid_label_key branch from 367ec33 to 391e8a9 on May 1, 2025 16:14

@davidLif davidLif left a comment

Hello @shchennu ,

I would like to understand how you reached a case of a pod with an empty gpu group name.
Can you please attach the yaml of the pod/bindingRequest this error happened for?

shchennu commented May 4, 2025

> Hello @shchennu ,
>
> I would like to understand how you reached a case of a pod with an empty gpu group name. Can you please attach the yaml of the pod/bindingRequest this error happened for?

Recently we have upgraded from 2.19 to 2.21.
This has caused the issue.

@shchennu shchennu requested review from davidLif and romanbaron May 4, 2025 16:41

romanbaron commented May 4, 2025

While I agree this change makes sense, I'd like to better understand how the issue occurred.

According to the error log, the binder failed to sync resource reservation pods during startup. Specifically, SyncForGpuGroup was called with an empty GPU group. This indicates that gpuGroup was an empty string, yet still present as a key in gpuGroupsMap. This situation can only happen if a pod has the runai-gpu-group label set, but with an empty value.

Is it also your observation? Do you know how the GPU group was empty in the first place?
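To illustrate the scenario described above, here is a simplified sketch (not the binder's actual code; the label key and collection logic are assumed from this discussion) of how a pod carrying runai-gpu-group with an empty value still contributes an empty-string key to the set of GPU groups to sync:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    pods := []corev1.Pod{
        {ObjectMeta: metav1.ObjectMeta{
            Name:   "ok-pod",
            Labels: map[string]string{"runai-gpu-group": "group-1"},
        }},
        {ObjectMeta: metav1.ObjectMeta{
            Name:   "broken-pod",
            Labels: map[string]string{"runai-gpu-group": ""}, // label present, value empty
        }},
    }

    // Collect GPU groups by label presence: the empty value still ends up as
    // a key, and a later sync of that "group" builds an invalid selector.
    gpuGroups := map[string]struct{}{}
    for _, pod := range pods {
        if group, ok := pod.Labels["runai-gpu-group"]; ok {
            gpuGroups[group] = struct{}{}
        }
    }
    fmt.Println(gpuGroups) // map[:{} group-1:{}] — "" sneaks in as a group
}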

shchennu commented May 4, 2025

> While I agree this change makes sense, I'd like to better understand how the issue occurred.
>
> According to the error log, the binder failed to sync resource reservation pods during startup. Specifically, SyncForGpuGroup was called with an empty GPU group. This indicates that gpuGroup was an empty string, yet still present as a key in gpuGroupsMap. This situation can only happen if a pod has the runai-gpu-group label set, but with an empty value.
>
> Is it also your observation? Do you know how the GPU group was empty in the first place?

Yes, that was our observation as well:

  1. Issue Analysis:

    • The error occurs during binder startup when SyncForGpuGroup is called with an empty GPU group
    • This happens because there's a pod with a runai-gpu-group label that has an empty value
    • The empty value is still treated as a valid key in gpuGroupsMap, causing the sync to fail
  2. Root Cause Investigation:

    • I've observed this issue in our test environment where:
      • Some pods were created with runai-gpu-group: "" (empty string)
      • The binder tries to sync these pods but fails due to the empty group value

enoodle commented May 4, 2025

Isn't a pod with an empty GPU group a big issue that one would want the system to fail on, or at least be loud about, instead of just logging it? It means this pod will be stuck there forever, attached to a GPU without actually reserving it.
I mean - isn't the issue that you have those pods in the first place, rather than something to ignore?
Would that mean that if someone creates a pod with this label empty, it could avoid cleanup when it should be removed?

davidLif commented May 5, 2025

I agree with @enoodle. We use the gpuGroup name extensively in our code base, and it would be better to fail if the name isn't set correctly.
I think the root issue causing the empty name is the error we should fix (once we find it), rather than adding this validation.

davidLif commented May 5, 2025

> • I've observed this issue in our test environment where:
>   • Some pods were created with runai-gpu-group: "" (empty string)
>   • The binder tries to sync these pods but fails due to the empty group value

Can you share more details about that? What are the pod-grouper logs? What do the pods look like? Are podGroups being created for these pods?

shchennu commented May 5, 2025

> • I've observed this issue in our test environment where:
>   • Some pods were created with runai-gpu-group: "" (empty string)
>   • The binder tries to sync these pods but fails due to the empty group value
>
> Can you share more details about that? What are the pod-grouper logs? What do the pods look like? Are podGroups being created for these pods?

@enoodle @davidLif

Here is the Nvidia Case #00882449 for you to take a look at.

davidLif commented May 6, 2025

> • I've observed this issue in our test environment where:
>   • Some pods were created with runai-gpu-group: "" (empty string)
>   • The binder tries to sync these pods but fails due to the empty group value
>
> Can you share more details about that? What are the pod-grouper logs? What do the pods look like? Are podGroups being created for these pods?
>
> @enoodle @davidLif
>
> Here is the Nvidia Case #00882449 for you to take a look at.

I am not able to find the ticket.

> Hello @shchennu ,
> I would like to understand how you reached a case of a pod with an empty gpu group name. Can you please attach the yaml of the pod/bindingRequest this error happened for?
>
> Recently we have upgraded from 2.19 to 2.21. This has caused the issue.

Can you send the relevant data here or contact the appropriate support directly? 2.19 and 2.21 aren't KAI versions. I guess you use the run:ai commercial product.

@romanbaron

@shchennu - can you please share your thoughts on the questions above?

@github-actions github-actions bot added the stale label Oct 26, 2025