Conversation

@shchennu shchennu commented Apr 30, 2025

Fix Invalid Label Key Format in GPU Group Handling

Issue Description

When starting the KAI Scheduler Binder, the resource reservation sync process is failing with the following error:

ERROR setup unable to sync resource reservation {"error": "unable to parse requirement: <nil>: Invalid value: \"runai-gpu-group/\": name part must be non-empty; name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')"}

This occurs when the scheduler tries to build a label selector whose key has nothing after the slash (runai-gpu-group/), which violates Kubernetes naming rules for label keys: the name part following the prefix must be non-empty. The problem is triggered when a GPU group with an empty name is handled during the resource reservation sync process.
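For context, Kubernetes validates label keys of the form prefix/name and requires a non-empty name part. Below is a minimal sketch (using k8s.io/apimachinery, not code from this repository) that demonstrates the rule quoted in the error message:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/util/validation"
)

func main() {
    // A label key with an empty name part after the "/" fails validation,
    // while the same prefix with a non-empty group name passes.
    for _, key := range []string{"runai-gpu-group/", "runai-gpu-group/group-1"} {
        if errs := validation.IsQualifiedName(key); len(errs) > 0 {
            fmt.Printf("%q rejected: %v\n", key, errs)
        } else {
            fmt.Printf("%q accepted\n", key)
        }
    }
}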

Root Cause

In pkg/common/resources/gpu_sharing.go, the GetMultiFractionGpuGroupLabel function concatenates the constant prefix MultiGpuGroupLabelPrefix ("runai-gpu-group/", trailing slash included) with a GPU group name. When the GPU group name is empty, the result is a label key that ends in a slash with no name part, which Kubernetes rejects as invalid.

func GetMultiFractionGpuGroupLabel(gpuGroup string) (string, string) {
    return constants.MultiGpuGroupLabelPrefix + gpuGroup, gpuGroup
}

This function is called in pkg/binder/binding/resourcereservation/resource_reservation.go during the Sync process, and when the GPU group is empty, it tries to create an invalid label selector.
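To make the failure concrete, here is a rough, self-contained sketch (the prefix constant is inlined and the selector construction is assumed; it is not necessarily the binder's exact call path) showing how an empty group name turns into a label requirement that apimachinery rejects:

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/apimachinery/pkg/selection"
)

// getMultiFractionGpuGroupLabel mirrors the function quoted above, with the
// MultiGpuGroupLabelPrefix constant inlined to keep the example self-contained.
func getMultiFractionGpuGroupLabel(gpuGroup string) (string, string) {
    return "runai-gpu-group/" + gpuGroup, gpuGroup
}

func main() {
    key, value := getMultiFractionGpuGroupLabel("") // empty GPU group name
    // Building a selector requirement from the key/value pair fails because
    // the key "runai-gpu-group/" has an empty name part after the slash.
    if _, err := labels.NewRequirement(key, selection.Equals, []string{value}); err != nil {
        fmt.Println("invalid label requirement:", err)
    }
}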

Changes

This PR fixes the issue by handling empty GPU group names explicitly. Two options were considered:

  1. Option 1: Skip processing empty GPU group names in the sync process:

    func (rsc *service) syncForGpuGroupWithLock(ctx context.Context, gpuGroup string) error {
        if gpuGroup == "" {
            // Skip empty GPU groups
            return nil
        }
        // Rest of the function...
    }
  2. Option 2: Ensure GetMultiFractionGpuGroupLabel always returns a valid label key-value pair:

    func GetMultiFractionGpuGroupLabel(gpuGroup string) (string, string) {
        if gpuGroup == "" {
            return constants.GPUGroup, "default"
        }
        return constants.MultiGpuGroupLabelPrefix + gpuGroup, gpuGroup
    }

We've chosen to implement Option 1 as it's a more conservative change that preserves the original behavior while preventing the error condition.
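As a rough illustration of the invariant Option 1 relies on, here is a minimal test sketch; the real service struct is not reproduced here, so the guarded method is stood in by a hypothetical helper that takes the per-group sync work as a callback:

package resourcereservation_test

import (
    "context"
    "testing"
)

// syncForGpuGroup is a hypothetical stand-in for the guarded method in
// resource_reservation.go: empty GPU group names are skipped, everything
// else is handed to the per-group sync work.
func syncForGpuGroup(ctx context.Context, gpuGroup string, doSync func(string) error) error {
    if gpuGroup == "" {
        // Option 1: skip empty GPU groups instead of building an invalid
        // "runai-gpu-group/" label selector.
        return nil
    }
    return doSync(gpuGroup)
}

func TestEmptyGpuGroupIsSkipped(t *testing.T) {
    var synced []string
    record := func(group string) error {
        synced = append(synced, group)
        return nil
    }

    for _, group := range []string{"", "group-1"} {
        if err := syncForGpuGroup(context.Background(), group, record); err != nil {
            t.Fatalf("unexpected error for group %q: %v", group, err)
        }
    }
    if len(synced) != 1 || synced[0] != "group-1" {
        t.Errorf("expected only \"group-1\" to be synced, got %v", synced)
    }
}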

Testing Done

  1. Verified that the binder starts successfully with the fix
  2. Confirmed that normal GPU fractional scheduling continues to work as expected
  3. Verified compatibility with existing deployments

Impact

This fix resolves the startup error in environments where the binder tries to process GPU groups with empty names. It prevents the error without changing the behavior of GPU fractional scheduling for valid GPU group names.

@shchennu shchennu force-pushed the fix_invalid_label_key branch from eae5182 to 367ec33 on May 1, 2025 04:39
@shchennu shchennu force-pushed the fix_invalid_label_key branch from 367ec33 to 391e8a9 on May 1, 2025 16:14

@davidLif davidLif left a comment

Hello @shchennu ,

I would like to understand how you reached a case of a pod with an empty gpu group name.
Can you please attach the yaml of the pod/bindingRequest this error happened for?

shchennu commented May 4, 2025

> Hello @shchennu ,
>
> I would like to understand how you reached a case of a pod with an empty gpu group name. Can you please attach the yaml of the pod/bindingRequest this error happened for?

Recently we have upgraded from 2.19 to 2.21.
This has caused the issue.

@shchennu shchennu requested review from davidLif and romanbaron May 4, 2025 16:41

romanbaron commented May 4, 2025

While I agree this change makes sense, I'd like to better understand how the issue occurred.

According to the error log, the binder failed to sync resource reservation pods during startup. Specifically, SyncForGpuGroup was called with an empty GPU group. This indicates that gpuGroup was an empty string, yet still present as a key in gpuGroupsMap. This situation can only happen if a pod has the runai-gpu-group label set, but with an empty value.

Is it also your observation? Do you know how the GPU group was empty in the first place?
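To illustrate the scenario described above, here is a simplified sketch (not the binder's actual code; the label key and collection logic are assumed from this discussion) of how a pod carrying runai-gpu-group with an empty value still contributes an empty-string key to the set of GPU groups to sync:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    pods := []corev1.Pod{
        {ObjectMeta: metav1.ObjectMeta{
            Name:   "ok-pod",
            Labels: map[string]string{"runai-gpu-group": "group-1"},
        }},
        {ObjectMeta: metav1.ObjectMeta{
            Name:   "broken-pod",
            Labels: map[string]string{"runai-gpu-group": ""}, // label present, value empty
        }},
    }

    // Collect GPU groups by label presence: the empty value still ends up as
    // a key, and a later sync of that "group" builds an invalid selector.
    gpuGroups := map[string]struct{}{}
    for _, pod := range pods {
        if group, ok := pod.Labels["runai-gpu-group"]; ok {
            gpuGroups[group] = struct{}{}
        }
    }
    fmt.Println(gpuGroups) // map[:{} group-1:{}] — "" sneaks in as a group
}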

shchennu commented May 4, 2025

> While I agree this change makes sense, I'd like to better understand how the issue occurred.
>
> According to the error log, the binder failed to sync resource reservation pods during startup. Specifically, SyncForGpuGroup was called with an empty GPU group. This indicates that gpuGroup was an empty string, yet still present as a key in gpuGroupsMap. This situation can only happen if a pod has the runai-gpu-group label set, but with an empty value.
>
> Is it also your observation? Do you know how the GPU group was empty in the first place?

Yes, that was our observation as well:

  1. Issue Analysis:

    • The error occurs during binder startup when SyncForGpuGroup is called with an empty GPU group
    • This happens because there's a pod with a runai-gpu-group label that has an empty value
    • The empty value is still treated as a valid key in gpuGroupsMap, causing the sync to fail
  2. Root Cause Investigation:

    • I've observed this issue in our test environment where:
      • Some pods were created with runai-gpu-group: "" (empty string)
      • The binder tries to sync these pods but fails due to the empty group value

enoodle commented May 4, 2025

Isn't a pod with an empty GPU group a big issue that one would want the system to fail on, or at least be loud about, instead of just logging it? It means this pod will be stuck there forever, attached to a GPU without actually reserving it.
I mean - isn't the issue that you have those pods in the first place, rather than something to ignore?
Would that mean that if someone creates a pod with this label empty, it could avoid cleanup when it should be removed?

davidLif commented May 5, 2025

I agree with @enoodle. We use the gpuGroup name extensively in our code base, and it would be better to fail if the name isn't set correctly.
I think the root issue causing the empty name is the error we should fix (once we find it), rather than adding this validation.

davidLif commented May 5, 2025

> • I've observed this issue in our test environment where:
>   • Some pods were created with runai-gpu-group: "" (empty string)
>   • The binder tries to sync these pods but fails due to the empty group value

Can you share more details about that? What are the pod-grouper logs? What do the pods look like? Are podGroups being created for these pods?

shchennu commented May 5, 2025

> • I've observed this issue in our test environment where:
>   • Some pods were created with runai-gpu-group: "" (empty string)
>   • The binder tries to sync these pods but fails due to the empty group value
>
> Can you share more details about that? What are the pod-grouper logs? What do the pods look like? Are podGroups being created for these pods?

@enoodle @davidLif

Here is the Nvidia Case #00882449 for you to take a look at.

davidLif commented May 6, 2025

> • I've observed this issue in our test environment where:
>   • Some pods were created with runai-gpu-group: "" (empty string)
>   • The binder tries to sync these pods but fails due to the empty group value
>
> Can you share more details about that? What are the pod-grouper logs? What do the pods look like? Are podGroups being created for these pods?
>
> @enoodle @davidLif
>
> Here is the Nvidia Case #00882449 for you to take a look at.

I am not able to find the ticket.

> Hello @shchennu ,
> I would like to understand how you reached a case of a pod with an empty gpu group name. Can you please attach the yaml of the pod/bindingRequest this error happened for?
>
> Recently we have upgraded from 2.19 to 2.21. This has caused the issue.

Can you send the relevant data here or contact the appropriate support directly? 2.19 and 2.21 aren't KAI versions. I guess you use the run:ai commercial product.

@romanbaron

@shchennu - can you please share your thoughts on the questions above?

@github-actions github-actions bot added the stale label Oct 26, 2025