[Feature]: Improve validation errors for GPU sharing with actionable messages and supported values #1002

@kasia-kujawa

Component

webhook

Problem Statement

When a GPU sharing strategy is configured incorrectly, the validation error does not explain the actual cause. For example, when the MPS strategy is set but the MPSSupport feature gate is not enabled, the error is:
unknown GPU sharing strategy: MPS
This message doesn't tell you why it failed or what to do about it.

I personally spent time checking whether mps (lowercase) should be used, looking for whitespace issues in my config, and checking the MPS control daemon, before finally realizing the feature gate wasn't enabled 😄

Similarly, for an unknown time-slice interval, the error doesn't tell you what valid values are:
unknown time-slice interval: InvalidInterval

Proposed Solution

When a known strategy is used but its feature gate is not enabled, the error should say so clearly:
"MPS" is selected as the GPU sharing strategy, but the "MPSSupport" feature gate is not enabled
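
A minimal Go sketch of how such a check could look, assuming the webhook can look up which feature gates are enabled. The names `requiredGate` and `validateStrategyGate`, and the `map[string]bool` representation of enabled gates, are hypothetical and not taken from the project's code; only the strategy name, the gate name, and the message text come from this issue.

```go
package main

import "fmt"

// requiredGate maps a known GPU sharing strategy to the feature gate it
// depends on. Hypothetical table for illustration; strategies that need
// no gate are simply absent.
var requiredGate = map[string]string{
	"MPS": "MPSSupport",
}

// validateStrategyGate returns an actionable error when a known strategy
// is selected but the feature gate it depends on is not enabled.
// The enabled map stands in for whatever gate lookup the webhook uses.
func validateStrategyGate(strategy string, enabled map[string]bool) error {
	gate, gated := requiredGate[strategy]
	if gated && !enabled[gate] {
		return fmt.Errorf(
			"%q is selected as the GPU sharing strategy, but the %q feature gate is not enabled",
			strategy, gate)
	}
	return nil
}
```

Running this check before the "unknown strategy" check is what keeps a valid but gated strategy from being reported as unknown.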

When an unknown strategy or interval is used, the error should list the supported values:

unknown GPU sharing strategy: foo, supported GPU sharing strategies: TimeSlicing, MPS
or
unknown time-slice interval: InvalidInterval, supported time-slice intervals: Default, Short, Medium, Long
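
The two messages above share one shape, so a single helper could produce both. This is a sketch under assumptions: `validateOneOf` and the two slices are hypothetical names, and the supported values are copied from the examples in this issue rather than from the project's source.

```go
package main

import (
	"fmt"
	"strings"
)

// Supported values as listed in this issue (assumed, not read from the code).
var (
	supportedStrategies = []string{"TimeSlicing", "MPS"}
	supportedIntervals  = []string{"Default", "Short", "Medium", "Long"}
)

// validateOneOf returns nil when value is in supported. Otherwise it
// returns an error that names the rejected value and lists every
// supported one. singular and plural describe the field, for example
// "time-slice interval" and "time-slice intervals".
func validateOneOf(singular, plural, value string, supported []string) error {
	for _, s := range supported {
		if s == value {
			return nil
		}
	}
	return fmt.Errorf("unknown %s: %s, supported %s: %s",
		singular, value, plural, strings.Join(supported, ", "))
}
```

For instance, `validateOneOf("time-slice interval", "time-slice intervals", "InvalidInterval", supportedIntervals)` yields the second message shown above. Building the list from the same slice used for validation keeps the message in sync when values are added.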

Alternatives Considered

No response

Scope

Small: CLI flag, config option, minor behavior change

Upstream Kubernetes Dependencies

No response

Additional Context

No response

Metadata

Status: Done
