feat: classify Kueue GPU admission failures with user-facing messages #732
nbs-rh wants to merge 1 commit into
… messages

When a Kueue workload is inadmissible (QuotaReserved=False/Inadmissible), the reconciler now distinguishes GPU quota exhaustion from generic queue errors by inspecting the Job's pod spec and the Kueue condition message. GPU jobs that can't be admitted get message_code=gpu_unavailable with a human-readable explanation; all other admission failures use queue_error. This avoids surfacing raw cluster internals through the eval-hub API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Related to https://redhat.atlassian.net/browse/RHAIRFE-2171
When a Kueue workload is inadmissible (QuotaReserved=False/Inadmissible), the reconciler now distinguishes GPU quota exhaustion from generic queue errors by inspecting the Job's pod spec and the Kueue condition message.
GPU jobs that can't be admitted get message_code=gpu_unavailable with a human-readable explanation; all other admission failures use queue_error.
This avoids surfacing raw cluster internals through the eval-hub API.
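
For reviewers, here is a minimal sketch of the classification described above. It is not the code in this PR. The types `Condition` and `Container`, the function names, the `/gpu` suffix heuristic, the substring check on the condition message, and the user-facing strings are all assumptions made for illustration; only the condition values (QuotaReserved=False/Inadmissible) and the two message codes come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a pared-down, hypothetical stand-in for a Kueue Workload condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// Container is a hypothetical stand-in holding only resource requests and limits.
type Container struct {
	Requests map[string]string
	Limits   map[string]string
}

const (
	CodeGPUUnavailable = "gpu_unavailable"
	CodeQueueError     = "queue_error"
)

// isGPUResource is an assumed heuristic: names such as "nvidia.com/gpu" end in "/gpu".
func isGPUResource(name string) bool {
	return strings.HasSuffix(name, "/gpu")
}

// requestsGPU reports whether any container requests or limits a GPU resource.
func requestsGPU(containers []Container) bool {
	for _, c := range containers {
		for name := range c.Requests {
			if isGPUResource(name) {
				return true
			}
		}
		for name := range c.Limits {
			if isGPUResource(name) {
				return true
			}
		}
	}
	return false
}

// classify returns a message code and a user-facing message for an inadmissible
// workload. ok is false when the condition is not QuotaReserved=False/Inadmissible.
func classify(containers []Container, cond Condition) (code, message string, ok bool) {
	if cond.Type != "QuotaReserved" || cond.Status != "False" || cond.Reason != "Inadmissible" {
		return "", "", false
	}
	// Assumption: a GPU job whose condition message mentions "gpu" is out of GPU quota.
	if requestsGPU(containers) && strings.Contains(strings.ToLower(cond.Message), "gpu") {
		return CodeGPUUnavailable, "No GPU capacity is currently available for this job.", true
	}
	// The raw condition message is deliberately not passed through to the caller.
	return CodeQueueError, "The job could not be admitted to the queue.", true
}

func main() {
	job := []Container{{Requests: map[string]string{"nvidia.com/gpu": "1"}}}
	cond := Condition{
		Type:    "QuotaReserved",
		Status:  "False",
		Reason:  "Inadmissible",
		Message: "insufficient quota for nvidia.com/gpu", // illustrative text only
	}
	code, msg, _ := classify(job, cond)
	fmt.Println(code, "-", msg)
}
```

Requiring both signals (a GPU request in the pod spec and a GPU mention in the condition message) keeps a GPU job that fails admission for an unrelated reason in the queue_error bucket, at the cost of depending on the wording of the Kueue message.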