feat: add aggressive kueue gating and enhanced verbose output #184

Merged: lburgazzoli merged 1 commit into red-hat-data-services:rhoai-3.3 from andyatmiami:feat/cherry-pick-kueue-checks on Mar 18, 2026

@andyatmiami andyatmiami commented Mar 18, 2026

Cherry-pick of 95906e89792cdb3e4650c405b0595755f70c5e52 from main.
Original-PR: opendatahub-io#47

Replace 6 per-resource-type kueue label checks (Notebook, ISVC, LLMISVC, RayCluster, RayJob, PyTorchJob) with a single cluster-wide DataIntegrityCheck that validates three invariants:

  1. Every workload in a kueue-managed namespace has the queue-name label
  2. Every workload with the queue-name label is in a kueue-managed namespace
  3. Within each top-level CR's ownership tree, all resources agree on the queue-name label value
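The three invariants can be sketched roughly as follows. This is a minimal illustration, not the check's actual code: the `resource` type and both function names are hypothetical, the set of kueue-managed namespaces is taken as an input rather than computed, and the label key is assumed to be the upstream Kueue `kueue.x-k8s.io/queue-name` label.

```go
package main

import "fmt"

// queueNameLabel is the upstream Kueue label that binds a workload to a LocalQueue
// (assumed to be the "queue-name label" referred to above).
const queueNameLabel = "kueue.x-k8s.io/queue-name"

// resource is a hypothetical, pared-down view of an object's metadata.
type resource struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// namespaceViolations reports violations of invariants 1 and 2.
// managed is the set of kueue-managed namespaces; how it is computed is out of scope here.
func namespaceViolations(rs []resource, managed map[string]bool) []string {
	var out []string
	for _, r := range rs {
		_, labeled := r.Labels[queueNameLabel]
		switch {
		case managed[r.Namespace] && !labeled: // invariant 1
			out = append(out, fmt.Sprintf("%s/%s: missing %s in a kueue-managed namespace", r.Namespace, r.Name, queueNameLabel))
		case !managed[r.Namespace] && labeled: // invariant 2
			out = append(out, fmt.Sprintf("%s/%s: has %s outside a kueue-managed namespace", r.Namespace, r.Name, queueNameLabel))
		}
	}
	return out
}

// treeAgrees checks invariant 3: every descendant carries the same
// queue-name value as the top-level CR (a missing label counts as "").
func treeAgrees(root resource, descendants []resource) bool {
	want := root.Labels[queueNameLabel]
	for _, d := range descendants {
		if d.Labels[queueNameLabel] != want {
			return false
		}
	}
	return true
}

func main() {
	managed := map[string]bool{"team-a": true}
	rs := []resource{
		{Namespace: "team-a", Name: "nb-ok", Labels: map[string]string{queueNameLabel: "default"}},
		{Namespace: "team-a", Name: "nb-unlabeled"},
		{Namespace: "team-b", Name: "job-stray", Labels: map[string]string{queueNameLabel: "default"}},
	}
	for _, v := range namespaceViolations(rs, managed) {
		fmt.Println(v)
	}
}
```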

Key design decisions:

  • Top-down traversal from monitored CRs (Notebook, InferenceService, LLMInferenceService, RayCluster, RayJob, PyTorchJob) through an ownership graph of intermediate types (Deployment, StatefulSet, ReplicaSet, DaemonSet, Job, CronJob, Pod)
  • Ownership graph built once per namespace and reused across all CRs
  • Pre-filter to only process kueue-relevant namespaces (union of kueue-managed namespaces and namespaces with kueue-labeled workloads)
  • Single KueueConsistency condition per run with first-violation-wins per CR to avoid redundant diagnostics
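A rough sketch of the ownership-graph idea described in these bullets: index every object by the UIDs in its owner references once per namespace, then walk top-down from each monitored CR and stop at the first disagreement. The `node` type and function names are hypothetical stand-ins, and the breadth-first order is an assumption; the PR does not state which traversal order the check uses.

```go
package main

import "fmt"

// node is a hypothetical, minimal view of an object in the ownership graph.
type node struct {
	UID       string
	Kind      string
	Name      string
	QueueName string   // value of the queue-name label, "" when absent
	OwnerUIDs []string // UIDs taken from metadata.ownerReferences
}

// buildChildIndex maps each owner UID to the objects that reference it.
// It is meant to be built once per namespace and shared by every CR in it.
func buildChildIndex(nodes []node) map[string][]node {
	children := make(map[string][]node)
	for _, n := range nodes {
		for _, owner := range n.OwnerUIDs {
			children[owner] = append(children[owner], n)
		}
	}
	return children
}

// firstViolation walks top-down from root and returns the first descendant
// whose queue name differs from the root's (first-violation-wins per CR).
func firstViolation(root node, children map[string][]node) (node, bool) {
	seen := map[string]bool{root.UID: true}
	queue := []node{root}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, c := range children[cur.UID] {
			if seen[c.UID] {
				continue // guards against objects with several owners
			}
			seen[c.UID] = true
			if c.QueueName != root.QueueName {
				return c, true
			}
			queue = append(queue, c)
		}
	}
	return node{}, false
}

func main() {
	nb := node{UID: "u1", Kind: "Notebook", Name: "nb", QueueName: "default"}
	sts := node{UID: "u2", Kind: "StatefulSet", Name: "nb", QueueName: "default", OwnerUIDs: []string{"u1"}}
	pod := node{UID: "u3", Kind: "Pod", Name: "nb-0", OwnerUIDs: []string{"u2"}}

	children := buildChildIndex([]node{nb, sts, pod})
	if bad, found := firstViolation(nb, children); found {
		fmt.Printf("%s %q disagrees with %s %q\n", bad.Kind, bad.Name, nb.Kind, nb.Name)
	}
}
```

Building the index once keeps the per-CR cost proportional to the size of that CR's subtree rather than to the whole namespace.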

Introduces ImpactProhibited as the highest severity level ("prohibited"), representing conditions where an upgrade MUST NOT proceed:

  • Kueue DataIntegrityCheck emits prohibited impact on label violations
  • Kueue OperatorInstalledCheck is escalated to prohibited whenever Kueue managementState is Managed. There is no known, reliable upgrade path from embedded Kueue, so migration to the Red Hat build of Kueue (RHBoK) operator is required before upgrading, regardless of whether RHBoK is currently installed. Previously the check distinguished Managed+operator-present (blocking) from Managed+operator-absent (pass); both cases now emit prohibited unconditionally
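The behavior change for the Managed case can be sketched as below. Only `ImpactProhibited` and its `"prohibited"` value come from this PR; the `Impact` type shape, `ImpactNone`, `ImpactBlocking`, the `rank` table, and the function names are assumptions made for illustration, and other managementState values are deliberately left out.

```go
package main

import "fmt"

// Impact is a hypothetical string-typed severity; the real type may differ.
type Impact string

const (
	ImpactNone       Impact = ""           // assumed placeholder for "pass"
	ImpactBlocking   Impact = "blocking"   // assumed spelling
	ImpactProhibited Impact = "prohibited" // highest level, introduced by this PR
)

// rank orders impacts so that prohibited outranks everything else.
var rank = map[Impact]int{ImpactNone: 0, ImpactBlocking: 1, ImpactProhibited: 2}

// worst returns the most severe impact among its arguments.
func worst(impacts ...Impact) Impact {
	w := ImpactNone
	for _, i := range impacts {
		if rank[i] > rank[w] {
			w = i
		}
	}
	return w
}

// managedImpactBefore models the previous behavior when managementState is Managed.
func managedImpactBefore(operatorPresent bool) Impact {
	if operatorPresent {
		return ImpactBlocking
	}
	return ImpactNone
}

// managedImpactAfter models the new behavior: prohibited regardless of the operator.
func managedImpactAfter(_ bool) Impact {
	return ImpactProhibited
}

func main() {
	for _, present := range []bool{true, false} {
		fmt.Printf("operatorPresent=%v before=%q after=%q\n",
			present, managedImpactBefore(present), managedImpactAfter(present))
	}
	fmt.Println("worst:", worst(ImpactBlocking, ImpactProhibited))
}
```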

Prohibited severity support across the output layer:

  • SeverityLevelProhibited filter level for --severity flag
  • Prohibited banner (box-drawn border) rendered above the summary table when any prohibited conditions are detected
  • Double exclamation mark (‼) status icon for prohibited conditions
  • PROHIBITED verdict in output (replaces FAIL for this severity)
  • Summary line now includes Prohibited count
  • Exit error message distinguishes prohibited from blocking
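One way the verdict and exit-error distinction could look is sketched below. The function names, the `PASS` verdict, the mapping of `FAIL` to blocking conditions, and the message wording are all illustrative assumptions; only the `PROHIBITED` and `FAIL` verdict strings are mentioned in this PR.

```go
package main

import "fmt"

// verdict picks the overall verdict from per-severity counts.
// PROHIBITED takes precedence over FAIL; "PASS" is an assumed default.
func verdict(prohibited, blocking int) string {
	switch {
	case prohibited > 0:
		return "PROHIBITED"
	case blocking > 0:
		return "FAIL"
	default:
		return "PASS"
	}
}

// exitError returns a distinct error for prohibited versus blocking findings.
// The wording here is hypothetical, not the tool's actual message.
func exitError(prohibited, blocking int) error {
	switch {
	case prohibited > 0:
		return fmt.Errorf("%d prohibited condition(s) found: upgrade must not proceed", prohibited)
	case blocking > 0:
		return fmt.Errorf("%d blocking condition(s) found", blocking)
	default:
		return nil
	}
}

func main() {
	fmt.Println(verdict(1, 2), exitError(1, 2))
	fmt.Println(verdict(0, 2), exitError(0, 2))
	fmt.Println(verdict(0, 0), exitError(0, 0))
}
```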

Refactors NotebookVerboseFormatter into EnhancedVerboseFormatter in pkg/lint/check/:

  • Handles mixed resource types by deriving per-object CRD FQN from TypeMeta via per-object AnnotationObjectCRDName, with result-level annotation preference for single-kind results
  • Exports CRDFullyQualifiedName and DeriveCRDFQNFromTypeMeta helpers
  • Supports optional per-object context via AnnotationObjectContext, rendered as a sub-bullet beneath each impacted object reference
  • All notebook checks updated to embed check.EnhancedVerboseFormatter (drop-in replacement, no behavioral change)
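As an illustration of deriving a CRD fully qualified name from type metadata: a CRD's name has the form `<plural>.<group>`, so a derivation needs the group from `apiVersion` and a plural for the kind. The sketch below is an assumption-heavy stand-in for `DeriveCRDFQNFromTypeMeta` (whose real signature is not shown in this PR): it takes plain strings, pluralizes naively by lowercasing and appending "s", and prefers a result-level value when one is set.

```go
package main

import (
	"fmt"
	"strings"
)

// deriveCRDName builds "<plural>.<group>" from apiVersion and kind.
// The pluralization is naive (lowercase + "s") and will be wrong for
// irregular kinds; the real helper may resolve plurals differently.
func deriveCRDName(apiVersion, kind string) string {
	if kind == "" {
		return ""
	}
	plural := strings.ToLower(kind) + "s"
	group, _, hasGroup := strings.Cut(apiVersion, "/")
	if !hasGroup {
		return plural // core group ("v1") has no group suffix
	}
	return plural + "." + group
}

// resolveCRDName prefers a result-level name (single-kind results) and
// falls back to a per-object derivation (mixed-kind results).
func resolveCRDName(resultLevel, apiVersion, kind string) string {
	if resultLevel != "" {
		return resultLevel
	}
	return deriveCRDName(apiVersion, kind)
}

func main() {
	fmt.Println(resolveCRDName("", "kubeflow.org/v1", "Notebook"))
	fmt.Println(resolveCRDName("", "serving.kserve.io/v1beta1", "InferenceService"))
	fmt.Println(resolveCRDName("notebooks.kubeflow.org", "apps/v1", "StatefulSet"))
}
```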

Adds resource type definitions for StatefulSet, ReplicaSet, DaemonSet, Job, and CronJob. Updates the ToPartialObjectMetadata helper to preserve UID and OwnerReferences, which are needed for ownership graph construction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Andy Stoneberg <astonebe@redhat.com>

@harshad16 harshad16 left a comment

/lgtm

👍

@lburgazzoli lburgazzoli merged commit be15b0e into red-hat-data-services:rhoai-3.3 Mar 18, 2026