Skip to content

refactor: single data-integrity check for kueue workloads#47

Merged
lburgazzoli merged 1 commit intoopendatahub-io:mainfrom
andyatmiami:feat/agressive-kueue-gating
Mar 18, 2026
Merged

refactor: single data-integrity check for kueue workloads#47
lburgazzoli merged 1 commit intoopendatahub-io:mainfrom
andyatmiami:feat/agressive-kueue-gating

Conversation

@andyatmiami
Copy link
Copy Markdown
Contributor

@andyatmiami andyatmiami commented Mar 12, 2026

Description

Replace 6 per-resource-type kueue label checks (Notebook, ISVC,
LLMISVC, RayCluster, RayJob, PyTorchJob) with a single cluster-wide
DataIntegrityCheck that validates three invariants:

  1. Every workload in a kueue-managed namespace has the queue-name label
  2. Every workload with the queue-name label is in a kueue-managed namespace
  3. Within each top-level CR's ownership tree, all resources agree on the
    queue-name label value

Key design decisions:

  • Top-down traversal from monitored CRs (Notebook, InferenceService,
    LLMInferenceService, RayCluster, RayJob, PyTorchJob) through an
    ownership graph of intermediate types (Deployment, StatefulSet,
    ReplicaSet, DaemonSet, Job, CronJob, Pod)
  • Ownership graph built once per namespace and reused across all CRs
  • Pre-filter to only process kueue-relevant namespaces (union of
    kueue-managed namespaces and namespaces with kueue-labeled workloads)
  • Single KueueConsistency condition per run with first-violation-wins
    per CR to avoid redundant diagnostics

Introduces ImpactProhibited as the highest severity level ("prohibited"),
representing conditions where an upgrade MUST NOT proceed:

  • Kueue DataIntegrityCheck emits prohibited impact on label violations
  • Kueue OperatorInstalledCheck escalated to prohibited when Kueue
    managementState is Managed — there is no known, reliable upgrade
    path from embedded Kueue; migration to the Red Hat build of Kueue
    operator is required before upgrading regardless of whether RHBoK
    is currently installed. The previous behavior distinguished between
    Managed+operator-present (blocking) and Managed+operator-absent (pass);
    both cases now emit prohibited unconditionally

Prohibited severity support across the output layer:

  • SeverityLevelProhibited filter level for --severity flag
  • Prohibited banner (box-drawn border) rendered above the summary
    table when any prohibiteconditions are detected
  • Double exclamation mark (‼) status icon for prohibited conditions
  • PROHIBITED verdict in output (replaces FAIL for this severity)
  • Summary line now includes Prohibited count
  • Exit error message distinguishes prohibited from blocking

Refactors NotebookVerboseFormatter into EnhancedVerboseFormatter
in pkg/lint/check/:

  • Handles mixed resource types by deriving per-object CRD FQN from
    TypeMeta via per-object AnnotationObjectCRDName, with result-level
    annotation preference for single-kind results
  • Exports CRDFullyQualifiedName and DeriveCRDFQNFromTypeMeta helpers
  • Supports optional per-object context via AnnotationObjectContext,
    rendered as a sub-bullet beneath each impacted object reference
  • All notebook checks updated to embed check.EnhancedVerboseFormatter
    (drop-in replacement, no behavioral change)

Adds resource type definitions for StatefulSet, ReplicaSet, DaemonSet,
Job, and CronJob. Updates ToPartialObjectMetadata helper to preserve
UID and OwnerReferences need for ownership graph construction.

How Has This Been Tested?

Kueue Unmanaged with "inconsistent" Kueue state in environment

  1. Create a 2.25.3 RHOAI instance
  2. Create sample data that includes all the "invariant checks" we want to guard against
  3. ➜ odh-cli/ git:(feat/agressive-kueue-gating) $ ./bin/kubectl-odh lint --target-version 3.3 --verbose

Look for prohibited severity in output

TOP LEVEL BANNER

image

SUMMARY TABLE

image

VERBOSE OUTPUT TABLE

image

RESULT OUTPUT

image

Kueue Managed

  1. Leverage existing BU cluster ( the "big" / "bad" one I use frequently for regression testing)
    • https://console-openshift-console.apps.prod.rhoai.rh-aiservices-bu.com
  2. ➜ odh-cli/ git:(feat/agressive-kueue-gating) $ ./bin/kubectl-odh lint --target-version 3.3 --verbose --debug

Look for prohibited severity in output

SUMMARY TABLE

image

Kueue Removed (and/or not present)

  1. Create a 2.25.3 RHOAI instance *without Kueue
  2. Create simple kueue-less sample data
  3. ➜ odh-cli/ git:(feat/agressive-kueue-gating) $ ./bin/kubectl-odh lint --target-version 3.3 --verbose

SUMMARY TABLE

image

VERBOSE TABLE

image

RESULT OUTPUT

image

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Added a Prohibited impact level that blocks upgrades and surfaces a prominent "Prohibited Violations" banner
    • Enhanced diagnostic output with CRD-qualified object names, per-namespace grouping, and object context
    • Consolidated Kueue validations into a unified cluster-wide data-integrity check
  • Bug Fixes

    • Improved impact computation and validation messaging (includes Prohibited alongside Blocking/Advisory)
    • Broadened workload resource coverage (StatefulSet/ReplicaSet/DaemonSet/Job/CronJob)

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Walkthrough

This PR adds a Prohibited impact level and object-level annotation keys to the result package; implements an EnhancedVerboseFormatter and replaces multiple notebook-specific formatters; consolidates several Kueue per-resource label checks into a single cluster-wide DataIntegrityCheck with ownership-graph traversal and three invariants; removes many per-resource Kueue check files and tests; adds several Kubernetes ResourceType definitions; enriches PartialObjectMetadata with UID and OwnerReferences; and updates severity handling, output formatting, and registration to surface Prohibited findings.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Issues

  • Fragile CRD pluralization heuristic (CWE-20: Improper Input Validation)
    File: pkg/lint/check/verbose_formatter.go, function DeriveCRDFQNFromTypeMeta
    Risk: irregular plural forms produce incorrect CRD FQNs and silent lookup failures. Action: replace heuristic with API-backed discovery (list CRDs or query server for resource info) or validate derived plural against the server before use.

  • NULL pointer dereference risk (CWE-476: NULL Pointer Dereference)
    File: pkg/lint/checks/workloads/kueue/graph.go, function newGraphNode
    Risk: accessing item.OwnerReferences without guarding against nil PartialObjectMetadata or missing ObjectMeta fields. Action: add nil checks and defensive handling for missing metadata; fail gracefully with contextual error.

  • RBAC/visibility gap in cluster-wide enumeration (CWE-732/CWE-276: Incorrect Permission Assignment / Least Privilege Violation)
    File: pkg/lint/checks/workloads/kueue/check.go, method Validate (listWorkloadsInNamespace)
    Risk: results may be incomplete under limited permissions, producing false negatives. Action: detect/listing errors that indicate permission denial and surface a blocking condition or annotate results explaining partial visibility; document required permissions.

  • Unsanitized label comparisons (CWE-20: Improper Input Validation)
    File: pkg/lint/checks/workloads/kueue/invariants.go, function checkOwnerTreeConsistency
    Risk: raw label values compared without normalization or verification, causing spurious mismatch violations or enabling attacker-controlled labels to influence checks. Action: canonicalize label values, validate allowed characters, and where relevant verify referenced queues exist (or demote to advisory if verification is not possible).

  • Behavioral regression in Managed-state handling (not strictly security but breaking behavioral change)
    File: pkg/lint/checks/components/kueue/kueue_operator_installed.go, function validateManaged
    Risk: previous logic allowed Managed+RedHat build; new code marks all Managed as Prohibited. Action: confirm intent with stakeholders; if unintended, restore conditional behavior or document the stricter policy and adjust callers/tests accordingly.

  • Insufficient error context when listing workloads (CWE-703: Incorrect Handling of Exceptional Conditions)
    File: pkg/lint/checks/workloads/kueue/check.go, function listWorkloadsInNamespace
    Risk: non-CRD errors are returned without namespace/resource context, hindering debugging. Action: wrap returned errors with namespace and resource type context (fmt.Errorf or errors.Wrap) before returning.

Only actionable findings are listed. Adjustments above focus on crash-safety, accurate diagnostics under RBAC, and robustness of CRD/resource resolution.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the primary refactoring: replacing six per-resource kueue checks with a single data-integrity check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@andyatmiami andyatmiami force-pushed the feat/agressive-kueue-gating branch 3 times, most recently from 5d1f0da to 4f3f291 Compare March 13, 2026 20:21
@valdar valdar requested a review from lburgazzoli March 14, 2026 08:03
@andyatmiami andyatmiami force-pushed the feat/agressive-kueue-gating branch from 53359bb to 5934ee6 Compare March 17, 2026 21:20
@andyatmiami andyatmiami marked this pull request as ready for review March 17, 2026 21:20
@andyatmiami
Copy link
Copy Markdown
Contributor Author

/hold

I want to include some screenshots from my own testing - but also want to take this PR out of Draft to signify its ready for review.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
pkg/lint/command_options.go (1)

319-332: Default case returns true, which is maximally permissive.

If a future SeverityLevel value is introduced but not handled here, the fallback silently passes everything. Consider returning false to fail closed, or explicitly panic/log on unexpected values.

Suggested change
 func meetsMinSeverity(impact result.Impact, minLevel SeverityLevel) bool {
 	switch minLevel {
 	case SeverityLevelProhibited:
 		return impact == result.ImpactProhibited
 	case SeverityLevelCritical:
 		return impact == result.ImpactProhibited || impact == result.ImpactBlocking
 	case SeverityLevelWarning:
 		return impact == result.ImpactProhibited || impact == result.ImpactBlocking || impact == result.ImpactAdvisory
 	case SeverityLevelInfo:
 		return true
 	}
 
-	return true
+	// Unknown severity level — fail closed to avoid silently skipping findings.
+	return false
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lint/command_options.go` around lines 319 - 332, The meetsMinSeverity
function currently falls through to return true for any unhandled SeverityLevel,
which is unsafe; update the switch in meetsMinSeverity to handle all known
SeverityLevel cases and change the default/fallback to return false
(fail-closed) or explicitly log/panic on an unexpected SeverityLevel value so
unknown levels do not silently pass; reference the SeverityLevel enum and
result.Impact constants (e.g., SeverityLevelProhibited, SeverityLevelCritical,
SeverityLevelWarning, SeverityLevelInfo) when updating the switch to ensure
exhaustive handling.
pkg/lint/checks/workloads/kueue/check.go (4)

306-317: buildCRDFQNLookup is rebuilt on every Validate call.

monitoredWorkloadTypes is package-level and static. Consider computing the lookup once at init or lazily with sync.Once to avoid repeated map construction.

Example with sync.Once
var (
	crdfqnLookupOnce sync.Once
	crdfqnLookup     map[string]string
)

func getCRDFQNLookup() map[string]string {
	crdfqnLookupOnce.Do(func() {
		crdfqnLookup = make(map[string]string, len(monitoredWorkloadTypes))
		for _, rt := range monitoredWorkloadTypes {
			key := rt.APIVersion() + "/" + rt.Kind
			crdfqnLookup[key] = rt.CRDFQN()
		}
	})
	return crdfqnLookup
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lint/checks/workloads/kueue/check.go` around lines 306 - 317,
buildCRDFQNLookup currently reconstructs the map on every Validate call even
though monitoredWorkloadTypes is static; change this to compute the map once
(either at package init or lazily) by introducing package-level variables (e.g.,
crdfqnLookup and crdfqnLookupOnce) and a getter like getCRDFQNLookup that uses
sync.Once to populate crdfqnLookup from monitoredWorkloadTypes, then update
callers (e.g., Validate) to call getCRDFQNLookup instead of buildCRDFQNLookup or
replace buildCRDFQNLookup with this lazy initializer so the map is not rebuilt
repeatedly.

157-176: First-violation-wins per CR drops invariant 3 checks when invariants 1 or 2 fail.

This is documented behavior, but it means a single CR can have a namespace mismatch and an ownership-tree inconsistency, yet only the former surfaces. If full diagnostics are ever desired, consider accumulating all violations before continue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lint/checks/workloads/kueue/check.go` around lines 157 - 176, Current
loop uses early continues after appending a violation from
checkNamespaceToWorkload or checkWorkloadToNamespace, which prevents
checkOwnerTreeConsistency from running for the same CR; change the loop so all
three checks run for each cr and each non-nil result is appended to violations
(i.e., call checkNamespaceToWorkload(cr, kueueNamespaces),
checkWorkloadToNamespace(cr, kueueNamespaces) and checkOwnerTreeConsistency(cr,
graph) without using continue after the first append) so multiple violations per
CR are collected.

229-264: uniqueResources keeps the first violation's message per resource.

If a resource later violates invariant 3 after invariant 1, only the first message is retained. Given the earlier first-violation-wins logic in checkNamespace, this is consistent—but document the coupling to avoid future drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lint/checks/workloads/kueue/check.go` around lines 229 - 264, The
uniqueResources function currently deduplicates violations by resource identity
and keeps the first violation's Message for each resource; explicitly document
this coupling to checkNamespace so future changes don't change semantics. Update
the comment/docstring for uniqueResources (and optionally add a short note on
checkNamespace) to state that when multiple violations exist for the same
resource the first occurrence wins and its Message is retained, referencing the
types and symbols involved (uniqueResources, checkNamespace, violation,
impactedResource) so maintainers understand the intended behavior and won't
unknowingly reorder or filter violations.

186-217: Slice all grows via repeated append without pre-allocation.

Minor allocation churn. If monitoredWorkloadTypes and per-type counts are bounded, this is acceptable. Otherwise, consider collecting counts first or using a pooled buffer.

Optional pre-allocation sketch
 func listWorkloadsInNamespace(
 	ctx context.Context,
 	r client.Reader,
 	namespace string,
 ) ([]*metav1.PartialObjectMetadata, error) {
-	var all []*metav1.PartialObjectMetadata
+	// Estimate initial capacity; exact counts unknown until iteration.
+	all := make([]*metav1.PartialObjectMetadata, 0, 64)
 
 	for _, rt := range monitoredWorkloadTypes {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lint/checks/workloads/kueue/check.go` around lines 186 - 217, The slice
variable all in the loop grows via repeated append which causes extra
allocations; before appending each batch of items returned by r.ListMetadata
(the items variable inside the monitoredWorkloadTypes loop), ensure capacity is
pre-reserved by checking if cap(all) < len(all)+len(items) and, if so, allocate
a new slice with the larger capacity and copy the existing elements into it so
subsequent appends don't reallocate; adjust this logic around the loop that
constructs metav1.PartialObjectMetadata objects (the obj creation block) so you
pre-expand all's capacity once per batch rather than letting append repeatedly
grow it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/lint/checks/components/kueue/kueue_operator_installed.go`:
- Around line 84-99: Validate() currently returns early when olm.FindOperator()
errors, which prevents emitting the prohibited condition from validateManaged
for managementState=Managed; update Validate() (in OperatorInstalledCheck) to
ensure validateManaged(req, ...) is invoked even if olm.FindOperator() fails (or
call validateManaged before returning on that error), so the prohibited
condition for managementState Managed is always set; reference validateManaged,
Validate, OperatorInstalledCheck, and olm.FindOperator to locate and adjust the
control flow accordingly.

---

Nitpick comments:
In `@pkg/lint/checks/workloads/kueue/check.go`:
- Around line 306-317: buildCRDFQNLookup currently reconstructs the map on every
Validate call even though monitoredWorkloadTypes is static; change this to
compute the map once (either at package init or lazily) by introducing
package-level variables (e.g., crdfqnLookup and crdfqnLookupOnce) and a getter
like getCRDFQNLookup that uses sync.Once to populate crdfqnLookup from
monitoredWorkloadTypes, then update callers (e.g., Validate) to call
getCRDFQNLookup instead of buildCRDFQNLookup or replace buildCRDFQNLookup with
this lazy initializer so the map is not rebuilt repeatedly.
- Around line 157-176: Current loop uses early continues after appending a
violation from checkNamespaceToWorkload or checkWorkloadToNamespace, which
prevents checkOwnerTreeConsistency from running for the same CR; change the loop
so all three checks run for each cr and each non-nil result is appended to
violations (i.e., call checkNamespaceToWorkload(cr, kueueNamespaces),
checkWorkloadToNamespace(cr, kueueNamespaces) and checkOwnerTreeConsistency(cr,
graph) without using continue after the first append) so multiple violations per
CR are collected.
- Around line 229-264: The uniqueResources function currently deduplicates
violations by resource identity and keeps the first violation's Message for each
resource; explicitly document this coupling to checkNamespace so future changes
don't change semantics. Update the comment/docstring for uniqueResources (and
optionally add a short note on checkNamespace) to state that when multiple
violations exist for the same resource the first occurrence wins and its Message
is retained, referencing the types and symbols involved (uniqueResources,
checkNamespace, violation, impactedResource) so maintainers understand the
intended behavior and won't unknowingly reorder or filter violations.
- Around line 186-217: The slice variable all in the loop grows via repeated
append which causes extra allocations; before appending each batch of items
returned by r.ListMetadata (the items variable inside the monitoredWorkloadTypes
loop), ensure capacity is pre-reserved by checking if cap(all) <
len(all)+len(items) and, if so, allocate a new slice with the larger capacity
and copy the existing elements into it so subsequent appends don't reallocate;
adjust this logic around the loop that constructs metav1.PartialObjectMetadata
objects (the obj creation block) so you pre-expand all's capacity once per batch
rather than letting append repeatedly grow it.

In `@pkg/lint/command_options.go`:
- Around line 319-332: The meetsMinSeverity function currently falls through to
return true for any unhandled SeverityLevel, which is unsafe; update the switch
in meetsMinSeverity to handle all known SeverityLevel cases and change the
default/fallback to return false (fail-closed) or explicitly log/panic on an
unexpected SeverityLevel value so unknown levels do not silently pass; reference
the SeverityLevel enum and result.Impact constants (e.g.,
SeverityLevelProhibited, SeverityLevelCritical, SeverityLevelWarning,
SeverityLevelInfo) when updating the switch to ensure exhaustive handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 778554f0-3c23-4f6f-a463-486dc880370f

📥 Commits

Reviewing files that changed from the base of the PR and between 7d65e7f and 5934ee6.

📒 Files selected for processing (39)
  • pkg/lint/check/result/result.go
  • pkg/lint/check/result/result_test.go
  • pkg/lint/check/verbose_formatter.go
  • pkg/lint/check/verbose_formatter_test.go
  • pkg/lint/checks/components/kueue/kueue_operator_installed.go
  • pkg/lint/checks/components/kueue/kueue_operator_installed_test.go
  • pkg/lint/checks/workloads/kueue/check.go
  • pkg/lint/checks/workloads/kueue/check_test.go
  • pkg/lint/checks/workloads/kueue/graph.go
  • pkg/lint/checks/workloads/kueue/invariants.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_isvc.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_isvc_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_llmisvc.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_llmisvc_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_notebook.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_notebook_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_pytorchjob.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_pytorchjob_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_raycluster.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_raycluster_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_rayjob.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_rayjob_test.go
  • pkg/lint/checks/workloads/kueue/support.go
  • pkg/lint/checks/workloads/notebook/accelerator_migration.go
  • pkg/lint/checks/workloads/notebook/connection_integrity.go
  • pkg/lint/checks/workloads/notebook/container_name.go
  • pkg/lint/checks/workloads/notebook/hardware_profile_integrity.go
  • pkg/lint/checks/workloads/notebook/hardwareprofile_migration.go
  • pkg/lint/checks/workloads/notebook/impacted.go
  • pkg/lint/checks/workloads/notebook/running_workloads.go
  • pkg/lint/checks/workloads/notebook/verbose_formatter.go
  • pkg/lint/command.go
  • pkg/lint/command_options.go
  • pkg/lint/command_options_test.go
  • pkg/lint/constants.go
  • pkg/lint/output_table.go
  • pkg/lint/output_table_test.go
  • pkg/resources/types.go
  • pkg/util/kube/resources.go
💤 Files with no reviewable changes (13)
  • pkg/lint/checks/workloads/notebook/verbose_formatter.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_raycluster.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_notebook.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_pytorchjob.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_rayjob_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_isvc_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_pytorchjob_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_llmisvc_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_notebook_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_rayjob.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_isvc.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_raycluster_test.go
  • pkg/lint/checks/workloads/kueue/kueue_labels_llmisvc.go

Comment thread pkg/lint/checks/components/kueue/kueue_operator_installed.go
Replace 6 per-resource-type kueue label checks (Notebook, ISVC,
LLMISVC, RayCluster, RayJob, PyTorchJob) with a single cluster-wide
DataIntegrityCheck that validates three invariants:

1. Every workload in a kueue-managed namespace has the queue-name label
2. Every workload with the queue-name label is in a kueue-managed namespace
3. Within each top-level CR's ownership tree, all resources agree on the
   queue-name label value

Key design decisions:

- Top-down traversal from monitored CRs (Notebook, InferenceService,
  LLMInferenceService, RayCluster, RayJob, PyTorchJob) through an
  ownership graph of intermediate types (Deployment, StatefulSet,
  ReplicaSet, DaemonSet, Job, CronJob, Pod)
- Ownership graph built once per namespace and reused across all CRs
- Pre-filter to only process kueue-relevant namespaces (union of
  kueue-managed namespaces and namespaces with kueue-labeled workloads)
- Single KueueConsistency condition per run with first-violation-wins
  per CR to avoid redundant diagnostics

Introduces ImpactProhibited as the highest severity level ("prohibited"),
representing conditions where an upgrade MUST NOT proceed:

- Kueue DataIntegrityCheck emits prohibited impact on label violations
- Kueue OperatorInstalledCheck escalated to prohibited when Kueue
  managementState is Managed — there is no known, reliable upgrade
  path from embedded Kueue; migration to the Red Hat build of Kueue
  operator is required before upgrading regardless of whether RHBoK
  is currently installed. The previous behavior distinguished between
  Managed+operator-present (blocking) and Managed+operator-absent (pass);
  both cases now emit prohibited unconditionally

Prohibited severity support across the output layer:

- SeverityLevelProhibited filter level for --severity flag
- Prohibited banner (box-drawn border) rendered above the summary
  table when any prohibiteconditions are detected
- Double exclamation mark (‼) status icon for prohibited conditions
- PROHIBITED verdict in output (replaces FAIL for this severity)
- Summary line now includes Prohibited count
- Exit error message distinguishes prohibited from blocking

Refactors NotebookVerboseFormatter into EnhancedVerboseFormatter
in pkg/lint/check/:

- Handles mixed resource types by deriving per-object CRD FQN from
  TypeMeta via per-object AnnotationObjectCRDName, with result-level
  annotation preference for single-kind results
- Exports CRDFullyQualifiedName and DeriveCRDFQNFromTypeMeta helpers
- Supports optional per-object context via AnnotationObjectContext,
  rendered as a sub-bullet beneath each impacted object reference
- All notebook checks updated to embed check.EnhancedVerboseFormatter
  (drop-in replacement, no behavioral change)

Adds resource type definitions for StatefulSet, ReplicaSet, DaemonSet,
Job, and CronJob. Updates ToPartialObjectMetadata helper to preserve
UID and OwnerReferences need for ownership graph construction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Andy Stoneberg <astonebe@redhat.com>
@andyatmiami andyatmiami force-pushed the feat/agressive-kueue-gating branch from 5934ee6 to e421b6e Compare March 17, 2026 21:37
@lburgazzoli lburgazzoli merged commit 95906e8 into opendatahub-io:main Mar 18, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants