refactor: single data-integrity check for kueue workloads#47
Conversation
📝 WalkthroughWalkthroughThis PR adds a Prohibited impact level and object-level annotation keys to the result package; implements an EnhancedVerboseFormatter and replaces multiple notebook-specific formatters; consolidates several Kueue per-resource label checks into a single cluster-wide DataIntegrityCheck with ownership-graph traversal and three invariants; removes many per-resource Kueue check files and tests; adds several Kubernetes ResourceType definitions; enriches PartialObjectMetadata with UID and OwnerReferences; and updates severity handling, output formatting, and registration to surface Prohibited findings. Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Issues
Only actionable findings are listed. Adjustments above focus on crash-safety, accurate diagnostics under RBAC, and robustness of CRD/resource resolution. 🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
5d1f0da to
4f3f291
Compare
53359bb to
5934ee6
Compare
|
/hold I want to include some screenshots from my own testing - but also want to take this PR out of |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
pkg/lint/command_options.go (1)
319-332: Default case returnstrue, which is maximally permissive.If a future
SeverityLevelvalue is introduced but not handled here, the fallback silently passes everything. Consider returningfalseto fail closed, or explicitly panic/log on unexpected values.Suggested change
func meetsMinSeverity(impact result.Impact, minLevel SeverityLevel) bool { switch minLevel { case SeverityLevelProhibited: return impact == result.ImpactProhibited case SeverityLevelCritical: return impact == result.ImpactProhibited || impact == result.ImpactBlocking case SeverityLevelWarning: return impact == result.ImpactProhibited || impact == result.ImpactBlocking || impact == result.ImpactAdvisory case SeverityLevelInfo: return true } - return true + // Unknown severity level — fail closed to avoid silently skipping findings. + return false }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lint/command_options.go` around lines 319 - 332, The meetsMinSeverity function currently falls through to return true for any unhandled SeverityLevel, which is unsafe; update the switch in meetsMinSeverity to handle all known SeverityLevel cases and change the default/fallback to return false (fail-closed) or explicitly log/panic on an unexpected SeverityLevel value so unknown levels do not silently pass; reference the SeverityLevel enum and result.Impact constants (e.g., SeverityLevelProhibited, SeverityLevelCritical, SeverityLevelWarning, SeverityLevelInfo) when updating the switch to ensure exhaustive handling.pkg/lint/checks/workloads/kueue/check.go (4)
306-317:buildCRDFQNLookupis rebuilt on everyValidatecall.
monitoredWorkloadTypesis package-level and static. Consider computing the lookup once at init or lazily withsync.Onceto avoid repeated map construction.Example with sync.Once
var ( crdfqnLookupOnce sync.Once crdfqnLookup map[string]string ) func getCRDFQNLookup() map[string]string { crdfqnLookupOnce.Do(func() { crdfqnLookup = make(map[string]string, len(monitoredWorkloadTypes)) for _, rt := range monitoredWorkloadTypes { key := rt.APIVersion() + "/" + rt.Kind crdfqnLookup[key] = rt.CRDFQN() } }) return crdfqnLookup }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lint/checks/workloads/kueue/check.go` around lines 306 - 317, buildCRDFQNLookup currently reconstructs the map on every Validate call even though monitoredWorkloadTypes is static; change this to compute the map once (either at package init or lazily) by introducing package-level variables (e.g., crdfqnLookup and crdfqnLookupOnce) and a getter like getCRDFQNLookup that uses sync.Once to populate crdfqnLookup from monitoredWorkloadTypes, then update callers (e.g., Validate) to call getCRDFQNLookup instead of buildCRDFQNLookup or replace buildCRDFQNLookup with this lazy initializer so the map is not rebuilt repeatedly.
157-176: First-violation-wins per CR drops invariant 3 checks when invariants 1 or 2 fail.This is documented behavior, but it means a single CR can have a namespace mismatch and an ownership-tree inconsistency, yet only the former surfaces. If full diagnostics are ever desired, consider accumulating all violations before
continue.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lint/checks/workloads/kueue/check.go` around lines 157 - 176, Current loop uses early continues after appending a violation from checkNamespaceToWorkload or checkWorkloadToNamespace, which prevents checkOwnerTreeConsistency from running for the same CR; change the loop so all three checks run for each cr and each non-nil result is appended to violations (i.e., call checkNamespaceToWorkload(cr, kueueNamespaces), checkWorkloadToNamespace(cr, kueueNamespaces) and checkOwnerTreeConsistency(cr, graph) without using continue after the first append) so multiple violations per CR are collected.
229-264:uniqueResourceskeeps the first violation's message per resource.If a resource later violates invariant 3 after invariant 1, only the first message is retained. Given the earlier first-violation-wins logic in
checkNamespace, this is consistent—but document the coupling to avoid future drift.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lint/checks/workloads/kueue/check.go` around lines 229 - 264, The uniqueResources function currently deduplicates violations by resource identity and keeps the first violation's Message for each resource; explicitly document this coupling to checkNamespace so future changes don't change semantics. Update the comment/docstring for uniqueResources (and optionally add a short note on checkNamespace) to state that when multiple violations exist for the same resource the first occurrence wins and its Message is retained, referencing the types and symbols involved (uniqueResources, checkNamespace, violation, impactedResource) so maintainers understand the intended behavior and won't unknowingly reorder or filter violations.
186-217: Sliceallgrows via repeatedappendwithout pre-allocation.Minor allocation churn. If
monitoredWorkloadTypesand per-type counts are bounded, this is acceptable. Otherwise, consider collecting counts first or using a pooled buffer.Optional pre-allocation sketch
func listWorkloadsInNamespace( ctx context.Context, r client.Reader, namespace string, ) ([]*metav1.PartialObjectMetadata, error) { - var all []*metav1.PartialObjectMetadata + // Estimate initial capacity; exact counts unknown until iteration. + all := make([]*metav1.PartialObjectMetadata, 0, 64) for _, rt := range monitoredWorkloadTypes {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lint/checks/workloads/kueue/check.go` around lines 186 - 217, The slice variable all in the loop grows via repeated append which causes extra allocations; before appending each batch of items returned by r.ListMetadata (the items variable inside the monitoredWorkloadTypes loop), ensure capacity is pre-reserved by checking if cap(all) < len(all)+len(items) and, if so, allocate a new slice with the larger capacity and copy the existing elements into it so subsequent appends don't reallocate; adjust this logic around the loop that constructs metav1.PartialObjectMetadata objects (the obj creation block) so you pre-expand all's capacity once per batch rather than letting append repeatedly grow it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/lint/checks/components/kueue/kueue_operator_installed.go`:
- Around line 84-99: Validate() currently returns early when olm.FindOperator()
errors, which prevents emitting the prohibited condition from validateManaged
for managementState=Managed; update Validate() (in OperatorInstalledCheck) to
ensure validateManaged(req, ...) is invoked even if olm.FindOperator() fails (or
call validateManaged before returning on that error), so the prohibited
condition for managementState Managed is always set; reference validateManaged,
Validate, OperatorInstalledCheck, and olm.FindOperator to locate and adjust the
control flow accordingly.
---
Nitpick comments:
In `@pkg/lint/checks/workloads/kueue/check.go`:
- Around line 306-317: buildCRDFQNLookup currently reconstructs the map on every
Validate call even though monitoredWorkloadTypes is static; change this to
compute the map once (either at package init or lazily) by introducing
package-level variables (e.g., crdfqnLookup and crdfqnLookupOnce) and a getter
like getCRDFQNLookup that uses sync.Once to populate crdfqnLookup from
monitoredWorkloadTypes, then update callers (e.g., Validate) to call
getCRDFQNLookup instead of buildCRDFQNLookup or replace buildCRDFQNLookup with
this lazy initializer so the map is not rebuilt repeatedly.
- Around line 157-176: Current loop uses early continues after appending a
violation from checkNamespaceToWorkload or checkWorkloadToNamespace, which
prevents checkOwnerTreeConsistency from running for the same CR; change the loop
so all three checks run for each cr and each non-nil result is appended to
violations (i.e., call checkNamespaceToWorkload(cr, kueueNamespaces),
checkWorkloadToNamespace(cr, kueueNamespaces) and checkOwnerTreeConsistency(cr,
graph) without using continue after the first append) so multiple violations per
CR are collected.
- Around line 229-264: The uniqueResources function currently deduplicates
violations by resource identity and keeps the first violation's Message for each
resource; explicitly document this coupling to checkNamespace so future changes
don't change semantics. Update the comment/docstring for uniqueResources (and
optionally add a short note on checkNamespace) to state that when multiple
violations exist for the same resource the first occurrence wins and its Message
is retained, referencing the types and symbols involved (uniqueResources,
checkNamespace, violation, impactedResource) so maintainers understand the
intended behavior and won't unknowingly reorder or filter violations.
- Around line 186-217: The slice variable all in the loop grows via repeated
append which causes extra allocations; before appending each batch of items
returned by r.ListMetadata (the items variable inside the monitoredWorkloadTypes
loop), ensure capacity is pre-reserved by checking if cap(all) <
len(all)+len(items) and, if so, allocate a new slice with the larger capacity
and copy the existing elements into it so subsequent appends don't reallocate;
adjust this logic around the loop that constructs metav1.PartialObjectMetadata
objects (the obj creation block) so you pre-expand all's capacity once per batch
rather than letting append repeatedly grow it.
In `@pkg/lint/command_options.go`:
- Around line 319-332: The meetsMinSeverity function currently falls through to
return true for any unhandled SeverityLevel, which is unsafe; update the switch
in meetsMinSeverity to handle all known SeverityLevel cases and change the
default/fallback to return false (fail-closed) or explicitly log/panic on an
unexpected SeverityLevel value so unknown levels do not silently pass; reference
the SeverityLevel enum and result.Impact constants (e.g.,
SeverityLevelProhibited, SeverityLevelCritical, SeverityLevelWarning,
SeverityLevelInfo) when updating the switch to ensure exhaustive handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 778554f0-3c23-4f6f-a463-486dc880370f
📒 Files selected for processing (39)
pkg/lint/check/result/result.gopkg/lint/check/result/result_test.gopkg/lint/check/verbose_formatter.gopkg/lint/check/verbose_formatter_test.gopkg/lint/checks/components/kueue/kueue_operator_installed.gopkg/lint/checks/components/kueue/kueue_operator_installed_test.gopkg/lint/checks/workloads/kueue/check.gopkg/lint/checks/workloads/kueue/check_test.gopkg/lint/checks/workloads/kueue/graph.gopkg/lint/checks/workloads/kueue/invariants.gopkg/lint/checks/workloads/kueue/kueue_labels_isvc.gopkg/lint/checks/workloads/kueue/kueue_labels_isvc_test.gopkg/lint/checks/workloads/kueue/kueue_labels_llmisvc.gopkg/lint/checks/workloads/kueue/kueue_labels_llmisvc_test.gopkg/lint/checks/workloads/kueue/kueue_labels_notebook.gopkg/lint/checks/workloads/kueue/kueue_labels_notebook_test.gopkg/lint/checks/workloads/kueue/kueue_labels_pytorchjob.gopkg/lint/checks/workloads/kueue/kueue_labels_pytorchjob_test.gopkg/lint/checks/workloads/kueue/kueue_labels_raycluster.gopkg/lint/checks/workloads/kueue/kueue_labels_raycluster_test.gopkg/lint/checks/workloads/kueue/kueue_labels_rayjob.gopkg/lint/checks/workloads/kueue/kueue_labels_rayjob_test.gopkg/lint/checks/workloads/kueue/support.gopkg/lint/checks/workloads/notebook/accelerator_migration.gopkg/lint/checks/workloads/notebook/connection_integrity.gopkg/lint/checks/workloads/notebook/container_name.gopkg/lint/checks/workloads/notebook/hardware_profile_integrity.gopkg/lint/checks/workloads/notebook/hardwareprofile_migration.gopkg/lint/checks/workloads/notebook/impacted.gopkg/lint/checks/workloads/notebook/running_workloads.gopkg/lint/checks/workloads/notebook/verbose_formatter.gopkg/lint/command.gopkg/lint/command_options.gopkg/lint/command_options_test.gopkg/lint/constants.gopkg/lint/output_table.gopkg/lint/output_table_test.gopkg/resources/types.gopkg/util/kube/resources.go
💤 Files with no reviewable changes (13)
- pkg/lint/checks/workloads/notebook/verbose_formatter.go
- pkg/lint/checks/workloads/kueue/kueue_labels_raycluster.go
- pkg/lint/checks/workloads/kueue/kueue_labels_notebook.go
- pkg/lint/checks/workloads/kueue/kueue_labels_pytorchjob.go
- pkg/lint/checks/workloads/kueue/kueue_labels_rayjob_test.go
- pkg/lint/checks/workloads/kueue/kueue_labels_isvc_test.go
- pkg/lint/checks/workloads/kueue/kueue_labels_pytorchjob_test.go
- pkg/lint/checks/workloads/kueue/kueue_labels_llmisvc_test.go
- pkg/lint/checks/workloads/kueue/kueue_labels_notebook_test.go
- pkg/lint/checks/workloads/kueue/kueue_labels_rayjob.go
- pkg/lint/checks/workloads/kueue/kueue_labels_isvc.go
- pkg/lint/checks/workloads/kueue/kueue_labels_raycluster_test.go
- pkg/lint/checks/workloads/kueue/kueue_labels_llmisvc.go
Replace 6 per-resource-type kueue label checks (Notebook, ISVC,
LLMISVC, RayCluster, RayJob, PyTorchJob) with a single cluster-wide
DataIntegrityCheck that validates three invariants:
1. Every workload in a kueue-managed namespace has the queue-name label
2. Every workload with the queue-name label is in a kueue-managed namespace
3. Within each top-level CR's ownership tree, all resources agree on the
queue-name label value
Key design decisions:
- Top-down traversal from monitored CRs (Notebook, InferenceService,
LLMInferenceService, RayCluster, RayJob, PyTorchJob) through an
ownership graph of intermediate types (Deployment, StatefulSet,
ReplicaSet, DaemonSet, Job, CronJob, Pod)
- Ownership graph built once per namespace and reused across all CRs
- Pre-filter to only process kueue-relevant namespaces (union of
kueue-managed namespaces and namespaces with kueue-labeled workloads)
- Single KueueConsistency condition per run with first-violation-wins
per CR to avoid redundant diagnostics
Introduces ImpactProhibited as the highest severity level ("prohibited"),
representing conditions where an upgrade MUST NOT proceed:
- Kueue DataIntegrityCheck emits prohibited impact on label violations
- Kueue OperatorInstalledCheck escalated to prohibited when Kueue
managementState is Managed — there is no known, reliable upgrade
path from embedded Kueue; migration to the Red Hat build of Kueue
operator is required before upgrading regardless of whether RHBoK
is currently installed. The previous behavior distinguished between
Managed+operator-present (blocking) and Managed+operator-absent (pass);
both cases now emit prohibited unconditionally
Prohibited severity support across the output layer:
- SeverityLevelProhibited filter level for --severity flag
- Prohibited banner (box-drawn border) rendered above the summary
table when any prohibiteconditions are detected
- Double exclamation mark (‼) status icon for prohibited conditions
- PROHIBITED verdict in output (replaces FAIL for this severity)
- Summary line now includes Prohibited count
- Exit error message distinguishes prohibited from blocking
Refactors NotebookVerboseFormatter into EnhancedVerboseFormatter
in pkg/lint/check/:
- Handles mixed resource types by deriving per-object CRD FQN from
TypeMeta via per-object AnnotationObjectCRDName, with result-level
annotation preference for single-kind results
- Exports CRDFullyQualifiedName and DeriveCRDFQNFromTypeMeta helpers
- Supports optional per-object context via AnnotationObjectContext,
rendered as a sub-bullet beneath each impacted object reference
- All notebook checks updated to embed check.EnhancedVerboseFormatter
(drop-in replacement, no behavioral change)
Adds resource type definitions for StatefulSet, ReplicaSet, DaemonSet,
Job, and CronJob. Updates ToPartialObjectMetadata helper to preserve
UID and OwnerReferences need for ownership graph construction.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Andy Stoneberg <astonebe@redhat.com>
5934ee6 to
e421b6e
Compare
Description
Replace 6 per-resource-type kueue label checks (Notebook, ISVC,
LLMISVC, RayCluster, RayJob, PyTorchJob) with a single cluster-wide
DataIntegrityCheck that validates three invariants:
queue-name label value
Key design decisions:
LLMInferenceService, RayCluster, RayJob, PyTorchJob) through an
ownership graph of intermediate types (Deployment, StatefulSet,
ReplicaSet, DaemonSet, Job, CronJob, Pod)
kueue-managed namespaces and namespaces with kueue-labeled workloads)
per CR to avoid redundant diagnostics
Introduces ImpactProhibited as the highest severity level ("prohibited"),
representing conditions where an upgrade MUST NOT proceed:
managementState is Managed — there is no known, reliable upgrade
path from embedded Kueue; migration to the Red Hat build of Kueue
operator is required before upgrading regardless of whether RHBoK
is currently installed. The previous behavior distinguished between
Managed+operator-present (blocking) and Managed+operator-absent (pass);
both cases now emit prohibited unconditionally
Prohibited severity support across the output layer:
table when any prohibiteconditions are detected
Refactors NotebookVerboseFormatter into EnhancedVerboseFormatter
in pkg/lint/check/:
TypeMeta via per-object AnnotationObjectCRDName, with result-level
annotation preference for single-kind results
rendered as a sub-bullet beneath each impacted object reference
(drop-in replacement, no behavioral change)
Adds resource type definitions for StatefulSet, ReplicaSet, DaemonSet,
Job, and CronJob. Updates ToPartialObjectMetadata helper to preserve
UID and OwnerReferences need for ownership graph construction.
How Has This Been Tested?
Kueue Unmanaged with "inconsistent" Kueue state in environment
queue-namelabel) can be tricky to reproduce in a manner that doesn't fail in invariant 1 or invariant 2 - and as such isn't included in the above sample data - but was manually crafted on the cluster➜ odh-cli/ git:(feat/agressive-kueue-gating) $ ./bin/kubectl-odh lint --target-version 3.3 --verboseLook for
prohibitedseverity in outputTOP LEVEL BANNER
SUMMARY TABLE
VERBOSE OUTPUT TABLE
RESULT OUTPUT
Kueue Managed
https://console-openshift-console.apps.prod.rhoai.rh-aiservices-bu.com➜ odh-cli/ git:(feat/agressive-kueue-gating) $ ./bin/kubectl-odh lint --target-version 3.3 --verbose --debugLook for
prohibitedseverity in outputSUMMARY TABLE
Kueue Removed (and/or not present)
➜ odh-cli/ git:(feat/agressive-kueue-gating) $ ./bin/kubectl-odh lint --target-version 3.3 --verboseSUMMARY TABLE
VERBOSE TABLE
RESULT OUTPUT
Merge criteria:
Summary by CodeRabbit
New Features
Bug Fixes