Skip to content

Add Kubernetes Conditions to SR-IOV Network Operator CRDs#1007

Closed
SchSeba wants to merge 14 commits intok8snetworkplumbingwg:masterfrom
SchSeba:design_conditions_implementation
Closed

Add Kubernetes Conditions to SR-IOV Network Operator CRDs#1007
SchSeba wants to merge 14 commits intok8snetworkplumbingwg:masterfrom
SchSeba:design_conditions_implementation

Conversation

@SchSeba
Copy link
Collaborator

@SchSeba SchSeba commented Jan 8, 2026

This PR enhances the observability and operational transparency of the SR-IOV Network Operator by integrating standard Kubernetes conditions into the status of all key Custom Resource Definitions (CRDs). This enables users and automated systems to easily understand the current state, progress, and health of SR-IOV network configurations directly through Kubernetes API objects.

@github-actions
Copy link

github-actions bot commented Jan 8, 2026

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@gemini-code-assist
Copy link

Summary of Changes

Hello @SchSeba, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the observability and operational transparency of the SR-IOV Network Operator by integrating standard Kubernetes conditions into the status of its Custom Resource Definitions. This change allows users and automated systems to easily understand the current state, progress, and health of SR-IOV network configurations and operator components directly through Kubernetes API objects, facilitating better monitoring, debugging, and automated remediation workflows.

Highlights

  • Kubernetes Conditions Integration: Introduced standard Kubernetes conditions (Ready, Progressing, Degraded) to the status of all major SR-IOV Network Operator Custom Resource Definitions (CRDs): SriovNetwork, SriovIBNetwork, OVSNetwork, SriovNetworkNodeState, SriovOperatorConfig, SriovNetworkNodePolicy, and SriovNetworkPoolConfig. This enhances observability and operational transparency.
  • New Status Controllers: Added dedicated status controllers (SriovNetworkNodePolicyStatusReconciler and SriovNetworkPoolConfigStatusReconciler) responsible for aggregating conditions from individual SriovNetworkNodeState objects to provide a consolidated status for policies and pool configurations.
  • Generic Status Patcher: Implemented a new pkg/status package providing a generic Patcher interface and implementation. This utility simplifies status updates with built-in retry logic for conflicts and automatic event emission for condition transitions, ensuring consistency and robustness across controllers.
  • Enhanced kubectl get Output: Added +kubebuilder:printcolumn annotations to CRD definitions, allowing users to quickly view the Ready, Progressing, and Degraded statuses directly from kubectl get commands for all affected CRDs.
  • Improved Node Drain Observability: Updated the node drain logic to include specific conditions (DrainProgressing, DrainDegraded, DrainComplete) in the SriovNetworkNodeState status, providing clearer insight into the node draining process and any encountered errors.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is an excellent pull request that significantly enhances the observability of the SR-IOV Network Operator by integrating standard Kubernetes conditions into all key CRDs. The changes are comprehensive, well-structured, and follow Kubernetes API conventions. The introduction of a status.Patcher utility for handling status updates with retries and event emissions is a great abstraction. The addition of additionalPrinterColumns is a valuable UX improvement for kubectl users. The test coverage is thorough, including unit, integration, and conformance tests, which gives high confidence in the changes. I've also noted a subtle but important bug fix in the drain controller's predicate logic. Overall, this is a high-quality contribution.

Comment on lines 122 to 123
oldAnno, hasOldAnno := e.ObjectOld.GetAnnotations()[constants.NodeStateDrainAnnotationCurrent]
newAnno, hasNewAnno := e.ObjectNew.GetAnnotations()[constants.NodeStateDrainAnnotationCurrent]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This is a great catch! Using GetAnnotations() here instead of GetLabels() fixes a bug where the DrainStateAnnotationPredicate would not have correctly detected updates to the sriovnetwork.openshift.io/state annotation. This ensures the drain controller will be properly triggered on state changes.

@SchSeba SchSeba force-pushed the design_conditions_implementation branch 5 times, most recently from 65fd6d8 to 311efde Compare January 8, 2026 15:17
@coveralls
Copy link

coveralls commented Jan 8, 2026

Pull Request Test Coverage Report for Build 20831826835

Details

  • 857 of 1087 (78.84%) changed or added relevant lines in 15 files are covered.
  • 27 unchanged lines in 7 files lost coverage.
  • Overall coverage increased (+1.1%) to 64.352%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/drain/drainer.go 25 27 92.59%
controllers/drain_controller.go 3 6 50.0%
pkg/status/transitions.go 56 59 94.92%
api/v1/zz_generated.deepcopy.go 41 48 85.42%
api/v1/helper.go 206 218 94.5%
controllers/drain_controller_helper.go 38 62 61.29%
controllers/sriovoperatorconfig_controller.go 49 83 59.04%
controllers/generic_network_controller.go 68 111 61.26%
pkg/status/patcher.go 58 103 56.31%
controllers/status_controller.go 272 329 82.67%
Files with Coverage Reduction New Missed Lines %
controllers/sriovoperatorconfig_controller.go 1 72.46%
controllers/helper.go 2 69.85%
pkg/utils/cluster.go 3 84.54%
controllers/drain_controller.go 4 77.69%
controllers/generic_network_controller.go 4 72.73%
controllers/sriovnetworkpoolconfig_controller.go 4 64.71%
controllers/drain_controller_helper.go 9 62.68%
Totals Coverage Status
Change from base Build 20823552175: 1.1%
Covered Lines: 9963
Relevant Lines: 15482

💛 - Coveralls

@SchSeba SchSeba force-pushed the design_conditions_implementation branch from b586a1f to e3f854f Compare January 8, 2026 18:07
@github-actions github-actions bot added the ci label Jan 8, 2026
@SchSeba SchSeba force-pushed the design_conditions_implementation branch from adcc301 to 7afe109 Compare January 8, 2026 18:14
@SchSeba
Copy link
Collaborator Author

SchSeba commented Jan 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive Kubernetes condition integration across all major SR-IOV Network Operator Custom Resource Definitions (CRDs), including SriovNetwork, SriovIBNetwork, OVSNetwork, SriovNetworkNodeState, SriovOperatorConfig, SriovNetworkNodePolicy, and SriovNetworkPoolConfig. The changes define common condition types (Ready, Progressing, Degraded) and specific drain-related conditions (DrainProgressing, DrainDegraded, DrainComplete) along with detailed reasons. Each CRD's status now includes a Conditions array, and kubectl get -o wide output is enhanced with new columns to display these conditions. A new pkg/status package provides a Patcher interface for managing status updates with retry logic and event emission, ensuring conditions are updated consistently and atomically. Controllers are updated to utilize this StatusPatcher to set appropriate conditions based on reconciliation outcomes, node states, and drain operations. Specifically, the drain controller now uses an onError callback to immediately update drain conditions with specific error messages, making the drain process more robust. Review comments suggest further enhancing the drain process by adding retry logic to the node cordon operation, similar to the previous exponential backoff approach, to improve resilience against transient failures during cordoning.

Comment on lines +92 to 98
if err = drain.RunCordonOrUncordon(drainHelper, node, true); err != nil {
reqLogger.Error(err, "drainNode(): Cordon failed")
if onError != nil {
onError(err)
}
reqLogger.Info("drainNode(): failed to drain node", "error", err)
return false, err
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The drain.RunCordonOrUncordon call for cordoning the node does not include retry logic. While drain.RunNodeDrain has its own retry mechanism, a transient error during the initial cordon operation could prematurely fail the entire drain process. Adding a retry mechanism (similar to the previous wait.ExponentialBackoffWithContext approach) around the cordon operation would make the drain process more robust against transient API server issues or other temporary failures during cordoning.

    // Cordon the node first
    if err = wait.ExponentialBackoffWithContext(ctx, wait.Backoff{
        Steps:    3,
        Duration: 2 * time.Second,
        Factor:   2,
    }, func(ctx context.Context) (bool, error) {
        if err := drain.RunCordonOrUncordon(drainHelper, node, true); err != nil {
            reqLogger.Info("drainNode(): Cordon failed, retrying", "error", err)
            if onError != nil {
                onError(err)
            }
            return false, nil
        }
        return true, nil
    }); err != nil {
        reqLogger.Error(err, "drainNode(): Cordon failed after retries")
        return false, err
    }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not implementing this one better to just return error and requeue

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Add foundational condition support for SR-IOV Network Operator CRDs:

- Define standard condition types (Ready, Progressing, Degraded)
- Define drain-specific condition types (DrainProgressing, DrainDegraded, DrainComplete)
- Add DrainState enum for tracking drain operation states
- Add common condition reasons for various states
- Add NetworkStatus type with Conditions field for network CRDs
- Add ConditionsEqual helper for comparing conditions ignoring LastTransitionTime
- Add SetConfigurationConditions and SetDrainConditions methods
- Add comprehensive unit tests for condition handling

This follows the Kubernetes API conventions for status conditions
and provides the foundation for observability improvements across
all SR-IOV CRDs.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Update all SR-IOV CRD status types to include Kubernetes conditions:

Network CRDs (SriovNetwork, SriovIBNetwork, OVSNetwork):
- Embed NetworkStatus with Conditions field
- Add kubebuilder printcolumns for Ready and Degraded status

SriovNetworkNodeState:
- Add Conditions field to status
- Add printcolumns for Ready, Progressing, Degraded
- Add printcolumns for drain conditions (DrainProgress, DrainDegraded, DrainComplete)

SriovOperatorConfig:
- Add Conditions field to status
- Add printcolumns for Ready, Progressing, Degraded

SriovNetworkNodePolicy:
- Add Conditions field to status
- Add MatchedNodeCount and ReadyNodeCount fields
- Add printcolumns for Matched, Ready Nodes, and conditions

SriovNetworkPoolConfig:
- Add Conditions field to status
- Add MatchedNodeCount and ReadyNodeCount fields
- Add printcolumns for Matched, Ready Nodes, and conditions

All conditions follow Kubernetes conventions with patchMergeKey
and listType annotations for proper strategic merge patch support.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
@SchSeba SchSeba force-pushed the design_conditions_implementation branch 4 times, most recently from f6c33f5 to a9b9b1b Compare January 8, 2026 20:02
@SchSeba
Copy link
Collaborator Author

SchSeba commented Jan 8, 2026

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a substantial and valuable pull request that introduces standard Kubernetes conditions across the SR-IOV Network Operator's CRDs. This change significantly enhances observability and aligns the operator with Kubernetes best practices. The implementation is well-structured, introducing a dedicated status patcher, new status controllers for aggregated status, and comprehensive test coverage. My feedback focuses on improving the efficiency of status updates, ensuring consistency in how condition changes are detected, and increasing the robustness of the drain error handling.

Add a new status package with utilities for managing CRD status updates:

Patcher (patcher.go):
- Provides retry logic for status updates with conflict handling
- Supports both Update and Patch operations
- Includes UpdateStatusWithEvents for automatic event emission
- Embeds condition management methods for convenience
- Uses interface for easy mocking in tests

Transitions (transitions.go):
- DetectTransitions compares old and new conditions
- Returns structured Transition objects for each change
- EventType() method returns appropriate event type (Normal/Warning)
- EventReason() generates suitable event reason strings
- Supports Added, Changed, Removed, and Unchanged transitions

Tests:
- Comprehensive unit tests for patcher operations
- Tests for transition detection logic
- Tests for event type and reason generation

This package provides a reusable foundation for consistent status
handling across all controllers in the operator.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Integrate condition management into network CRD controllers:

Generic Network Controller:
- Add StatusPatcher dependency for condition management
- Add updateConditions method to set Ready/Degraded conditions
- Set Ready=True, Degraded=False on successful NAD provisioning
- Set Ready=False, Degraded=True when namespace not found
- Set Ready=False, Degraded=True on provisioning failures
- Handle ownership conflicts with appropriate conditions

SriovNetwork, SriovIBNetwork, OVSNetwork Controllers:
- Add StatusPatcher field to each controller
- Pass StatusPatcher to generic reconciler
- Add Get/SetConditions methods to helper.go for each type

Tests:
- Add condition tests for all three network types
- Test Ready/Degraded states for successful provisioning
- Test degraded states for missing namespaces
- Test observedGeneration consistency
- Test ownership conflict detection
- Test cross-namespace targeting restrictions

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Integrate condition management into SriovOperatorConfig controller:

Controller Changes:
- Add StatusPatcher dependency for condition management
- Add updateConditions method to set Ready/Degraded conditions
- Set Ready=True, Degraded=False on successful reconciliation
- Set Ready=False, Degraded=True on reconciliation errors
- Update conditions on webhook sync failures
- Update conditions on config daemon sync failures
- Update conditions on plugin daemon sync failures
- Include error details in condition messages

Tests:
- Add test for Ready=True when operator reconciles successfully
- Add test for Ready=False, Degraded=True on errors
- Add test for observedGeneration consistency
- Add test for condition presence validation
- Add test for error message inclusion in conditions
- Add dedicated test suite for degraded conditions

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Update config daemon status handling to set configuration conditions:

- Call SetConfigurationConditions when updating node state sync status
- Preserve drain conditions during configuration status updates
- Use ConditionsEqual to avoid unnecessary API calls when unchanged
- Set Ready=True, Progressing=False, Degraded=False on success
- Set Progressing=True, Ready=False on in-progress configuration
- Set Degraded=True with error message on configuration failure
- Keep Degraded=True while retrying after previous failure

This ensures the daemon properly reports configuration progress
and errors through standard Kubernetes conditions, enabling
better observability and automation.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Add comprehensive drain condition management to the drain controller:

Drain Controller Changes:
- Add updateDrainConditions helper to set drain-specific conditions
- Set DrainProgressing=True when drain starts
- Set DrainDegraded=True when drain errors occur (e.g., PDB violations)
- Set DrainComplete=True when drain completes successfully
- Reset all drain conditions to idle state when returning to idle
- Use Patch instead of Update to avoid race conditions with daemon

Drainer Changes:
- Add DrainErrorCallback type for immediate error notification
- Call callback when eviction errors occur (before timeout)
- Capture first real error (e.g., PDB violation) instead of generic timeout
- Pass callback through cordon and drain operations

Controller Helper Fix:
- Fix DrainStateAnnotationPredicate to use GetAnnotations() not GetLabels()

Tests:
- Add tests for DrainComplete=True after successful drain
- Add tests for DrainDegraded=False after successful drain
- Add tests for idle state conditions
- Add tests for observedGeneration consistency
- Add tests for single node reboot drain conditions
- Add tests for error callback invocation

This enables users to observe drain progress and errors through
standard Kubernetes conditions, improving troubleshooting and
enabling automated responses to drain failures.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
…tion

Add dedicated status reconcilers for aggregating node conditions:

SriovNetworkNodePolicyStatusReconciler:
- Watches SriovNetworkNodePolicy, SriovNetworkNodeState, and Node objects
- Aggregates conditions from matching nodes based on nodeSelector
- Updates MatchedNodeCount and ReadyNodeCount in status
- Sets Ready=True only when ALL matched nodes are ready
- Sets Progressing=True when ANY node is progressing
- Sets Degraded=True when ANY node is degraded
- Sets NoMatchingNodes reason when selector matches nothing

SriovNetworkPoolConfigStatusReconciler:
- Similar aggregation logic for SriovNetworkPoolConfig
- Supports nil nodeSelector (matches all nodes)
- Skips OVS hardware offload configs (different purpose)
- Uses label selector for node matching

Shared Helper Functions:
- aggregateNodeConditions: collects condition counts from NodeStates
- buildStatusConditions: creates conditions from aggregated counts
- isConditionTrue: helper for checking condition status

Controller Registration:
- Register both status controllers in main.go
- Add findCondition helper to suite_test.go for test assertions

This enables users to see at-a-glance whether their policies
and pool configs are fully applied across all matching nodes.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Update generated files and deployment manifests:

DeepCopy (zz_generated.deepcopy.go):
- Add DeepCopyInto for NetworkStatus with Conditions
- Update DeepCopyInto for all Status types with Conditions
- Handle proper slice copying for Condition fields

CRD Manifests (config/crd/bases/):
- Add conditions schema to all CRD status types
- Add additionalPrinterColumns for condition status display
- Add matchedNodeCount/readyNodeCount to policy and poolconfig CRDs
- Conditions use listType=map with listMapKey=type

Helm Chart CRDs (deployment/sriov-network-operator-chart/crds/):
- Mirror changes from config/crd/bases/

RBAC (deploy/, deployment/sriov-network-operator-chart/templates/):
- Add explicit status subresource permissions for all CRDs
- Add get, patch, update verbs for */status resources
- Required for status controller to update conditions

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Add condition assertions to conformance tests:

Fixtures:
- Add assertCondition helper for verifying condition status
- Support SriovNetwork, SriovNetworkPoolConfig, SriovNetworkNodePolicy

Network Pool Tests (test_networkpool.go):
- Verify PoolConfig Ready=True after RDMA mode configuration
- Verify PoolConfig Progressing=False after configuration completes
- Verify MatchedNodeCount is updated correctly
- Add test for NoMatchingNodes when selector matches nothing

Policy Configuration Tests (test_policy_configuration.go):
- Verify Policy Ready=True after VF configuration
- Verify Policy Progressing=False after configuration completes
- Add condition checks for MTU and partition policies
- Add condition checks for VFIO and netdevice policies

Operator Tests (test_sriov_operator.go):
- Verify SriovNetwork Ready=True after provisioning
- Verify SriovNetwork Degraded=False when healthy
- Add condition checks for various network configurations
- Add condition validation for trust and spoofChk settings

These tests ensure conditions are correctly set during real
cluster operations and validate the end-to-end observability
improvements.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Update the kubernetes-conditions-integration.md design document to
reflect the actual implementation details:

- Add SriovNetworkNodePolicy as a supported CRD
- Add drain-specific conditions (DrainProgressing, DrainDegraded, DrainComplete)
- Document all condition reasons with actual constant names
- Add NetworkStatus shared type used by network CRDs
- Document pkg/status package with Patcher interface and transition detection
- Add ConditionsEqual helper function documentation
- Update API extension examples with actual struct definitions
- Add matchedNodeCount/readyNodeCount to Policy and PoolConfig status
- Add kubectl output examples showing new columns
- Update implementation commits list
- Mark document status as 'implemented'

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
@SchSeba SchSeba force-pushed the design_conditions_implementation branch from a9b9b1b to c9784dd Compare January 8, 2026 21:00
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
@SchSeba
Copy link
Collaborator Author

SchSeba commented Jan 14, 2026

To make the review process easier I split this one into separate PRs.
please review them in the following order

Networks-crd (sriovNetwork,sriovIBNetwork,ovsNetwork)
#1010
operatorconfig
#1011
nodeState
#1012
nodePolicy + nodePool (based on nodeState one)
#1013

closing this one

@SchSeba SchSeba closed this Jan 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants