
Conditions/operator config #1011

Open

SchSeba wants to merge 8 commits into k8snetworkplumbingwg:master from SchSeba:conditions/operator-config

Conversation

SchSeba (Collaborator) commented Jan 11, 2026

No description provided.

@github-actions commented:

Thanks for your PR.
To run vendor CIs, maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

@gemini-code-assist commented:

Summary of Changes

Hello @SchSeba, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly.

This pull request significantly improves the user experience and automation capabilities of the SR-IOV Network Operator by adopting standard Kubernetes conditions for its SriovOperatorConfig CRD. It introduces a structured way to report the health, progress, and degraded states of the operator's components, making it easier for users to monitor and troubleshoot. The changes include new API definitions, a dedicated status management utility, and comprehensive testing to ensure the robustness of the new condition-based status reporting.

Highlights

  • Standardized Kubernetes Conditions: Introduces Kubernetes metav1.Condition to SriovOperatorConfigStatus and defines common condition types (Ready, Progressing, Degraded, DrainProgressing, DrainDegraded, DrainComplete) and reasons for consistent status reporting across SR-IOV CRDs.
  • Enhanced Observability: Adds kubectl get printer columns for Ready, Progressing, and Degraded conditions to SriovOperatorConfig, improving at-a-glance status visibility.
  • Robust Status Management Utility: Implements a new pkg/status package with a Patcher utility for reliable status updates, including retry logic for conflicts and automatic event emission on condition transitions.
  • Comprehensive Design Document: Includes a detailed design document (doc/design/kubernetes-conditions-integration.md) explaining the motivation, goals, implementation, and usage of the new conditions framework.
  • Improved Test Coverage: Adds extensive unit and integration tests for the new condition handling logic and status patcher, ensuring reliability and correctness.
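For illustration, a condition added by this change would render in the SriovOperatorConfig status roughly as follows (a sketch following Kubernetes condition conventions; the field values and timestamp are hypothetical, not taken from this PR):

```yaml
status:
  conditions:
  - type: Ready
    status: "True"
    reason: OperatorReady
    message: Operator components are running and healthy
    observedGeneration: 2
    lastTransitionTime: "2026-01-11T00:00:00Z"
```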
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              Command              Description
Code Review          /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary      Provides a summary of the current pull request in its current state.
Comment              @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant improvement by adding standard Kubernetes conditions to the SriovOperatorConfig CRD, which greatly enhances observability. The addition of a dedicated pkg/status for handling status updates with retries and event emission is a solid architectural choice. The new tests for conditions are comprehensive. However, I've identified a few issues: a potential regression due to the removal of policy sorting logic, a minor bug in how object generation is used for conditions, and a design flaw in the new status patcher that could lead to incorrect event generation under race conditions. Please see my detailed comments for suggestions on how to address these points.

I am having trouble creating individual review comments; my feedback is consolidated below.

controllers/sriovoperatorconfig_controller.go (118-129)

high

The logic for fetching and sorting SriovNetworkNodePolicyList has been removed. The comment on this code indicates that sorting is crucial to ensure a stable node affinity for the sriov-device-plugin and to prevent unnecessary pod recreations. Removing this logic without a replacement could introduce the instability that it was designed to prevent. Was this logic moved elsewhere, or is it no longer needed? If it's still required, it should be restored or its new location clarified.

controllers/sriovoperatorconfig_controller.go (608-621)

medium

The ObservedGeneration for the conditions is being set using config.GetGeneration(). However, config is the object fetched at the beginning of the Reconcile loop. To avoid race conditions and ensure the generation matches the object version being updated, you should use the generation from updatedObj, which is fetched right before the conditions are updated.

		// Operator components are having issues
		r.StatusPatcher.SetCondition(&newConditions, sriovnetworkv1.ConditionReady, metav1.ConditionFalse,
			sriovnetworkv1.ReasonOperatorComponentsNotHealthy, fmt.Sprintf("Operator reconciliation failed: %v", reconcileErr),
			updatedObj.GetGeneration())
		r.StatusPatcher.SetCondition(&newConditions, sriovnetworkv1.ConditionDegraded, metav1.ConditionTrue,
			sriovnetworkv1.ReasonOperatorComponentsNotHealthy, fmt.Sprintf("Operator components failed to reconcile: %v", reconcileErr),
			updatedObj.GetGeneration())
	} else {
		// Operator is ready and healthy
		r.StatusPatcher.SetCondition(&newConditions, sriovnetworkv1.ConditionReady, metav1.ConditionTrue,
			sriovnetworkv1.ReasonOperatorReady, "Operator components are running and healthy",
			updatedObj.GetGeneration())
		r.StatusPatcher.SetCondition(&newConditions, sriovnetworkv1.ConditionDegraded, metav1.ConditionFalse,
			sriovnetworkv1.ReasonNotDegraded, "Operator is functioning correctly",
			updatedObj.GetGeneration())

pkg/status/patcher.go (146)

medium

The UpdateStatusWithEvents function's design has a flaw regarding event generation during retries. The oldConditions parameter is captured before the retry loop. If a conflict occurs and the object is refreshed, oldConditions becomes stale. When DetectTransitions is called after a successful patch on a retry, it compares the stale oldConditions with newConditions, which can lead to incorrect or missed events.

To fix this, the oldConditions should be determined within the retry loop from the object just before it's modified. I suggest changing the signature of UpdateStatusWithEvents to have the updateFunc return the old conditions.

For example, you could change the signature to:
UpdateStatusWithEvents(ctx context.Context, obj client.Object, newConditions []metav1.Condition, updateFunc func() (oldConditions []metav1.Condition, err error)) error

And then inside the function:

// ... in for loop
oldConditions, err := updateFunc()
if err != nil {
    return fmt.Errorf("failed to apply status update: %w", err)
}
// ... after successful patch
transitions := DetectTransitions(oldConditions, newConditions)
// ...

The call site in sriovoperatorconfig_controller.go would then be adjusted to:

err := r.StatusPatcher.UpdateStatusWithEvents(ctx, updatedObj, newConditions, func() ([]metav1.Condition, error) {
    old := updatedObj.Status.Conditions
    updatedObj.Status.Conditions = newConditions
    return old, nil
})

This ensures that for each patch attempt, you are comparing against the correct "old" state for event generation.

Add foundational condition support for SR-IOV Network Operator CRDs:

- Define standard condition types (Ready, Progressing, Degraded)
- Define drain-specific condition types (DrainProgressing, DrainDegraded, DrainComplete)
- Add DrainState enum for tracking drain operation states
- Add common condition reasons for various states
- Add NetworkStatus type with Conditions field for network CRDs
- Add ConditionsEqual helper for comparing conditions ignoring LastTransitionTime
- Add SetConfigurationConditions and SetDrainConditions methods
- Add comprehensive unit tests for condition handling

This follows the Kubernetes API conventions for status conditions
and provides the foundation for observability improvements across
all SR-IOV CRDs.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
SchSeba force-pushed the conditions/operator-config branch from 6b1a6da to 69affb1 on February 18, 2026 at 09:04

coveralls commented Feb 18, 2026

Pull Request Test Coverage Report for Build 22136514631

Details

  • 189 of 287 (65.85%) changed or added relevant lines in 5 files are covered.
  • 6 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.1%) to 62.991%

Changes missing coverage:

File                                           Covered Lines  Changed/Added Lines  %
pkg/status/transitions.go                      55             59                   93.22%
api/v1/zz_generated.deepcopy.go                7              22                   31.82%
controllers/sriovoperatorconfig_controller.go  49             83                   59.04%
pkg/status/patcher.go                          58             103                  56.31%

Files with coverage reduction:

File                                           New Missed Lines  %
controllers/sriovoperatorconfig_controller.go  1                 72.46%
pkg/daemon/status.go                           5                 70.83%
Totals:

Change from base Build 22102567607: +0.1%
Covered Lines: 9484
Relevant Lines: 15056

💛 - Coveralls

Add a new status package with utilities for managing CRD status updates:

Patcher (patcher.go):
- Provides retry logic for status updates with conflict handling
- Supports both Update and Patch operations
- Includes UpdateStatusWithEvents for automatic event emission
- Embeds condition management methods for convenience
- Uses interface for easy mocking in tests

Transitions (transitions.go):
- DetectTransitions compares old and new conditions
- Returns structured Transition objects for each change
- EventType() method returns appropriate event type (Normal/Warning)
- EventReason() generates suitable event reason strings
- Supports Added, Changed, Removed, and Unchanged transitions

Tests:
- Comprehensive unit tests for patcher operations
- Tests for transition detection logic
- Tests for event type and reason generation

This package provides a reusable foundation for consistent status
handling across all controllers in the operator.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Update SriovOperatorConfig status type to include Kubernetes conditions:

- Add Conditions field to SriovOperatorConfigStatus
- Add kubebuilder printcolumns for Ready, Progressing, Degraded status

Conditions follow Kubernetes conventions with patchMergeKey
and listType annotations for proper strategic merge patch support.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
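The Conditions field and printer columns described in this commit would typically look like the fragment below. This is a sketch following standard kubebuilder markers and Kubernetes API conventions, not the exact code from the PR; the printcolumn markers sit above the CRD root type, while the list annotations decorate the status field:

```go
// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=`.status.conditions[?(@.type=="Ready")].status`
// +kubebuilder:printcolumn:name="Progressing",type="string",JSONPath=`.status.conditions[?(@.type=="Progressing")].status`
// +kubebuilder:printcolumn:name="Degraded",type="string",JSONPath=`.status.conditions[?(@.type=="Degraded")].status`

// SriovOperatorConfigStatus defines the observed state of SriovOperatorConfig.
type SriovOperatorConfigStatus struct {
	// Conditions represents the latest available observations of the config's state.
	// +optional
	// +patchMergeKey=type
	// +patchStrategy=merge
	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
}
```

The patchMergeKey/listType=map pair is what makes strategic merge patches update a condition by type rather than replacing the whole list.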
Integrate condition management into SriovOperatorConfig controller:

Controller Changes:
- Add StatusPatcher dependency for condition management
- Add updateConditions method to set Ready/Degraded conditions
- Set Ready=True, Degraded=False on successful reconciliation
- Set Ready=False, Degraded=True on reconciliation errors
- Update conditions on webhook sync failures
- Update conditions on config daemon sync failures
- Update conditions on plugin daemon sync failures
- Include error details in condition messages

Tests:
- Add test for Ready=True when operator reconciles successfully
- Add test for Ready=False, Degraded=True on errors
- Add test for observedGeneration consistency
- Add test for condition presence validation
- Add test for error message inclusion in conditions
- Add dedicated test suite for degraded conditions

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Update the kubernetes-conditions-integration.md design document to
reflect the actual implementation details:

- Add SriovNetworkNodePolicy as a supported CRD
- Add drain-specific conditions (DrainProgressing, DrainDegraded, DrainComplete)
- Document all condition reasons with actual constant names
- Add NetworkStatus shared type used by network CRDs
- Document pkg/status package with Patcher interface and transition detection
- Add ConditionsEqual helper function documentation
- Update API extension examples with actual struct definitions
- Add matchedNodeCount/readyNodeCount to Policy and PoolConfig status
- Add kubectl output examples showing new columns
- Update implementation commits list
- Mark document status as 'implemented'

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
- Remove SriovNetworkNodeState tests from conditions_test.go
- Add findCondition helper to suite_test.go
- Regenerate CRDs and deepcopy for SriovOperatorConfig type

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Add cleanup logic to remove any existing SriovOperatorConfig before
creating a new one in the SriovNetworkNodePolicy controller tests.

This fixes a flaky test failure where the SriovOperatorConfig controller
tests may leave a 'default' config behind that conflicts with the
SriovNetworkNodePolicy tests when they run with different random seeds.

The cleanup logic mirrors what's already in sriovoperatorconfig_controller_test.go:
- Remove finalizers if present
- Delete the object if not already deleted
- Wait for the object to be fully gone before creating a new one

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
SchSeba force-pushed the conditions/operator-config branch from 69affb1 to 3bf3c02 on February 18, 2026 at 10:44