
test: wait for operator pods readiness after networkpool configuration#1056

Open
ashokpariya0 wants to merge 1 commit into k8snetworkplumbingwg:master from ashokpariya0:wait-for-operator-pods

Conversation

@ashokpariya0
Contributor

Creating a SriovNetworkPoolConfig (

err = clients.Create(context.Background(), networkPool)

) may trigger a node reboot during SR-IOV reconfiguration. After the reboot, the SR-IOV operator pods (network-resources-injector, operator-webhook) take some time to restart. We can see the pods below stuck in ContainerCreating after the node reboot, even after the log message "Sriov state is stable":

$oc get all -n openshift-sriov-network-operator
Warning: kubevirt.io/v1 VirtualMachineInstancePresets is now deprecated and will be removed in v2.
NAME READY STATUS RESTARTS AGE
pod/network-resources-injector-nxkm5 0/1 ContainerCreating 15 25h
pod/operator-webhook-c9rtw 0/1 ContainerCreating 16 25h
pod/sriov-network-config-daemon-5lpms 1/1 Running 1 7m59s
pod/sriov-network-operator-bfb9b97cc-qdkj2 1/1 Running 16 25h
pod/testpod-79dmm 1/1 Running 9 21h

The test could proceed before these pods are ready, causing the admission webhook call to time out when creating a SriovNetworkNodePolicy.

Add an explicit wait for operator namespace pods to reach the Running/Ready state before continuing the test to avoid this race condition.

NetworkPool conformance tests may fail intermittently after creating
SriovNetworkPoolConfig because the node can reboot as part of the
configuration process.

After the reboot, several SR-IOV operator components restart, including
the sriov-network-operator, sriov-network-config-daemon and device plugin
pods. The test may proceed while these pods are still initializing.

When the test immediately creates a SriovNetworkNodePolicy, the admission
webhook served by the sriov-network-operator may not yet be ready,
resulting in errors such as:

failed calling webhook "operator-webhook.sriovnetwork.openshift.io":
context deadline exceeded

To avoid this race condition, the test now waits for the SR-IOV operator
pods in the operator namespace to reach the Running/Ready state before
continuing with further configuration steps.

This ensures the admission webhook is available and stabilizes the
NetworkPool RDMA tests in environments where node reboot occurs.
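The readiness predicate described above can be sketched with plain stand-in types (a toy model for illustration only, not the real corev1 API): a pod counts as ready only when its phase is Running and it carries a Ready condition with status True, and an empty pod list is treated as not ready so that missing pods are not mistaken for healthy ones.

```go
package main

import "fmt"

// Condition and Pod are illustrative stand-ins for the corev1 types.
type Condition struct {
	Type   string
	Status string
}

type Pod struct {
	Phase      string
	Conditions []Condition
}

// allPodsReady mirrors the wait logic discussed in this PR: every pod
// must be Running and have a Ready condition with status True, and an
// empty list is not considered ready.
func allPodsReady(pods []Pod) bool {
	if len(pods) == 0 {
		return false // no pods found must not count as "ready"
	}
	for _, p := range pods {
		if p.Phase != "Running" {
			return false
		}
		ready := false
		for _, c := range p.Conditions {
			if c.Type == "Ready" && c.Status == "True" {
				ready = true
				break
			}
		}
		if !ready {
			return false
		}
	}
	return true
}

func main() {
	creating := []Pod{{Phase: "Pending"}}
	runningNoCond := []Pod{{Phase: "Running"}}
	ready := []Pod{{Phase: "Running", Conditions: []Condition{{Type: "Ready", Status: "True"}}}}
	fmt.Println(allPodsReady(nil), allPodsReady(creating), allPodsReady(runningNoCond), allPodsReady(ready))
	// prints: false false false true
}
```

Note the Running-but-no-Ready-condition case: this is exactly the gap the first review comment below points out in the original implementation.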

We see the following errors:


[sriov] NetworkPool Check rdma metrics inside a pod in shared mode not exist should run pod without RDMA cni and not expose nic metrics
/root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:355
Waiting for the sriov state to stable
Sriov state is stable
STEP: Testing on node master-0.ocp-ashok.lnxero1.boe, 2 devices found @ 03/11/26 10:59:52.782
Waiting for the sriov state to stable
Sriov state is stable
STEP: Creating sriov network to use the rdma device @ 03/11/26 11:01:42.853
STEP: waiting for operator to finish the configuration @ 03/11/26 11:01:43.89
Waiting for the sriov state to stable
Sriov state is stable
STEP: creating a policy @ 03/11/26 11:06:08.645
[FAILED] in [It] - /root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:359 @ 03/11/26 11:06:36.928
Waiting for the sriov state to stable

Sriov state is stable
• [FAILED] [720.567 seconds]
[sriov] NetworkPool Check rdma metrics inside a pod in shared mode not exist [It] should run pod without RDMA cni and not expose nic metrics
/root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:355

[FAILED] Unexpected error:
<*errors.StatusError | 0xc000453400>:
Internal error occurred: failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s": context deadline exceeded
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Internal error occurred: failed calling webhook \"operator-webhook.sriovnetwork.openshift.io\": failed to call webhook: Post \"https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s\": context deadline exceeded",
Reason: "InternalError",
Details: {
Name: "",
Group: "",
Kind: "",
UID: "",
Causes: [
{
Type: "",
Message: "failed calling webhook \"operator-webhook.sriovnetwork.openshift.io\": failed to call webhook: Post \"https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s\": context deadline exceeded",
Field: "",
},
],
RetryAfterSeconds: 0,
},
Code: 500,
},
}
occurred
In [It] at: /root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:359 @ 03/11/26 11:06:36.928

@github-actions

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

@github-actions github-actions bot added the tests label Mar 12, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the reliability of SR-IOV NetworkPool conformance tests. It resolves a race condition where tests would proceed before critical operator components were fully ready after a node reboot, leading to webhook timeouts and test failures. By introducing an explicit waiting mechanism, the tests are now more robust and stable, especially in scenarios involving SR-IOV reconfigurations that trigger node reboots.

Highlights

  • Test Stability: Introduced an explicit wait for SR-IOV operator pods to reach a Running/Ready state before proceeding with test steps, preventing race conditions.
  • Race Condition Mitigation: Addressed intermittent test failures caused by operator webhooks not being ready after node reboots during SriovNetworkPoolConfig creation.
  • Conformance Tests: Stabilized NetworkPool RDMA conformance tests, particularly in environments where node reboots occur during SR-IOV reconfiguration.
Changelog
  • test/conformance/tests/test_networkpool.go
    • Added calls to WaitForOperatorPodsReady() after WaitForSRIOVStable() in two test blocks.
    • Implemented a new function WaitForOperatorPodsReady() to poll for all operator namespace pods to be in Running phase and Ready condition.
Activity
  • No review comments or other activity recorded yet.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a wait for operator pods to be ready, which is a good addition to stabilize the tests. However, the implementation of the wait function WaitForOperatorPodsReady has a logical flaw in how it determines pod readiness and also doesn't account for an empty list of pods. I've provided a suggestion to make the function more robust.

Comment on lines +379 to +402
func WaitForOperatorPodsReady() {
	Eventually(func() bool {
		podList := &corev1.PodList{}
		err := clients.List(context.Background(), podList,
			client.InNamespace(operatorNamespace))
		if err != nil {
			return false
		}

		for _, pod := range podList.Items {
			if pod.Status.Phase != corev1.PodRunning {
				return false
			}

			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
					return false
				}
			}
		}

		return true
	}, 2*time.Minute, 5*time.Second).Should(BeTrue())
}


high

The current implementation of WaitForOperatorPodsReady has a couple of issues that could lead to incorrect behavior:

  1. Flawed Readiness Check: The logic to check if a pod is ready is not robust. It fails to correctly handle the case where a pod is in the Running phase but its PodReady condition is not yet present or is False. A pod should only be considered ready if its PodReady condition is present and has a status of True.
  2. Empty Pod List: The function doesn't handle the case where the pod list is empty. If no pods are found in the operator namespace, the function will immediately return true. This could mask problems where operator pods fail to be created. It's safer to ensure at least one pod is present and that all of them are ready.

I've provided a suggestion to address both of these points to make the wait logic more reliable.

func WaitForOperatorPodsReady() {
    Eventually(func() bool {
        podList := &corev1.PodList{}
        err := clients.List(context.Background(), podList,
            client.InNamespace(operatorNamespace))
        if err != nil {
            return false
        }

        if len(podList.Items) == 0 {
            return false
        }

        for _, pod := range podList.Items {
            if pod.Status.Phase != corev1.PodRunning {
                return false
            }

            isReady := false
            for _, cond := range pod.Status.Conditions {
                if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
                    isReady = true
                    break
                }
            }

            if !isReady {
                return false
            }
        }

        return true
    }, 2*time.Minute, 5*time.Second).Should(BeTrue())
}

Collaborator


@ashokpariya0 please check this comment; it's right, you need to fix it.

Contributor Author


Done. Changed.

@ashokpariya0 ashokpariya0 force-pushed the wait-for-operator-pods branch from 5b53f83 to e0204cd Compare March 12, 2026 07:08
@coveralls

coveralls commented Mar 12, 2026

Pull Request Test Coverage Report for Build 23037478770

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.01%) to 63.353%

Files with Coverage Reduction | New Missed Lines | %
controllers/drain_controller_helper.go | 1 | 66.96%
pkg/daemon/config.go | 2 | 93.75%

Totals Coverage Status
Change from base Build 22930097770: +0.01%
Covered Lines: 9432
Relevant Lines: 14888

💛 - Coveralls

Member

@zeeke zeeke left a comment


Thanks for fixing this. Left a comment

})

func WaitForOperatorPodsReady() {
Eventually(func() bool {
Member


Can you please use the Eventually(func(g Gomega) { syntax?
It allows making assertions inside the Eventually block, giving way more information when the assertion fails.

L345 has an example of it

Collaborator


+1 on this from my side!

Contributor Author


Done. Changed.

Creating SriovNetworkPoolConfig may trigger node reboot during
SR-IOV reconfiguration. After the reboot, SR-IOV operator pods
(network-resources-injector, operator-webhook) take some
time to restart.

The test could proceed before these pods are ready, causing the
admission webhook to timeout when creating SriovNetworkNodePolicy.

Add an explicit wait for operator namespace pods to reach the
Running/Ready state before continuing the test to avoid this race
condition.

Signed-off-by: Ashok Pariya <ashok.pariya@ibm.com>
@ashokpariya0 ashokpariya0 force-pushed the wait-for-operator-pods branch from e0204cd to 98f4e99 Compare March 13, 2026 05:23
@ashokpariya0
Contributor Author

The failure in the job below seems unrelated to this PR's changes.
Test SR-IOV Operator / ocp (pull_request)

Comment on lines +382 to +413
Eventually(func(g Gomega) bool {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	if err != nil {
		return false
	}

	if len(podList.Items) == 0 {
		return false
	}

	for _, pod := range podList.Items {
		if pod.Status.Phase != corev1.PodRunning {
			return false
		}

		isReady := false
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				isReady = true
				break
			}
		}

		if !isReady {
			return false
		}
	}

	return true
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
Member


Sorry, I meant also to use the g Gomega parameter. Something like this:

Suggested change
Eventually(func(g Gomega) bool {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	if err != nil {
		return false
	}
	if len(podList.Items) == 0 {
		return false
	}
	for _, pod := range podList.Items {
		if pod.Status.Phase != corev1.PodRunning {
			return false
		}
		isReady := false
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				isReady = true
				break
			}
		}
		if !isReady {
			return false
		}
	}
	return true
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
Eventually(func(g Gomega) {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	g.Expect(err).ToNot(HaveOccurred())
	g.Expect(podList.Items).ToNot(BeEmpty())
	for _, p := range podList.Items {
		g.Expect(p.Status.Phase).To(Equal(corev1.PodRunning))
		var ready bool
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready = true
				break
			}
		}
		g.Expect(ready).To(BeTrue())
	}
}, 2*time.Minute, 5*time.Second).Should(Succeed())

Each g.Expect(...) makes the Eventually block fail (not the entire test), giving useful information about what went wrong (like an empty pod list, a missing ready condition, or a non-running phase, ...).


By("checking the amount of allocatable devices remains after device plugin reset")
Consistently(func() int64 {
Eventually(func() int64 {
Member


Why this change? I think the Consistently block was used on purpose.

Contributor Author


I see an issue here: the test intentionally deletes the device plugin pod, which restarts it. So before https://github.com/ashokpariya0/sriov-network-operator/blob/master/test/conformance/tests/test_networkpool.go#L270-L277
we should wait for the pod to come back into the Running state if we are using Consistently, shouldn't we?
What I see is that the allocatable resources change after the pod is deleted and until it comes back to the Running state.

I was seeing the error below with the Consistently block:
Sriov state is stable
STEP: =========restart device plugin @ 03/12/26 16:50:21.233
STEP: checking the amount of allocatable devices remains after device plugin reset @ 03/12/26 16:50:22.244
[FAILED] in [It] - /root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:286 @ 03/12/26 16:50:47.124
Waiting for the sriov state to stable
Sriov state is stable
• [FAILED] [793.176 seconds]
[sriov] NetworkPool Check rdma metrics inside a pod in exclusive mode [It] should run pod with RDMA cni and expose nic metrics and another one without rdma info
/root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:225

[FAILED] Failed after 5.009s.
Expected
: 0
to equal
: 5
In [It] at: /root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:286 @ 03/12/26 16:50:47.124

It indicates the test checked the node while the device plugin was still restarting. The test therefore needs to wait for the device plugin to finish re-registering before validating that the allocatable value returns to the original value. What do you think?

Contributor Author


Summary:
During the test we explicitly delete the sriov-device-plugin pod to simulate a plugin restart. When the plugin disconnects from kubelet, Kubernetes temporarily removes the advertised SR-IOV resources from the node status until the plugin re-registers. During this window the allocatable value can briefly drop to 0, which is what we observed in the failure (Expected 0 to equal 5). The test therefore needs to wait for the device plugin to finish re-registering before validating that the allocatable value returns to the original value, right?
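The re-registration window described above can be illustrated with a toy simulation (all names here are illustrative, not part of the operator's API): the allocatable count drops to 0 while the plugin is disconnected, so a poll should first observe the expected value again before any Consistently-style stability check begins.

```go
package main

import "fmt"

// firstStableIndex scans a sequence of allocatable-count samples (as a
// node might report them over time) and returns the index of the first
// sample that matches the expected device count, i.e. the earliest
// point at which a stability check could safely start. Returns -1 if
// the expected value never reappears within the samples.
func firstStableIndex(samples []int64, expected int64) int {
	for i, v := range samples {
		if v == expected {
			return i
		}
	}
	return -1
}

func main() {
	// Observed pattern from the failure above: 5 devices advertised,
	// dropping to 0 during the plugin restart, then back to 5.
	samples := []int64{0, 0, 0, 5}
	fmt.Println(firstStableIndex(samples, 5))
	// prints: 3
}
```

Checking the allocatable value immediately after the pod deletion (index 0 here) is exactly the "Expected 0 to equal 5" failure; waiting until the first stable sample before asserting avoids it.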
