test: wait for operator pods readiness after networkpool configuration #1056
ashokpariya0 wants to merge 1 commit into k8snetworkplumbingwg:master
Conversation
Thanks for your PR.
To skip the vendor CIs, maintainers can use one of:
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the reliability of the SR-IOV NetworkPool conformance tests. It resolves a race condition where tests would proceed before critical operator components were fully ready after a node reboot, leading to webhook timeouts and test failures. By introducing an explicit waiting mechanism, the tests are now more robust and stable, especially in scenarios involving SR-IOV reconfigurations that trigger node reboots.
Code Review
The pull request introduces a wait for operator pods to be ready, which is a good addition to stabilize the tests. However, the implementation of the wait function WaitForOperatorPodsReady has a logical flaw in how it determines pod readiness and also doesn't account for an empty list of pods. I've provided a suggestion to make the function more robust.
func WaitForOperatorPodsReady() {
	Eventually(func() bool {
		podList := &corev1.PodList{}
		err := clients.List(context.Background(), podList,
			client.InNamespace(operatorNamespace))
		if err != nil {
			return false
		}

		for _, pod := range podList.Items {
			if pod.Status.Phase != corev1.PodRunning {
				return false
			}

			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
					return false
				}
			}
		}

		return true
	}, 2*time.Minute, 5*time.Second).Should(BeTrue())
}
The current implementation of WaitForOperatorPodsReady has a couple of issues that could lead to incorrect behavior:

- Flawed readiness check: the logic to check whether a pod is ready is not robust. It fails to handle the case where a pod is in the Running phase but its PodReady condition is not yet present. A pod should only be considered ready if its PodReady condition is present and has a status of True.
- Empty pod list: the function doesn't handle the case where the pod list is empty. If no pods are found in the operator namespace, the function immediately returns true. This could mask problems where operator pods fail to be created. It's safer to ensure at least one pod is present and that all of them are ready.

I've provided a suggestion to address both of these points to make the wait logic more reliable.
func WaitForOperatorPodsReady() {
Eventually(func() bool {
podList := &corev1.PodList{}
err := clients.List(context.Background(), podList,
client.InNamespace(operatorNamespace))
if err != nil {
return false
}
if len(podList.Items) == 0 {
return false
}
for _, pod := range podList.Items {
if pod.Status.Phase != corev1.PodRunning {
return false
}
isReady := false
for _, cond := range pod.Status.Conditions {
if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
isReady = true
break
}
}
if !isReady {
return false
}
}
return true
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
}
@ashokpariya0 please check this comment; it's right, you need to fix it.
Force-pushed 5b53f83 to e0204cd
Pull Request Test Coverage Report for Build 23037478770 (Coveralls)
zeeke left a comment:
Thanks for fixing this. Left a comment.
})

func WaitForOperatorPodsReady() {
	Eventually(func() bool {
Can you please use the Eventually(func(g Gomega) { syntax?
It allows making assertions in the Eventually block, giving way more information when the assertion fails.
L345 has an example of it.
Creating SriovNetworkPoolConfig may trigger a node reboot during SR-IOV reconfiguration. After the reboot, the SR-IOV operator pods (network-resources-injector, operator-webhook) take some time to restart. The test could proceed before these pods are ready, causing the admission webhook to time out when creating a SriovNetworkNodePolicy. Add an explicit wait for the operator namespace pods to reach the Running/Ready state before continuing the test, to avoid this race condition.

Signed-off-by: Ashok Pariya <ashok.pariya@ibm.com>
Force-pushed e0204cd to 98f4e99
Failure in the below job seems not related to this PR's changes.
Eventually(func(g Gomega) bool {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	if err != nil {
		return false
	}

	if len(podList.Items) == 0 {
		return false
	}

	for _, pod := range podList.Items {
		if pod.Status.Phase != corev1.PodRunning {
			return false
		}

		isReady := false
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				isReady = true
				break
			}
		}

		if !isReady {
			return false
		}
	}

	return true
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
Sorry, I meant also to use the g Gomega parameter. Something like this:
Suggested change:

Eventually(func(g Gomega) {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	g.Expect(err).ToNot(HaveOccurred())
	g.Expect(podList.Items).ToNot(BeEmpty())
	for _, p := range podList.Items {
		g.Expect(p.Status.Phase).To(Equal(corev1.PodRunning))
		var ready bool
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready = true
				break
			}
		}
		g.Expect(ready).To(BeTrue())
	}
}, 2*time.Minute, 5*time.Second).Should(Succeed())
Each g.Expect(...) makes the Eventually iteration fail (not the entire test), giving useful information about what went wrong (empty pod list, not-ready condition, not-running phase, ...).
By("checking the amount of allocatable devices remains after device plugin reset")
-Consistently(func() int64 {
+Eventually(func() int64 {
Why this change? I think the Consistently block was used on purpose.
I see an issue here: the test intentionally deletes the device plugin pod, which restarts it.
So before https://github.com/ashokpariya0/sriov-network-operator/blob/master/test/conformance/tests/test_networkpool.go#L270-L277
we should wait for the pod to come back into the Running state if we are using Consistently?
What I see is that the allocatable resources change after the pod is deleted and before the pods come back to the Running state. I was seeing the below error with the Consistently block:
Sriov state is stable
STEP: =========restart device plugin @ 03/12/26 16:50:21.233
STEP: checking the amount of allocatable devices remains after device plugin reset @ 03/12/26 16:50:22.244
[FAILED] in [It] - /root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:286 @ 03/12/26 16:50:47.124
Waiting for the sriov state to stable
Sriov state is stable
• [FAILED] [793.176 seconds]
[sriov] NetworkPool Check rdma metrics inside a pod in exclusive mode [It] should run pod with RDMA cni and expose nic metrics and another one without rdma info
/root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:225
[FAILED] Failed after 5.009s.
Expected
: 0
to equal
: 5
In [It] at: /root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:286 @ 03/12/26 16:50:47.124
It indicates the test checked the node while the device plugin was still restarting, so the test needs to wait for the device plugin to finish re-registering before validating that the allocatable value returns to the original value. What do you think?
Summary:
During the test we explicitly delete the sriov-device-plugin pod to simulate a plugin restart. When the plugin disconnects from kubelet, Kubernetes temporarily removes the advertised SR-IOV resources from the node status until the plugin re-registers. During this window the allocatable value can briefly drop to 0, which is what we observed in the failure (Expected 0 to equal 5). The test therefore needs to wait for the device plugin to finish re-registering before validating that the allocatable value returns to the original value, right?
Creating SriovNetworkPoolConfig (sriov-network-operator/test/conformance/tests/test_networkpool.go, line 210 in 808502c): after the reboot, SR-IOV operator pods (network-resources-injector, operator-webhook) take some time to restart.
We can see the pods below in ContainerCreating state after the node reboot, even after the log message "Sriov state is stable":
$oc get all -n openshift-sriov-network-operator
Warning: kubevirt.io/v1 VirtualMachineInstancePresets is now deprecated and will be removed in v2.
NAME READY STATUS RESTARTS AGE
pod/network-resources-injector-nxkm5 0/1 ContainerCreating 15 25h
pod/operator-webhook-c9rtw 0/1 ContainerCreating 16 25h
pod/sriov-network-config-daemon-5lpms 1/1 Running 1 7m59s
pod/sriov-network-operator-bfb9b97cc-qdkj2 1/1 Running 16 25h
pod/testpod-79dmm 1/1 Running 9 21h
The test could proceed before these pods are ready, causing the admission webhook to time out when creating SriovNetworkNodePolicy.
Add an explicit wait for operator namespace pods to reach the Running/Ready state before continuing the test to avoid this race condition.
NetworkPool conformance tests may fail intermittently after creating
SriovNetworkPoolConfig because the node can reboot as part of the
configuration process.
After the reboot, several SR-IOV operator components restart, including
the sriov-network-operator, sriov-network-config-daemon and device plugin
pods. The test may proceed while these pods are still initializing.
When the test immediately creates a SriovNetworkNodePolicy, the admission
webhook served by the sriov-network-operator may not yet be ready,
resulting in errors such as:
failed calling webhook "operator-webhook.sriovnetwork.openshift.io":
context deadline exceeded
To avoid this race condition, the test now waits for the SR-IOV operator
pods in the operator namespace to reach the Running/Ready state before
continuing with further configuration steps.
This ensures the admission webhook is available and stabilizes the
NetworkPool RDMA tests in environments where node reboot occurs.
We see the below errors:
[sriov] NetworkPool Check rdma metrics inside a pod in shared mode not exist should run pod without RDMA cni and not expose nic metrics
/root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:355
Waiting for the sriov state to stable
Sriov state is stable
STEP: Testing on node master-0.ocp-ashok.lnxero1.boe, 2 devices found @ 03/11/26 10:59:52.782
Waiting for the sriov state to stable
Sriov state is stable
STEP: Creating sriov network to use the rdma device @ 03/11/26 11:01:42.853
STEP: waiting for operator to finish the configuration @ 03/11/26 11:01:43.89
Waiting for the sriov state to stable
Sriov state is stable
STEP: creating a policy @ 03/11/26 11:06:08.645
[FAILED] in [It] - /root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:359 @ 03/11/26 11:06:36.928
Waiting for the sriov state to stable
Sriov state is stable
• [FAILED] [720.567 seconds]
[sriov] NetworkPool Check rdma metrics inside a pod in shared mode not exist [It] should run pod without RDMA cni and not expose nic metrics
/root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:355
[FAILED] Unexpected error:
<*errors.StatusError | 0xc000453400>:
Internal error occurred: failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s": context deadline exceeded
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Internal error occurred: failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s\": context deadline exceeded",
Reason: "InternalError",
Details: {
Name: "",
Group: "",
Kind: "",
UID: "",
Causes: [
{
Type: "",
Message: "failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s": context deadline exceeded",
Field: "",
},
],
RetryAfterSeconds: 0,
},
Code: 500,
},
}
occurred
In [It] at: /root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:359 @ 03/11/26 11:06:36.928