test: wait for operator pods readiness after networkpool configuration #1056
ashokpariya0 wants to merge 1 commit into k8snetworkplumbingwg:master
Conversation
Thanks for your PR.
To skip the vendor CIs, maintainers can use one of:
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the reliability of the SR-IOV NetworkPool conformance tests. It resolves a race condition where tests would proceed before critical operator components were fully ready after a node reboot, leading to webhook timeouts and test failures. By introducing an explicit waiting mechanism, the tests are now more robust and stable, especially in scenarios involving SR-IOV reconfigurations that trigger node reboots.
Code Review
The pull request introduces a wait for operator pods to be ready, which is a good addition to stabilize the tests. However, the implementation of the wait function WaitForOperatorPodsReady has a logical flaw in how it determines pod readiness and also doesn't account for an empty list of pods. I've provided a suggestion to make the function more robust.
func WaitForOperatorPodsReady() {
	Eventually(func() bool {
		podList := &corev1.PodList{}
		err := clients.List(context.Background(), podList,
			client.InNamespace(operatorNamespace))
		if err != nil {
			return false
		}

		for _, pod := range podList.Items {
			if pod.Status.Phase != corev1.PodRunning {
				return false
			}

			for _, cond := range pod.Status.Conditions {
				if cond.Type == corev1.PodReady && cond.Status != corev1.ConditionTrue {
					return false
				}
			}
		}

		return true
	}, 2*time.Minute, 5*time.Second).Should(BeTrue())
}
The current implementation of WaitForOperatorPodsReady has a couple of issues that could lead to incorrect behavior:

- Flawed readiness check: the logic to check whether a pod is ready is not robust. It fails to handle the case where a pod is in the Running phase but its PodReady condition is not yet present. A pod should only be considered ready if its PodReady condition is present and has a status of True.
- Empty pod list: the function doesn't handle the case where the pod list is empty. If no pods are found in the operator namespace, the function immediately returns true. This could mask problems where operator pods fail to be created. It's safer to ensure at least one pod is present and that all of them are ready.

I've provided a suggestion to address both of these points to make the wait logic more reliable.
func WaitForOperatorPodsReady() {
Eventually(func() bool {
podList := &corev1.PodList{}
err := clients.List(context.Background(), podList,
client.InNamespace(operatorNamespace))
if err != nil {
return false
}
if len(podList.Items) == 0 {
return false
}
for _, pod := range podList.Items {
if pod.Status.Phase != corev1.PodRunning {
return false
}
isReady := false
for _, cond := range pod.Status.Conditions {
if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
isReady = true
break
}
}
if !isReady {
return false
}
}
return true
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
}
@ashokpariya0 please check this comment; it's right, you need to fix it.
Force-pushed 5b53f83 to e0204cd
Pull Request Test Coverage Report for Build 23037478770 (Coveralls)
zeeke left a comment:
Thanks for fixing this. Left a comment.
})

func WaitForOperatorPodsReady() {
	Eventually(func() bool {
Can you please use the Eventually(func(g Gomega) { syntax?
It allows making assertions in the Eventually block, giving way more information when the assertion fails.
L345 has an example of it.
Creating SriovNetworkPoolConfig may trigger a node reboot during SR-IOV reconfiguration. After the reboot, the SR-IOV operator pods (network-resources-injector, operator-webhook) take some time to restart. The test could proceed before these pods are ready, causing the admission webhook to time out when creating a SriovNetworkNodePolicy. Add an explicit wait for the operator namespace pods to reach the Running/Ready state before continuing the test, to avoid this race condition.

Signed-off-by: Ashok Pariya <ashok.pariya@ibm.com>
Force-pushed e0204cd to 98f4e99
Failure in the below job seems not related to this PR's changes.
Eventually(func(g Gomega) bool {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	if err != nil {
		return false
	}

	if len(podList.Items) == 0 {
		return false
	}

	for _, pod := range podList.Items {
		if pod.Status.Phase != corev1.PodRunning {
			return false
		}

		isReady := false
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				isReady = true
				break
			}
		}

		if !isReady {
			return false
		}
	}

	return true
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
Sorry, I meant also to use the g Gomega parameter. Something like this:
Suggested change:

Eventually(func(g Gomega) {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	g.Expect(err).ToNot(HaveOccurred())
	g.Expect(podList.Items).ToNot(BeEmpty())
	for _, p := range podList.Items {
		g.Expect(p.Status.Phase).To(Equal(corev1.PodRunning))
		var ready bool
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready = true
				break
			}
		}
		g.Expect(ready).To(BeTrue())
	}
}, 2*time.Minute, 5*time.Second).Should(Succeed())
Each g.Expect(...) makes the Eventually iteration fail (not the entire test), giving useful information about what went wrong (empty pod list, not-ready condition, not-running phase, ...).
By("checking the amount of allocatable devices remains after device plugin reset")
-Consistently(func() int64 {
+Eventually(func() int64 {
Why this change? I think the Consistently block was used on purpose.
I see an issue here: the test intentionally deletes the device plugin pod, which restarts it.
So before https://github.com/ashokpariya0/sriov-network-operator/blob/master/test/conformance/tests/test_networkpool.go#L270-L277
we should wait for the pod to come back into the Running state if we are using Consistently?
What I see is that the allocatable resources change after the pod is deleted and before the pods come back to the Running state. I was seeing the below error with the Consistently block:
Sriov state is stable
STEP: =========restart device plugin @ 03/12/26 16:50:21.233
STEP: checking the amount of allocatable devices remains after device plugin reset @ 03/12/26 16:50:22.244
[FAILED] in [It] - /root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:286 @ 03/12/26 16:50:47.124
Waiting for the sriov state to stable
Sriov state is stable
• [FAILED] [793.176 seconds]
[sriov] NetworkPool Check rdma metrics inside a pod in exclusive mode [It] should run pod with RDMA cni and expose nic metrics and another one without rdma info
/root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:225
[FAILED] Failed after 5.009s.
Expected
: 0
to equal
: 5
In [It] at: /root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:286 @ 03/12/26 16:50:47.124
It indicates the test checked the node while the device plugin was still restarting, so the test needs to wait for the device plugin to finish re-registering before validating that the allocatable value returns to the original value. What do you think?
Summary:
During the test we explicitly delete the sriov-device-plugin pod to simulate a plugin restart. When the plugin disconnects from kubelet, Kubernetes temporarily removes the advertised SR-IOV resources from the node status until the plugin re-registers. During this window the allocatable value can briefly drop to 0, which is what we observed in the failure (Expected 0 to equal 5). The test therefore needs to wait for the device plugin to finish re-registering before validating that the allocatable value returns to the original value, right?
Creating SriovNetworkPoolConfig (sriov-network-operator/test/conformance/tests/test_networkpool.go, line 210 in 808502c): after the reboot, SR-IOV operator pods (network-resources-injector, operator-webhook) take some time to restart.
We can see the pods below in ContainerCreating state after the node reboot, even after the log message "Sriov state is stable":
$oc get all -n openshift-sriov-network-operator
Warning: kubevirt.io/v1 VirtualMachineInstancePresets is now deprecated and will be removed in v2.
NAME READY STATUS RESTARTS AGE
pod/network-resources-injector-nxkm5 0/1 ContainerCreating 15 25h
pod/operator-webhook-c9rtw 0/1 ContainerCreating 16 25h
pod/sriov-network-config-daemon-5lpms 1/1 Running 1 7m59s
pod/sriov-network-operator-bfb9b97cc-qdkj2 1/1 Running 16 25h
pod/testpod-79dmm 1/1 Running 9 21h
The test could proceed before these pods are ready, causing the admission webhook to time out when creating SriovNetworkNodePolicy.
Add an explicit wait for operator namespace pods to reach the Running/Ready state before continuing the test to avoid this race condition.
NetworkPool conformance tests may fail intermittently after creating
SriovNetworkPoolConfig because the node can reboot as part of the
configuration process.
After the reboot, several SR-IOV operator components restart, including
the sriov-network-operator, sriov-network-config-daemon and device plugin
pods. The test may proceed while these pods are still initializing.
When the test immediately creates a SriovNetworkNodePolicy, the admission
webhook served by the sriov-network-operator may not yet be ready,
resulting in errors such as:
failed calling webhook "operator-webhook.sriovnetwork.openshift.io":
context deadline exceeded
To avoid this race condition, the test now waits for the SR-IOV operator
pods in the operator namespace to reach the Running/Ready state before
continuing with further configuration steps.
This ensures the admission webhook is available and stabilizes the
NetworkPool RDMA tests in environments where node reboot occurs.
We see the below errors:
[sriov] NetworkPool Check rdma metrics inside a pod in shared mode not exist should run pod without RDMA cni and not expose nic metrics
/root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:355
Waiting for the sriov state to stable
Sriov state is stable
STEP: Testing on node master-0.ocp-ashok.lnxero1.boe, 2 devices found @ 03/11/26 10:59:52.782
Waiting for the sriov state to stable
Sriov state is stable
STEP: Creating sriov network to use the rdma device @ 03/11/26 11:01:42.853
STEP: waiting for operator to finish the configuration @ 03/11/26 11:01:43.89
Waiting for the sriov state to stable
Sriov state is stable
STEP: creating a policy @ 03/11/26 11:06:08.645
[FAILED] in [It] - /root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:359 @ 03/11/26 11:06:36.928
Waiting for the sriov state to stable
Sriov state is stable
• [FAILED] [720.567 seconds]
[sriov] NetworkPool Check rdma metrics inside a pod in shared mode not exist [It] should run pod without RDMA cni and not expose nic metrics
/root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:355
[FAILED] Unexpected error:
<*errors.StatusError | 0xc000453400>:
Internal error occurred: failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s": context deadline exceeded
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "Internal error occurred: failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s\": context deadline exceeded",
Reason: "InternalError",
Details: {
Name: "",
Group: "",
Kind: "",
UID: "",
Causes: [
{
Type: "",
Message: "failed calling webhook "operator-webhook.sriovnetwork.openshift.io": failed to call webhook: Post "https://operator-webhook-service.openshift-sriov-network-operator.svc:443/mutating-custom-resource?timeout=10s": context deadline exceeded",
Field: "",
},
],
RetryAfterSeconds: 0,
},
Code: 500,
},
}
occurred
In [It] at: /root/ap/sriov-network-operator/test/conformance/tests/test_networkpool.go:359 @ 03/11/26 11:06:36.928