43 changes: 41 additions & 2 deletions test/conformance/tests/test_networkpool.go
@@ -212,6 +212,7 @@ var _ = Describe("[sriov] NetworkPool", Ordered, func() {

By("waiting for operator to finish the configuration")
WaitForSRIOVStable()
WaitForOperatorPodsReady()
nodeState := &sriovv1.SriovNetworkNodeState{}
Eventually(func(g Gomega) {
err = clients.Get(context.Background(), client.ObjectKey{Name: testNode, Namespace: operatorNamespace}, nodeState)
@@ -229,6 +230,7 @@ var _ = Describe("[sriov] NetworkPool", Ordered, func() {

By("waiting for operator to finish the configuration")
WaitForSRIOVStable()
WaitForOperatorPodsReady()
podDefinition := pod.DefineWithNetworks([]string{"test-rdmanetwork"})
firstPod, err := clients.Pods(namespaces.Test).Create(context.Background(), podDefinition, metav1.CreateOptions{})
Expect(err).ToNot(HaveOccurred())
@@ -268,13 +270,13 @@ var _ = Describe("[sriov] NetworkPool", Ordered, func() {
}

By("checking the amount of allocatable devices remains after device plugin reset")
-Consistently(func() int64 {
+Eventually(func() int64 {
Member:
Why this change? I think the Consistently block was used on purpose.

Contributor Author:
I see an issue here: the test intentionally deletes the device plugin pod, which restarts it. Before https://github.com/ashokpariya0/sriov-network-operator/blob/master/test/conformance/tests/test_networkpool.go#L270-L277 runs, shouldn't we wait for the pod to come back into the Running state if we are using Consistently? What I see is that the allocatable resources change after the pod is deleted and again once it comes back to Running.

I was seeing the error below with the Consistently block:
Sriov state is stable
STEP: =========restart device plugin @ 03/12/26 16:50:21.233
STEP: checking the amount of allocatable devices remains after device plugin reset @ 03/12/26 16:50:22.244
[FAILED] in [It] - /root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:286 @ 03/12/26 16:50:47.124
Waiting for the sriov state to stable
Sriov state is stable
• [FAILED] [793.176 seconds]
[sriov] NetworkPool Check rdma metrics inside a pod in exclusive mode [It] should run pod with RDMA cni and expose nic metrics and another one without rdma info
/root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:225

[FAILED] Failed after 5.009s.
Expected
: 0
to equal
: 5
In [It] at: /root/ap/ap-srio-repo/sriov-network-operator/test/conformance/tests/test_networkpool.go:286 @ 03/12/26 16:50:47.124

This indicates the test checked the node while the device plugin was still restarting. The test therefore needs to wait for the device plugin to finish re-registering before validating that the allocatable value returns to the original value. What do you think?

Contributor Author:
Summary: during the test we explicitly delete the sriov-device-plugin pod to simulate a plugin restart. When the plugin disconnects from kubelet, Kubernetes temporarily removes the advertised SR-IOV resources from the node status until the plugin re-registers. During this window the allocatable value can briefly drop to 0, which is what we observed in the failure (Expected 0 to equal 5). The test therefore needs to wait for the device plugin to finish re-registering before validating that the allocatable value returns to the original value, right?

err = clients.Get(context.Background(), client.ObjectKey{Name: testNode}, testedNode)
Expect(err).ToNot(HaveOccurred())
resNum := testedNode.Status.Allocatable[corev1.ResourceName("openshift.io/"+resourceName)]
newAllocatable, _ := resNum.AsInt64()
return newAllocatable
-}, 1*time.Minute, 5*time.Second).Should(Equal(allocatable))
+}, 2*time.Minute, 5*time.Second).Should(Equal(allocatable))

By("checking counters inside the pods")
strOut, _, err := pod.ExecCommand(clients, firstPod, "/bin/bash", "-c", "ip link show net1")
@@ -340,6 +342,7 @@ var _ = Describe("[sriov] NetworkPool", Ordered, func() {
Expect(err).ToNot(HaveOccurred())
By("waiting for operator to finish the configuration")
WaitForSRIOVStable()
WaitForOperatorPodsReady()
nodeState := &sriovv1.SriovNetworkNodeState{}
Eventually(func(g Gomega) {
err = clients.Get(context.Background(), client.ObjectKey{Name: testNode, Namespace: operatorNamespace}, nodeState)
@@ -355,6 +358,7 @@ var _ = Describe("[sriov] NetworkPool", Ordered, func() {
func(policy *sriovv1.SriovNetworkNodePolicy) { policy.Spec.IsRdma = true })
Expect(err).ToNot(HaveOccurred())
WaitForSRIOVStable()
WaitForOperatorPodsReady()

podDefinition := pod.DefineWithNetworks([]string{"test-rdmanetwork"})
firstPod, err := clients.Pods(namespaces.Test).Create(context.Background(), podDefinition, metav1.CreateOptions{})
@@ -373,3 +377,38 @@ var _ = Describe("[sriov] NetworkPool", Ordered, func() {
})
})
})

func WaitForOperatorPodsReady() {
Eventually(func(g Gomega) bool {
podList := &corev1.PodList{}
err := clients.List(context.Background(), podList,
client.InNamespace(operatorNamespace))
if err != nil {
return false
}

if len(podList.Items) == 0 {
return false
}

for _, pod := range podList.Items {
if pod.Status.Phase != corev1.PodRunning {
return false
}

isReady := false
for _, cond := range pod.Status.Conditions {
if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
isReady = true
break
}
}

if !isReady {
return false
}
}

return true
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
Comment on lines +382 to +413
Member:
Sorry, I meant to also use the g Gomega parameter. Something like this:

Suggested change
Eventually(func(g Gomega) bool {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	if err != nil {
		return false
	}
	if len(podList.Items) == 0 {
		return false
	}
	for _, pod := range podList.Items {
		if pod.Status.Phase != corev1.PodRunning {
			return false
		}
		isReady := false
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				isReady = true
				break
			}
		}
		if !isReady {
			return false
		}
	}
	return true
}, 2*time.Minute, 5*time.Second).Should(BeTrue())
Eventually(func(g Gomega) {
	podList := &corev1.PodList{}
	err := clients.List(context.Background(), podList,
		client.InNamespace(operatorNamespace))
	g.Expect(err).ToNot(HaveOccurred())
	g.Expect(podList.Items).ToNot(BeEmpty())
	for _, p := range podList.Items {
		g.Expect(p.Status.Phase).To(Equal(corev1.PodRunning))
		var ready bool
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready = true
				break
			}
		}
		g.Expect(ready).To(BeTrue())
	}
}, 2*time.Minute, 5*time.Second).Should(Succeed())

Each g.Expect(...) makes the current Eventually iteration fail (not the entire test), giving useful information about what went wrong (like an empty pod list, a missing ready condition, or a non-Running phase). Note that a func(g Gomega) poller returns nothing, so the assertion becomes Should(Succeed()).

}
Comment on lines +381 to +414

Severity: high

The current implementation of WaitForOperatorPodsReady has a couple of issues that could lead to incorrect behavior:

  1. Flawed Readiness Check: The logic to check if a pod is ready is not robust. It fails to correctly handle the case where a pod is in the Running phase but its PodReady condition is not yet present or is False. A pod should only be considered ready if its PodReady condition is present and has a status of True.
  2. Empty Pod List: The function doesn't handle the case where the pod list is empty. If no pods are found in the operator namespace, the function will immediately return true. This could mask problems where operator pods fail to be created. It's safer to ensure at least one pod is present and that all of them are ready.

I've provided a suggestion to address both of these points to make the wait logic more reliable.

func WaitForOperatorPodsReady() {
    Eventually(func() bool {
        podList := &corev1.PodList{}
        err := clients.List(context.Background(), podList,
            client.InNamespace(operatorNamespace))
        if err != nil {
            return false
        }

        if len(podList.Items) == 0 {
            return false
        }

        for _, pod := range podList.Items {
            if pod.Status.Phase != corev1.PodRunning {
                return false
            }

            isReady := false
            for _, cond := range pod.Status.Conditions {
                if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
                    isReady = true
                    break
                }
            }

            if !isReady {
                return false
            }
        }

        return true
    }, 2*time.Minute, 5*time.Second).Should(BeTrue())
}
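The readiness predicate both reviewers are converging on can be isolated as a small, self-contained sketch. The struct and constant names below are local stand-ins mirroring the k8s.io/api/core/v1 types the real test uses; they are assumptions introduced only to make the predicate runnable on its own.

```go
package main

import "fmt"

// Minimal local mirrors of the corev1 types involved (the real test uses
// k8s.io/api/core/v1); just enough to exercise the readiness check.
type PodConditionType string
type ConditionStatus string

const (
	PodReady      PodConditionType = "Ready"
	ConditionTrue ConditionStatus  = "True"
	PodRunning                     = "Running"
)

type PodCondition struct {
	Type   PodConditionType
	Status ConditionStatus
}

type PodStatus struct {
	Phase      string
	Conditions []PodCondition
}

// isPodReady implements the check the review asks for: a pod counts as ready
// only if it is Running AND its PodReady condition is present with status
// True. A Running pod whose PodReady condition is absent or not True is
// correctly reported as not ready.
func isPodReady(s PodStatus) bool {
	if s.Phase != PodRunning {
		return false
	}
	for _, c := range s.Conditions {
		if c.Type == PodReady && c.Status == ConditionTrue {
			return true
		}
	}
	return false // Running, but PodReady absent or not True
}

func main() {
	ready := PodStatus{Phase: PodRunning, Conditions: []PodCondition{{PodReady, ConditionTrue}}}
	noCond := PodStatus{Phase: PodRunning} // Running but PodReady condition missing
	fmt.Println(isPodReady(ready), isPodReady(noCond)) // true false
}
```

Applied across every pod in the list (and combined with a non-empty check on the list itself), this is exactly the behavior the suggested WaitForOperatorPodsReady polls for.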

Collaborator:
@ashokpariya0 please check this comment; it's right, you need to fix it.

Contributor Author:
Done, changed.
