Skip to content

tests/ocp/sriov: fix SR-IOV operator reinstallation test failures#1303

Open
zhiqiangf wants to merge 6 commits intorh-ecosystem-edge:mainfrom
zhiqiangf:fix-sriov-operator-reinstallation
Open

tests/ocp/sriov: fix SR-IOV operator reinstallation test failures#1303
zhiqiangf wants to merge 6 commits intorh-ecosystem-edge:mainfrom
zhiqiangf:fix-sriov-operator-reinstallation

Conversation

@zhiqiangf
Copy link
Copy Markdown
Contributor

@zhiqiangf zhiqiangf commented Mar 27, 2026

Summary

Fixes three bugs in reinstallation.go that caused tests 46530–46532 to fail consistently.

Root causes

Test 46530 — namespace stuck Terminating

removeSriovOperator() calls sriovNamespace.DeleteAndWait() while a SriovNetworkPoolConfig (created by the lab's enable_offload() step) still exists with a finalizer poolconfig.finalizers.sriovnetwork.openshift. Once namespace deletion starts, the operator controller that would remove the finalizer is also being deleted — a deadlock that leaves the namespace stuck in Terminating indefinitely.

Fix: Add sriov.CleanAllPoolConfigs() before DeleteAndWait(), while the operator is still running and can process the finalizer removal.

Tests 46531 — cascades from 46530 — no independent fix needed.

Test 46532 — daemonsets never start after reinstall

installSriovOperator() created the SriovOperatorConfig immediately after the Subscription, before the operator CSV reached Succeeded and before any controller was running to act on it. As a result the daemonsets (sriov-network-config-daemon, network-resources-injector, operator-webhook) never started.

Fix: Add an Eventually poll for CSV Succeeded before creating SriovOperatorConfig, mirroring the create_default_soc step in the lab's deploy_cluster.sh.

Test 46532 — IsSriovDeployed timeout too short

The Eventually(sriovoperator.IsSriovDeployed, time.Minute, ...) check gave only 1 minute. Daemonsets take 60–90+ seconds to come up after SriovOperatorConfig is created.

Fix: Increase timeout from time.Minute to 5*time.Minute.

Test plan

  • Run --label-filter="sriovreinstall" against tests/ocp/sriov on a cluster with the lab's test-operator-catalog CatalogSource in place
  • All 6 reinstallation tests (46528–46533) should pass

Made with Cursor

Summary by CodeRabbit

  • Tests
    • Increased SR-IOV operator deployment verification timeout to improve reliability during slower installs.
    • Added validation that waits for a network attachment to appear after creating an SR‑IOV network.
    • Ensured pool configurations are fully removed before cleanup proceeds.
    • Added a pre-install gate that waits for the operator's ClusterServiceVersion to report success before creating the default configuration.
    • Subscription creation now honors an optional starting CSV when present.

Three bugs fixed in reinstallation.go:

1. removeSriovOperator(): clean SriovNetworkPoolConfigs before namespace
   deletion. SriovNetworkPoolConfigs carry a finalizer
   (poolconfig.finalizers.sriovnetwork.openshift) set by the operator
   controller. If the namespace deletion starts while these resources
   still exist, the controller responsible for removing the finalizer is
   also being deleted, creating a deadlock that leaves the namespace
   stuck in Terminating indefinitely. Fix: call CleanAllPoolConfigs()
   while the operator is still running, before DeleteAndWait().

2. installSriovOperator(): wait for operator CSV to reach Succeeded phase
   before creating SriovOperatorConfig. Previously the SriovOperatorConfig
   was applied immediately after the Subscription, before the operator
   controller was up. This mirrors the deploy_cluster.sh create_default_soc
   step which polls for the CSV before creating the default config.

3. Test 46532: increase IsSriovDeployed Eventually timeout from 1 minute
   to 5 minutes. The daemonsets take 60-90+ seconds to come up after
   SriovOperatorConfig is created, so 1 minute was frequently too short.

Made-with: Cursor
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 560f25d1-e84b-47ec-a034-013aab764e81

📥 Commits

Reviewing files that changed from the base of the PR and between 5021369 and 79fcc01.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/reinstallation.go
✅ Files skipped from review due to trivial changes (1)
  • tests/ocp/sriov/tests/reinstallation.go

📝 Walkthrough

Walkthrough

Additions to the SR-IOV operator reinstallation test: wait for a NetworkAttachmentDefinition after creating an SR-IOV Network, longer operator readiness polling, explicit pool-config cleanup and polling before deleting operator config, optional Subscription StartingCSV handling, and a CSV readiness gating step before creating the default SriovOperatorConfig.

Changes

Cohort / File(s) Summary
SR-IOV Operator Test Flow
tests/ocp/sriov/tests/reinstallation.go
Added strings and nad usage; after creating an SR-IOV Network, poll for the NetworkAttachmentDefinition in the test namespace and fail with a clear message if absent; increased sriovoperator.IsSriovDeployed(...).ShouldNot(HaveOccurred()) timeout from time.Minute to 5*time.Minute; before deleting the default SriovOperatorConfig, call sriov.CleanAllPoolConfigs(...) and poll sriov.ListPoolConfigs(...) until length is 0; build Subscription with optional StartingCSV when present and create it; add a gating poll (up to 5*time.Minute) for an OLM ClusterServiceVersion whose name contains "sriov" and csv.IsSuccessful() == true before creating the default SriovOperatorConfig.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately summarizes the main change: fixing SR-IOV operator reinstallation test failures, which is the primary focus of all changes in the pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/ocp/sriov/tests/reinstallation.go (1)

291-300: Consider filtering CSV by name for robustness.

Using csvList[0] assumes only one CSV exists in the namespace. While this is typically true for a freshly created SR-IOV operator namespace, during upgrade scenarios or if stale CSVs exist, the list order is not guaranteed to return the expected operator CSV.

Consider filtering by the expected CSV name (e.g., matching the operator package name from the subscription) for more deterministic behavior.

♻️ Suggested improvement
 	Eventually(func() bool {
 		csvList, err := olm.ListClusterServiceVersion(APIClient, sriovNamespace.Definition.Name)
 		if err != nil || len(csvList) == 0 {
 			return false
 		}
 
-		succeeded, err := csvList[0].IsSuccessful()
-
-		return err == nil && succeeded
+		for _, csv := range csvList {
+			// Match CSV belonging to the SR-IOV operator package
+			if csv.Definition.Spec.DisplayName == "" || 
+				!strings.Contains(csv.Definition.Name, "sriov") {
+				continue
+			}
+			succeeded, err := csv.IsSuccessful()
+			if err == nil && succeeded {
+				return true
+			}
+		}
+		return false
 	}, 5*time.Minute, tsparams.RetryInterval).Should(BeTrue(),
 		"SR-IOV operator CSV did not reach Succeeded phase within timeout")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ocp/sriov/tests/reinstallation.go` around lines 291 - 300, The code
assumes csvList[0] is the operator CSV; instead, filter the results from
olm.ListClusterServiceVersion(APIClient, sriovNamespace.Definition.Name) for the
expected CSV by name before checking IsSuccessful(). Locate the Subscription for
the SR-IOV operator (or derive the expected CSV name from the
subscription/package) and then iterate csvList to find the item whose Name (or
CSV.Name) matches that expected name; call IsSuccessful() only on that matched
CSV and return false if no match is found.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/ocp/sriov/tests/reinstallation.go`:
- Around line 147-150: The Eventually block containing
sriovoperator.IsSriovDeployed is misformatted; run gofmt to fix indentation and
alignment (e.g., run `gofmt -w tests/ocp/sriov/tests/reinstallation.go` or `go
fmt ./...`) so the call to Eventually(...).WithArguments(APIClient,
SriovOcpConfig.OcpSriovOperatorNamespace).ShouldNot(HaveOccurred(), ...) is
properly indented and matches surrounding code style.

---

Nitpick comments:
In `@tests/ocp/sriov/tests/reinstallation.go`:
- Around line 291-300: The code assumes csvList[0] is the operator CSV; instead,
filter the results from olm.ListClusterServiceVersion(APIClient,
sriovNamespace.Definition.Name) for the expected CSV by name before checking
IsSuccessful(). Locate the Subscription for the SR-IOV operator (or derive the
expected CSV name from the subscription/package) and then iterate csvList to
find the item whose Name (or CSV.Name) matches that expected name; call
IsSuccessful() only on that matched CSV and return false if no match is found.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 229ccf35-4ed2-446d-9f59-96bbbdd4be69

📥 Commits

Reviewing files that changed from the base of the PR and between 157a359 and 0f5db66.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/reinstallation.go

Comment thread tests/ocp/sriov/tests/reinstallation.go Outdated
@zhiqiangf zhiqiangf marked this pull request as draft March 27, 2026 03:13
Fix indentation of the Eventually block in test 46532 to match gofmt
requirements and unblock CI.

Made-with: Cursor
Comment thread tests/ocp/sriov/tests/reinstallation.go Outdated
Comment on lines +244 to +251
By("Removing SR-IOV network pool configs")

// SriovNetworkPoolConfigs carry a finalizer set by the operator controller.
// They must be deleted while the operator is still running; otherwise the
// finalizer can never be removed and the namespace gets stuck in Terminating.
err = sriov.CleanAllPoolConfigs(APIClient, SriovOcpConfig.SriovOperatorNamespace)
Expect(err).ToNot(HaveOccurred(), "Failed to clean SR-IOV network pool configs")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have the SR-IOV Pool configs on a cluster?

@evgenLevin
Copy link
Copy Markdown
Collaborator

I’m not entirely sure why SriovPoolConfigs are present on this cluster, but since they are already there, the changes LGTM.

Address CodeRabbit review comment: using csvList[0] is fragile if stale
or unrelated CSVs exist in the namespace. Filter by CSV name containing
"sriov" to deterministically match the SR-IOV operator CSV.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@zhiqiangf zhiqiangf marked this pull request as ready for review April 9, 2026 14:19
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/ocp/sriov/tests/reinstallation.go`:
- Around line 298-306: The loop over csvList currently returns on the first CSV
with "sriov" in its name, which can cause false negatives; instead iterate all
matching CSVs: for each csv where strings.Contains(csv.Definition.Name, "sriov")
call csv.IsSuccessful(), if any call returns (true, nil) immediately return
true, otherwise continue; after the loop return false if no matching successful
CSV was found (and propagate or log any non-nil errors appropriately). Ensure
you update the logic surrounding csvList and csv.IsSuccessful() to avoid
returning on the first match and only return success when at least one matching
CSV is successful.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f2a8db85-d07f-412b-bb4b-8e88aa5d6369

📥 Commits

Reviewing files that changed from the base of the PR and between 0f5db66 and 5bef51c.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/reinstallation.go

Comment thread tests/ocp/sriov/tests/reinstallation.go
Two improvements to removeSriovOperator():

1. Move CleanAllPoolConfigs before sriovOperatorConfig.Delete() so the
   operator controller is guaranteed to still be running and can process
   the finalizer removal on SriovNetworkPoolConfig objects. Previously
   it ran after webhook validation (operator already shutting down),
   which could leave the finalizer unprocessed and the namespace stuck
   in Terminating indefinitely.

2. Add an Eventually wait after CleanAllPoolConfigs to confirm pool
   configs are fully gone (finalizers processed) before namespace
   deletion begins, preventing a race between finalizer removal and
   the namespace termination deadline.

Confirmed by cluster investigation: sriovnetworkpoolconfig-offload
was stuck in Terminating for 14h with finalizer still present after
the April 8 test run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/ocp/sriov/tests/reinstallation.go (1)

306-323: ⚠️ Potential issue | 🟠 Major

CSV readiness check still short-circuits on first match (flaky).

The loop returns immediately on the first CSV whose name contains "sriov". If that first match is not yet succeeded, this can fail even when another SR-IOV CSV is already succeeded.

💡 Proposed fix
 	Eventually(func() bool {
 		csvList, err := olm.ListClusterServiceVersion(APIClient, sriovNamespace.Definition.Name)
 		if err != nil || len(csvList) == 0 {
 			return false
 		}
 
 		for _, csv := range csvList {
-			if !strings.Contains(csv.Definition.Name, "sriov") {
+			if !strings.Contains(csv.Definition.Name, "sriov-network-operator") {
 				continue
 			}
 
 			succeeded, err := csv.IsSuccessful()
-
-			return err == nil && succeeded
+			if err == nil && succeeded {
+				return true
+			}
 		}
 
 		return false
 	}, 5*time.Minute, tsparams.RetryInterval).Should(BeTrue(),
 		"SR-IOV operator CSV did not reach Succeeded phase within timeout")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ocp/sriov/tests/reinstallation.go` around lines 306 - 323, The
readiness check inside the Eventually block short-circuits by returning from
within the for loop on the first CSV match; change the logic in the loop that
iterates csvList so it does not return immediately: for each csv where
strings.Contains(csv.Definition.Name, "sriov"), call csv.IsSuccessful() and if
err == nil && succeeded then return true from the Eventually predicate,
otherwise continue checking remaining CSVs; only after the loop return false.
This touches the Eventually(...) predicate that calls
olm.ListClusterServiceVersion, iterates csvList, checks csv.Definition.Name and
calls csv.IsSuccessful().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/ocp/sriov/tests/reinstallation.go`:
- Around line 306-323: The readiness check inside the Eventually block
short-circuits by returning from within the for loop on the first CSV match;
change the logic in the loop that iterates csvList so it does not return
immediately: for each csv where strings.Contains(csv.Definition.Name, "sriov"),
call csv.IsSuccessful() and if err == nil && succeeded then return true from the
Eventually predicate, otherwise continue checking remaining CSVs; only after the
loop return false. This touches the Eventually(...) predicate that calls
olm.ListClusterServiceVersion, iterates csvList, checks csv.Definition.Name and
calls csv.IsSuccessful().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d4a5a6d-b532-495f-8680-b2814d95a7c8

📥 Commits

Reviewing files that changed from the base of the PR and between 5bef51c and 5021369.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/reinstallation.go

zhiqiangf and others added 2 commits April 10, 2026 10:14
Wait for NetworkAttachmentDefinition to exist after creating SriovNetwork
before proceeding to pod creation, fixing a race where the operator creates
the NAD asynchronously and the webhook denies pod creation.

Preserve the StartingCSV from the original subscription when reinstalling
the operator, ensuring OLM pins to the same (working) CSV version rather
than resolving the channel head which may not be pullable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The loop returned immediately on the first SR-IOV CSV match, even when
IsSuccessful() returned false, preventing subsequent CSVs from being
checked. Change to return true only on success and continue iterating
otherwise, matching CodeRabbit's review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants