tests/ocp/sriov: fix SR-IOV operator reinstallation test failures#1303
tests/ocp/sriov: fix SR-IOV operator reinstallation test failures#1303zhiqiangf wants to merge 6 commits intorh-ecosystem-edge:mainfrom
Conversation
Three bugs fixed in reinstallation.go: 1. removeSriovOperator(): clean SriovNetworkPoolConfigs before namespace deletion. SriovNetworkPoolConfigs carry a finalizer (poolconfig.finalizers.sriovnetwork.openshift) set by the operator controller. If the namespace deletion starts while these resources still exist, the controller responsible for removing the finalizer is also being deleted, creating a deadlock that leaves the namespace stuck in Terminating indefinitely. Fix: call CleanAllPoolConfigs() while the operator is still running, before DeleteAndWait(). 2. installSriovOperator(): wait for operator CSV to reach Succeeded phase before creating SriovOperatorConfig. Previously the SriovOperatorConfig was applied immediately after the Subscription, before the operator controller was up. This mirrors the deploy_cluster.sh create_default_soc step which polls for the CSV before creating the default config. 3. Test 46532: increase IsSriovDeployed Eventually timeout from 1 minute to 5 minutes. The daemonsets take 60-90+ seconds to come up after SriovOperatorConfig is created, so 1 minute was frequently too short. Made-with: Cursor
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdditions to the SR-IOV operator reinstallation test: wait for a NetworkAttachmentDefinition after creating an SR-IOV Network, longer operator readiness polling, explicit pool-config cleanup and polling before deleting operator config, optional Subscription StartingCSV handling, and a CSV readiness gating step before creating the default SriovOperatorConfig. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)Command failed Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/ocp/sriov/tests/reinstallation.go (1)
291-300: Consider filtering CSV by name for robustness.Using
csvList[0]assumes only one CSV exists in the namespace. While this is typically true for a freshly created SR-IOV operator namespace, during upgrade scenarios or if stale CSVs exist, the list order is not guaranteed to return the expected operator CSV.Consider filtering by the expected CSV name (e.g., matching the operator package name from the subscription) for more deterministic behavior.
♻️ Suggested improvement
Eventually(func() bool { csvList, err := olm.ListClusterServiceVersion(APIClient, sriovNamespace.Definition.Name) if err != nil || len(csvList) == 0 { return false } - succeeded, err := csvList[0].IsSuccessful() - - return err == nil && succeeded + for _, csv := range csvList { + // Match CSV belonging to the SR-IOV operator package + if csv.Definition.Spec.DisplayName == "" || + !strings.Contains(csv.Definition.Name, "sriov") { + continue + } + succeeded, err := csv.IsSuccessful() + if err == nil && succeeded { + return true + } + } + return false }, 5*time.Minute, tsparams.RetryInterval).Should(BeTrue(), "SR-IOV operator CSV did not reach Succeeded phase within timeout")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/ocp/sriov/tests/reinstallation.go` around lines 291 - 300, The code assumes csvList[0] is the operator CSV; instead, filter the results from olm.ListClusterServiceVersion(APIClient, sriovNamespace.Definition.Name) for the expected CSV by name before checking IsSuccessful(). Locate the Subscription for the SR-IOV operator (or derive the expected CSV name from the subscription/package) and then iterate csvList to find the item whose Name (or CSV.Name) matches that expected name; call IsSuccessful() only on that matched CSV and return false if no match is found.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/ocp/sriov/tests/reinstallation.go`:
- Around line 147-150: The Eventually block containing
sriovoperator.IsSriovDeployed is misformatted; run gofmt to fix indentation and
alignment (e.g., run `gofmt -w tests/ocp/sriov/tests/reinstallation.go` or `go
fmt ./...`) so the call to Eventually(...).WithArguments(APIClient,
SriovOcpConfig.OcpSriovOperatorNamespace).ShouldNot(HaveOccurred(), ...) is
properly indented and matches surrounding code style.
---
Nitpick comments:
In `@tests/ocp/sriov/tests/reinstallation.go`:
- Around line 291-300: The code assumes csvList[0] is the operator CSV; instead,
filter the results from olm.ListClusterServiceVersion(APIClient,
sriovNamespace.Definition.Name) for the expected CSV by name before checking
IsSuccessful(). Locate the Subscription for the SR-IOV operator (or derive the
expected CSV name from the subscription/package) and then iterate csvList to
find the item whose Name (or CSV.Name) matches that expected name; call
IsSuccessful() only on that matched CSV and return false if no match is found.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 229ccf35-4ed2-446d-9f59-96bbbdd4be69
📒 Files selected for processing (1)
tests/ocp/sriov/tests/reinstallation.go
Fix indentation of the Eventually block in test 46532 to match gofmt requirements and unblock CI. Made-with: Cursor
| By("Removing SR-IOV network pool configs") | ||
|
|
||
| // SriovNetworkPoolConfigs carry a finalizer set by the operator controller. | ||
| // They must be deleted while the operator is still running; otherwise the | ||
| // finalizer can never be removed and the namespace gets stuck in Terminating. | ||
| err = sriov.CleanAllPoolConfigs(APIClient, SriovOcpConfig.SriovOperatorNamespace) | ||
| Expect(err).ToNot(HaveOccurred(), "Failed to clean SR-IOV network pool configs") | ||
|
|
There was a problem hiding this comment.
Why do we have the SR-IOV Pool configs on a cluster?
|
I’m not entirely sure why SriovPoolConfigs are present on this cluster, but since they are already there, the changes LGTM. |
Address CodeRabbit review comment: using csvList[0] is fragile if stale or unrelated CSVs exist in the namespace. Filter by CSV name containing "sriov" to deterministically match the SR-IOV operator CSV. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/ocp/sriov/tests/reinstallation.go`:
- Around line 298-306: The loop over csvList currently returns on the first CSV
with "sriov" in its name, which can cause false negatives; instead iterate all
matching CSVs: for each csv where strings.Contains(csv.Definition.Name, "sriov")
call csv.IsSuccessful(), if any call returns (true, nil) immediately return
true, otherwise continue; after the loop return false if no matching successful
CSV was found (and propagate or log any non-nil errors appropriately). Ensure
you update the logic surrounding csvList and csv.IsSuccessful() to avoid
returning on the first match and only return success when at least one matching
CSV is successful.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f2a8db85-d07f-412b-bb4b-8e88aa5d6369
📒 Files selected for processing (1)
tests/ocp/sriov/tests/reinstallation.go
Two improvements to removeSriovOperator(): 1. Move CleanAllPoolConfigs before sriovOperatorConfig.Delete() so the operator controller is guaranteed to still be running and can process the finalizer removal on SriovNetworkPoolConfig objects. Previously it ran after webhook validation (operator already shutting down), which could leave the finalizer unprocessed and the namespace stuck in Terminating indefinitely. 2. Add an Eventually wait after CleanAllPoolConfigs to confirm pool configs are fully gone (finalizers processed) before namespace deletion begins, preventing a race between finalizer removal and the namespace termination deadline. Confirmed by cluster investigation: sriovnetworkpoolconfig-offload was stuck in Terminating for 14h with finalizer still present after the April 8 test run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/ocp/sriov/tests/reinstallation.go (1)
306-323:⚠️ Potential issue | 🟠 MajorCSV readiness check still short-circuits on first match (flaky).
The loop returns immediately on the first CSV whose name contains
"sriov". If that first match is not yet succeeded, this can fail even when another SR-IOV CSV is already succeeded.💡 Proposed fix
Eventually(func() bool { csvList, err := olm.ListClusterServiceVersion(APIClient, sriovNamespace.Definition.Name) if err != nil || len(csvList) == 0 { return false } for _, csv := range csvList { - if !strings.Contains(csv.Definition.Name, "sriov") { + if !strings.Contains(csv.Definition.Name, "sriov-network-operator") { continue } succeeded, err := csv.IsSuccessful() - - return err == nil && succeeded + if err == nil && succeeded { + return true + } } return false }, 5*time.Minute, tsparams.RetryInterval).Should(BeTrue(), "SR-IOV operator CSV did not reach Succeeded phase within timeout")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/ocp/sriov/tests/reinstallation.go` around lines 306 - 323, The readiness check inside the Eventually block short-circuits by returning from within the for loop on the first CSV match; change the logic in the loop that iterates csvList so it does not return immediately: for each csv where strings.Contains(csv.Definition.Name, "sriov"), call csv.IsSuccessful() and if err == nil && succeeded then return true from the Eventually predicate, otherwise continue checking remaining CSVs; only after the loop return false. This touches the Eventually(...) predicate that calls olm.ListClusterServiceVersion, iterates csvList, checks csv.Definition.Name and calls csv.IsSuccessful().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/ocp/sriov/tests/reinstallation.go`:
- Around line 306-323: The readiness check inside the Eventually block
short-circuits by returning from within the for loop on the first CSV match;
change the logic in the loop that iterates csvList so it does not return
immediately: for each csv where strings.Contains(csv.Definition.Name, "sriov"),
call csv.IsSuccessful() and if err == nil && succeeded then return true from the
Eventually predicate, otherwise continue checking remaining CSVs; only after the
loop return false. This touches the Eventually(...) predicate that calls
olm.ListClusterServiceVersion, iterates csvList, checks csv.Definition.Name and
calls csv.IsSuccessful().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9d4a5a6d-b532-495f-8680-b2814d95a7c8
📒 Files selected for processing (1)
tests/ocp/sriov/tests/reinstallation.go
Wait for NetworkAttachmentDefinition to exist after creating SriovNetwork before proceeding to pod creation, fixing a race where the operator creates the NAD asynchronously and the webhook denies pod creation. Preserve the StartingCSV from the original subscription when reinstalling the operator, ensuring OLM pins to the same (working) CSV version rather than resolving the channel head which may not be pullable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The loop returned immediately on the first SR-IOV CSV match, even when IsSuccessful() returned false, preventing subsequent CSVs from being checked. Change to return true only on success and continue iterating otherwise, matching CodeRabbit's review feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Fixes three bugs in
reinstallation.gothat caused tests 46530–46532 to fail consistently.Root causes
Test 46530 — namespace stuck
TerminatingremoveSriovOperator()callssriovNamespace.DeleteAndWait()while aSriovNetworkPoolConfig(created by the lab'senable_offload()step) still exists with a finalizerpoolconfig.finalizers.sriovnetwork.openshift. Once namespace deletion starts, the operator controller that would remove the finalizer is also being deleted — a deadlock that leaves the namespace stuck inTerminatingindefinitely.Fix: Add
sriov.CleanAllPoolConfigs()beforeDeleteAndWait(), while the operator is still running and can process the finalizer removal.Tests 46531 — cascades from 46530 — no independent fix needed.
Test 46532 — daemonsets never start after reinstall
installSriovOperator()created theSriovOperatorConfigimmediately after theSubscription, before the operator CSV reachedSucceededand before any controller was running to act on it. As a result the daemonsets (sriov-network-config-daemon,network-resources-injector,operator-webhook) never started.Fix: Add an
Eventuallypoll for CSVSucceededbefore creatingSriovOperatorConfig, mirroring thecreate_default_socstep in the lab'sdeploy_cluster.sh.Test 46532 —
IsSriovDeployedtimeout too shortThe
Eventually(sriovoperator.IsSriovDeployed, time.Minute, ...)check gave only 1 minute. Daemonsets take 60–90+ seconds to come up afterSriovOperatorConfigis created.Fix: Increase timeout from
time.Minuteto5*time.Minute.Test plan
--label-filter="sriovreinstall"againsttests/ocp/sriovon a cluster with the lab'stest-operator-catalogCatalogSource in placeMade with Cursor
Summary by CodeRabbit