tests/ocp/sriov: fix Mellanox CX6-DX switchdev and netdev-to-vfiopci test failures#1302
Conversation
…test failures Two bugs fixed for Mellanox ConnectX-6 DX (vendor 15b3) hardware running in switchdev mode under OpenShift 4.19+: 1. sriovenv.go - verifySpoofCheck: replace MAC-based VF lookup with VF-line pattern matching. In switchdev mode, ip link show <PF> reports all VF MACs as 00:00:00:00:00:00 rather than the actual pod MAC, causing the spoof check verification to always fail. The fix scans any VF line for the expected spoof checking state instead of searching by MAC address. Also force eSwitchMode=legacy on SR-IOV policies to keep the test environment in a predictable state. 2. metricsExporter.go - runMetricsNettoVfioTests: add vendor-aware ICMP assertion for the Netdevice-to-Vfiopci test scenario. On Mellanox NICs, the vfiopci role uses netdevice+RDMA instead of vfio-pci, so the kernel network stack stays active and ICMP succeeds. On Intel NICs (true vfio-pci), the kernel has no VF access and ICMP fails. The fix asserts success for Mellanox (devID == MlxVendorID) and failure otherwise. Tested on a cluster with wsfd-advnetlab244 (Mellanox CX6-DX, 15b3:101d): all three Netdevice-to-Vfiopci cases (Same PF, Different PF, Different Worker) now pass. Made-with: Cursor
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughSR-IOV test code updated: VF init now forces eswitch mode to "legacy" after NicSelector filtering; spoof-check verification no longer uses pod MAC and now parses node Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~30 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)Command failed Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
e4bb270 to
6f8e366
Compare
|
|
||
| // For Mellanox NICs, "vfiopci" mode uses netdevice+RDMA instead of vfio-pci (see defineMetricsPolicy). | ||
| // With netdevice+RDMA, the kernel network stack remains active on the VF alongside the DPDK mlx5 PMD, | ||
| // so the kernel still responds to ICMP. With true vfio-pci (Intel), the VF is exclusively owned by | ||
| // DPDK, the kernel has no access, and ICMP fails because testpmd does not respond to it. | ||
| if devID == tsparams.MlxVendorID { | ||
| Eventually(func() error { | ||
| return sriovocpenv.ICMPConnectivityCheck(cPod, []string{tsparams.ServerIPv4IPAddress}, "net1") | ||
| }, 1*time.Minute, 2*time.Second).ShouldNot(HaveOccurred(), | ||
| "ICMP connectivity check failed for Mellanox netdevice+RDMA server") | ||
| } else { | ||
| Eventually(func() error { | ||
| return sriovocpenv.ICMPConnectivityCheck(cPod, []string{tsparams.ServerIPv4IPAddress}, "net1") | ||
| }, 1*time.Minute, 2*time.Second).Should(HaveOccurred(), "ICMP fail scenario could not be executed") | ||
| } |
There was a problem hiding this comment.
I recommend against skipping the traffic test for the Intel device. We should ensure compatibility across both vendors; the defineMetricsPolicy() function should be responsible for generating the correct policy for each.
Address evgenLevin's review comment: instead of re-checking the vendor ID in runMetricsNettoVfioTests, derive the expected ICMP outcome from the device type that defineMetricsPolicy() actually configured on the server policy. This keeps defineMetricsPolicy() as the single source of truth for vendor-specific SR-IOV policy configuration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/ocp/sriov/tests/metricsExporter.go`:
- Line 313: The build fails because the code accesses a non-existent field
serverResources.sriovPolicy on the metricsTestResource; replace that invalid
field access with the correct field name serverResources.policy (e.g., change
the condition if serverResources.sriovPolicy.Definition.Spec.DeviceType ==
"vfio-pci" to if serverResources.policy.Definition.Spec.DeviceType ==
"vfio-pci"), and keep any necessary nil checks around serverResources.policy and
its Definition before accessing Spec.DeviceType to avoid panics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d0ccb113-aff8-4dfd-8842-0be95588262d
📒 Files selected for processing (1)
tests/ocp/sriov/tests/metricsExporter.go
…reation Create all SriovNetworkNodePolicy resources before creating SriovNetwork resources. The previous interleaved loop (policy1 → network1 → NAD wait → policy2) introduced a ~2s gap that caused the SR-IOV daemon to process policies in separate reconcile generations. The first generation reported "Succeeded" with only one device plugin resource registered, and WaitForSriovStable returned prematurely. The DPDK server pod then failed with a 2-minute timeout waiting for the missing VF resource. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/ocp/sriov/tests/metricsExporter.go (1)
313-313:⚠️ Potential issue | 🔴 CriticalFix invalid field access causing build failure.
The field is named
policy, notsriovPolicy. This breaks the build as confirmed by pipeline failures.Proposed fix
- if serverResources.sriovPolicy.Definition.Spec.DeviceType == "vfio-pci" { + if serverResources.policy.Definition.Spec.DeviceType == "vfio-pci" {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/ocp/sriov/tests/metricsExporter.go` at line 313, The code incorrectly accesses serverResources.sriovPolicy but the struct field is named policy; update the conditional to use serverResources.policy.Definition.Spec.DeviceType instead of serverResources.sriovPolicy.Definition.Spec.DeviceType so the build can compile; locate the check in metricsExporter.go (the if that compares DeviceType to "vfio-pci") and replace the field reference accordingly, then run tests/build to verify.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/ocp/sriov/tests/metricsExporter.go`:
- Line 313: The code incorrectly accesses serverResources.sriovPolicy but the
struct field is named policy; update the conditional to use
serverResources.policy.Definition.Spec.DeviceType instead of
serverResources.sriovPolicy.Definition.Spec.DeviceType so the build can compile;
locate the check in metricsExporter.go (the if that compares DeviceType to
"vfio-pci") and replace the field reference accordingly, then run tests/build to
verify.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2ead5ff7-4dde-4730-b4f0-97d8fcace1d5
📒 Files selected for processing (1)
tests/ocp/sriov/tests/metricsExporter.go
Replace hardcoded numVfs=5 with SriovOcpConfig.GetVFNum() which reads from the ECO_OCP_SRIOV_VF_NUM environment variable. The hardcoded value mismatched Mellanox CX6-DX firmware NUM_OF_VFS=6, causing mstconfig to attempt a firmware change that requires a cold PCIe boot. Since warm reboot cannot apply the change, the daemon retried indefinitely, exceeding the 35-minute WaitForSriovStable timeout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/ocp/sriov/tests/exposemtu.go (1)
89-114:⚠️ Potential issue | 🟠 MajorGuard required: this case assumes at least 4 VFs.
With dynamic
vfNum, this test can now fail on valid clusters configured with fewer than 4 VFs because it hardcodes VF ranges#0-1and#2-3. Add an explicit precondition and skip early whenvfNum < 4.💡 Suggested fix
It("netdev 2 Policies with different MTU", reportxml.ID("73788"), func() { + if vfNum < 4 { + Skip(fmt.Sprintf("Skipping test - requires at least 4 VFs, configured: %d", vfNum)) + } + By("Creating 2 SR-IOV policies with 5000 and 9000 MTU for the same interface")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/ocp/sriov/tests/exposemtu.go` around lines 89 - 114, This test assumes at least 4 VFs but uses the dynamic variable vfNum, so add an explicit precondition at the start of the It block to skip the test when vfNum < 4; locate the It block (the test titled "netdev 2 Policies with different MTU") and insert a guard like `if vfNum < 4 { Skip("requires at least 4 VFs") }` before creating policies so the subsequent NewPolicyBuilder calls (sriov.NewPolicyBuilder(...).WithDevType(...).WithMTU(...).Create()) won't run on clusters with fewer than 4 VFs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/ocp/sriov/tests/exposemtu.go`:
- Around line 89-114: This test assumes at least 4 VFs but uses the dynamic
variable vfNum, so add an explicit precondition at the start of the It block to
skip the test when vfNum < 4; locate the It block (the test titled "netdev 2
Policies with different MTU") and insert a guard like `if vfNum < 4 {
Skip("requires at least 4 VFs") }` before creating policies so the subsequent
NewPolicyBuilder calls
(sriov.NewPolicyBuilder(...).WithDevType(...).WithMTU(...).Create()) won't run
on clusters with fewer than 4 VFs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eefe19bf-5b23-4741-bccf-d6e6141e9a03
📒 Files selected for processing (1)
tests/ocp/sriov/tests/exposemtu.go
…heck Use the correct struct field name 'policy' instead of 'sriovPolicy' on the metricsTestResource struct. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/ocp/sriov/tests/metricsExporter.go (1)
500-525: Redundant NAD-wait after the reorder.Lines 507-510 already call
sriovenv.WaitForNADCreationper network inside the creation loop, and then lines 517-525 run a secondEventually … nad.Pullloop over the same networks. After this PR's restructuring the two blocks do the same thing; you can drop one to simplify the flow. KeepingWaitForNADCreation(the shared helper) is usually preferable.♻️ Proposed cleanup
err := sriovoperator.WaitForSriovAndMCPStable(APIClient, tsparams.MCOWaitTimeout, tsparams.DefaultStableDuration, SriovOcpConfig.MCPLabel, SriovOcpConfig.OcpSriovOperatorNamespace) Expect(err).ToNot(HaveOccurred(), "Failed cluster is not stable before creating test resources") - By("Wait for NAD Creation") - - for _, res := range []metricsTestResource{cRes, sRes} { - Eventually(func() error { - _, err = nad.Pull(APIClient, res.network.Object.Name, tsparams.TestNamespaceName) - - return err - }, 10*time.Second, 1*time.Second).Should(BeNil(), "Failed to pull NAD created by SriovNetwork") - } - By(fmt.Sprintf("Creating %s Pod", cRes.pod.Definition.Name))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/ocp/sriov/tests/metricsExporter.go` around lines 500 - 525, The test contains a redundant NAD-wait: you already call sriovenv.WaitForNADCreation for each network inside the creation loop (metricsTestResource items cRes and sRes), so remove the subsequent "By(\"Wait for NAD Creation\")" block that loops over the same resources and calls Eventually with nad.Pull; keep the existing sriovenv.WaitForNADCreation calls and delete the later Eventually/ nad.Pull loop to avoid duplicate waits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/ocp/sriov/tests/metricsExporter.go`:
- Around line 500-525: The test contains a redundant NAD-wait: you already call
sriovenv.WaitForNADCreation for each network inside the creation loop
(metricsTestResource items cRes and sRes), so remove the subsequent "By(\"Wait
for NAD Creation\")" block that loops over the same resources and calls
Eventually with nad.Pull; keep the existing sriovenv.WaitForNADCreation calls
and delete the later Eventually/ nad.Pull loop to avoid duplicate waits.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9c676a19-eac1-4b19-97c1-7626dd00ee11
📒 Files selected for processing (1)
tests/ocp/sriov/tests/metricsExporter.go
Summary
Fixes two test failures observed on Mellanox ConnectX-6 DX (vendor
15b3) hardware running in switchdev mode under OpenShift 4.19+:sriovenv.go—verifySpoofCheck: In switchdev mode,ip link show <PF>reports all VF MACs as00:00:00:00:00:00instead of the actual pod MAC, so the previous MAC-based VF lookup always failed. The fix scans VF lines by the"vf N"prefix pattern instead, and checks any VF line for the expected spoof checking state (since the SR-IOV policy applies a uniform configuration to all VFs). Also setseSwitchMode=legacyon SR-IOV policies to keep the test environment predictable.metricsExporter.go—runMetricsNettoVfioTests: The Netdevice-to-Vfiopci test scenario was asserting that ICMP fails (expecting the VF to be exclusively owned by DPDK/testpmd). However, on Mellanox NICs the"vfiopci"role usesnetdevice+RDMAinstead ofvfio-pci(seedefineMetricsPolicy), so the kernel network stack remains active on the VF and ICMP succeeds. The fix adds a vendor-aware assertion: expect ICMP success for Mellanox (devID == MlxVendorID) and ICMP failure for Intel (truevfio-pci).Test plan
--focus="Netdevice to Vfiopci"with--label-filter="sriovmetrics"againsttests/ocp/sriovon a cluster withwsfd-advnetlab244(Mellanox CX6-DX,15b3:101d, interfaceens6f0np0)eSwitchMode=legacy+ VF-line matching changeMade with Cursor
Summary by CodeRabbit