
tests/ocp/sriov: fix Mellanox CX6-DX switchdev and netdev-to-vfiopci test failures #1302

Open
zhiqiangf wants to merge 5 commits into rh-ecosystem-edge:main from zhiqiangf:fix-sriov-mellanox-switchdev-and-nettoVfiopci

Contributor

@zhiqiangf zhiqiangf commented Mar 27, 2026

Summary

Fixes two test failures observed on Mellanox ConnectX-6 DX (vendor 15b3) hardware running in switchdev mode under OpenShift 4.19+:

  • sriovenv.go, verifySpoofCheck: In switchdev mode, ip link show <PF> reports every VF MAC as 00:00:00:00:00:00 instead of the actual pod MAC, so the previous MAC-based VF lookup always failed. The fix matches VF lines by their "vf N" prefix instead and accepts the expected spoof checking state on any VF line, since the SR-IOV policy applies a uniform configuration to all VFs. It also sets eSwitchMode=legacy on SR-IOV policies to keep the test environment predictable.

  • metricsExporter.go, runMetricsNettoVfioTests: The Netdevice-to-Vfiopci scenario asserted that ICMP fails, on the assumption that the VF is exclusively owned by DPDK/testpmd. On Mellanox NICs, however, the "vfiopci" role uses netdevice+RDMA instead of vfio-pci (see defineMetricsPolicy), so the kernel network stack remains active on the VF and ICMP succeeds. The fix adds a vendor-aware assertion: expect ICMP success for Mellanox (devID == MlxVendorID) and ICMP failure for Intel (true vfio-pci).
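
For reviewers who want to see the VF-line matching idea in isolation, here is a minimal sketch. It is not the code in sriovenv.go: the function name checkSpoofState, the error values, and the sample output in main are all hypothetical, and real ip link output carries more fields per VF line.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// vfLinePattern matches per-VF lines such as "    vf 0     link/ether ...".
var vfLinePattern = regexp.MustCompile(`^\s*vf\s+\d+\b`)

// Hypothetical error values, used only by this sketch.
var (
	errNoVFLines          = errors.New("no VF lines found")
	errSpoofStateNotFound = errors.New("spoof check state not found")
)

// checkSpoofState scans `ip link show <PF>` output and succeeds as soon as
// any VF line reports the expected spoof checking state ("on" or "off").
// It never looks at the MAC address, which is all zeros in switchdev mode.
func checkSpoofState(ipLinkOutput, state string) error {
	vfLines := 0

	for _, line := range strings.Split(ipLinkOutput, "\n") {
		if !vfLinePattern.MatchString(line) {
			continue
		}

		vfLines++

		if strings.Contains(line, "spoof checking "+state) ||
			strings.Contains(line, "spoofchk "+state) {
			return nil
		}
	}

	if vfLines == 0 {
		return errNoVFLines
	}

	return errSpoofStateNotFound
}

func main() {
	// Hypothetical, shortened sample of `ip link show <PF>` output.
	sample := "5: ens6f0np0: <BROADCAST,MULTICAST,UP>\n" +
		"    vf 0     link/ether 00:00:00:00:00:00, spoof checking off, trust off\n" +
		"    vf 1     link/ether 00:00:00:00:00:00, spoof checking off, trust off\n"

	fmt.Println("expect off:", checkSpoofState(sample, "off"))
	fmt.Println("expect on: ", checkSpoofState(sample, "on"))
}
```

Accepting a match on any VF line is only sound because the policy configures all VFs uniformly; a per-VF policy would need the VF index instead.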

Test plan

  • Ran --focus="Netdevice to Vfiopci" with --label-filter="sriovmetrics" against tests/ocp/sriov on a cluster with wsfd-advnetlab244 (Mellanox CX6-DX, 15b3:101d, interface ens6f0np0)
  • All three Netdevice-to-Vfiopci cases passed: Same PF, Different PF, Different Worker
  • Basic tests (spoof check, trust, VLAN, link state, MTU, DPDK) continued to pass with the eSwitchMode=legacy + VF-line matching change

Made with Cursor

Summary by CodeRabbit

  • Tests
    • Spoof-check verification no longer relies on pod MACs; it detects VF presence and spoof state per interface and returns clearer, interface-scoped errors and logs.
    • Metrics/ICMP expectations now vary by device type (failure for vfio-pci, success otherwise).
    • Expose‑MTU tests now load and use the actual VF count instead of a hardcoded value.
    • ESwitch mode is explicitly set to "legacy" when initializing selected VFs.
  • Chores
    • Test setup now creates SR‑IOV policies before networks and includes resource names in logs.

…test failures

Two bugs fixed for Mellanox ConnectX-6 DX (vendor 15b3) hardware running
in switchdev mode under OpenShift 4.19+:

1. sriovenv.go - verifySpoofCheck: replace MAC-based VF lookup with VF-line
   pattern matching. In switchdev mode, ip link show <PF> reports all VF
   MACs as 00:00:00:00:00:00 rather than the actual pod MAC, causing the
   spoof check verification to always fail. The fix scans any VF line for
   the expected spoof checking state instead of searching by MAC address.
   Also force eSwitchMode=legacy on SR-IOV policies to keep the test
   environment in a predictable state.

2. metricsExporter.go - runMetricsNettoVfioTests: add vendor-aware ICMP
   assertion for the Netdevice-to-Vfiopci test scenario. On Mellanox NICs,
   the vfiopci role uses netdevice+RDMA instead of vfio-pci, so the kernel
   network stack stays active and ICMP succeeds. On Intel NICs (true
   vfio-pci), the kernel has no VF access and ICMP fails. The fix asserts
   success for Mellanox (devID == MlxVendorID) and failure otherwise.

Tested on a cluster with wsfd-advnetlab244 (Mellanox CX6-DX, 15b3:101d):
all three Netdevice-to-Vfiopci cases (Same PF, Different PF, Different
Worker) now pass.

Made-with: Cursor
Contributor

coderabbitai Bot commented Mar 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

SR-IOV test code updated: VF init now forces eswitch mode to "legacy" after NicSelector filtering; spoof-check verification no longer uses pod MAC and now parses node ip link VF lines; ICMP connectivity expectations are conditional on the server policy DeviceType ("vfio-pci" vs others); VF count is loaded once and propagated to MTU tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| SR-IOV VF Initialization & Verification: tests/ocp/sriov/internal/sriovenv/sriovenv.go | initVFWithDevType: after applying NicSelector filters (Vendor, DeviceID) the SR-IOV policy spec now sets EswitchMode = "legacy" before policy.Create(). verifySpoofCheck: removed pod MAC extraction and MAC-based matching/messages; parses ip link show <interfaceName> output to count VF lines and validate spoof checking / spoofchk state on any matching VF line. Returns a distinct error when no VF lines are found, otherwise "spoof check not found" scoped to the interface; success logs reference interface name (no MAC). |
| Metrics exporter: test flow & ICMP assertions: tests/ocp/sriov/tests/metricsExporter.go | runMetricsNettoVfioTests: ICMP connectivity assertion is conditional on serverResources.policy.Definition.Spec.DeviceType; expects failure when DeviceType == "vfio-pci", otherwise expects success. createMetricsTestResources: create SR-IOV node policies for both resources first, then create SR-IOV networks for both; By(...) logs updated to include resource names and waits for NAD creation for each network. |
| Expose MTU tests: VF count dynamic loading: tests/ocp/sriov/tests/exposemtu.go | Load VF count in BeforeAll via SriovOcpConfig.GetVFNum() into vfNum with assertion. testExposeMTU now accepts vfsNumber; all callers and SR-IOV policy creations updated to use vfNum instead of hardcoded 5. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly addresses the main fixes in the PR: SR-IOV test failures on Mellanox CX6-DX hardware in switchdev mode, with specific mention of the two key test fixes (switchdev and netdev-to-vfiopci). |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed



@zhiqiangf zhiqiangf force-pushed the fix-sriov-mellanox-switchdev-and-nettoVfiopci branch from e4bb270 to 6f8e366 Compare March 27, 2026 00:22
@zhiqiangf zhiqiangf marked this pull request as draft March 27, 2026 03:10
Comment on lines +307 to +321

// For Mellanox NICs, "vfiopci" mode uses netdevice+RDMA instead of vfio-pci (see defineMetricsPolicy).
// With netdevice+RDMA, the kernel network stack remains active on the VF alongside the DPDK mlx5 PMD,
// so the kernel still responds to ICMP. With true vfio-pci (Intel), the VF is exclusively owned by
// DPDK, the kernel has no access, and ICMP fails because testpmd does not respond to it.
if devID == tsparams.MlxVendorID {
	Eventually(func() error {
		return sriovocpenv.ICMPConnectivityCheck(cPod, []string{tsparams.ServerIPv4IPAddress}, "net1")
	}, 1*time.Minute, 2*time.Second).ShouldNot(HaveOccurred(),
		"ICMP connectivity check failed for Mellanox netdevice+RDMA server")
} else {
	Eventually(func() error {
		return sriovocpenv.ICMPConnectivityCheck(cPod, []string{tsparams.ServerIPv4IPAddress}, "net1")
	}, 1*time.Minute, 2*time.Second).Should(HaveOccurred(), "ICMP fail scenario could not be executed")
}
Collaborator


I recommend against skipping the traffic test for the Intel device. We should ensure compatibility across both vendors; the defineMetricsPolicy() function should be responsible for generating the correct policy for each.

Address evgenLevin's review comment: instead of re-checking the vendor ID
in runMetricsNettoVfioTests, derive the expected ICMP outcome from the
device type that defineMetricsPolicy() actually configured on the server
policy. This keeps defineMetricsPolicy() as the single source of truth
for vendor-specific SR-IOV policy configuration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
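
The approach in the commit message above can be sketched roughly as follows. This is an illustration only: the struct types are simplified stand-ins for the real builder types, and serverDeviceType and expectICMPSuccess are hypothetical helper names that do not appear in the PR. The nil checks follow the reviewer's suggestion to guard the policy and its Definition before reading Spec.DeviceType.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the real policy builder types (assumptions).
type policySpec struct{ DeviceType string }
type policyDefinition struct{ Spec policySpec }
type policyBuilder struct{ Definition *policyDefinition }
type metricsTestResource struct{ policy *policyBuilder }

// serverDeviceType reads the device type that was configured on the server
// policy, guarding against nil pointers before touching Spec.DeviceType.
func serverDeviceType(res metricsTestResource) (string, error) {
	if res.policy == nil || res.policy.Definition == nil {
		return "", errors.New("server policy is not initialized")
	}

	return res.policy.Definition.Spec.DeviceType, nil
}

// expectICMPSuccess reports whether the kernel is expected to answer ICMP:
// with vfio-pci the VF belongs to DPDK only, so ICMP should fail.
func expectICMPSuccess(deviceType string) bool {
	return deviceType != "vfio-pci"
}

func main() {
	for _, devType := range []string{"vfio-pci", "netdevice"} {
		res := metricsTestResource{
			policy: &policyBuilder{
				Definition: &policyDefinition{Spec: policySpec{DeviceType: devType}},
			},
		}

		got, err := serverDeviceType(res)
		if err != nil {
			fmt.Println("error:", err)

			continue
		}

		fmt.Printf("deviceType=%s expectICMPSuccess=%t\n", got, expectICMPSuccess(got))
	}
}
```

Keying the assertion on the configured device type means the test never has to repeat the vendor decision that the policy definition already made.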
@zhiqiangf zhiqiangf marked this pull request as ready for review April 9, 2026 15:03
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/ocp/sriov/tests/metricsExporter.go`:
- Line 313: The build fails because the code accesses a non-existent field
serverResources.sriovPolicy on the metricsTestResource; replace that invalid
field access with the correct field name serverResources.policy (e.g., change
the condition if serverResources.sriovPolicy.Definition.Spec.DeviceType ==
"vfio-pci" to if serverResources.policy.Definition.Spec.DeviceType ==
"vfio-pci"), and keep any necessary nil checks around serverResources.policy and
its Definition before accessing Spec.DeviceType to avoid panics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d0ccb113-aff8-4dfd-8842-0be95588262d

📥 Commits

Reviewing files that changed from the base of the PR and between 6f8e366 and 054e4e1.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/metricsExporter.go

Comment thread tests/ocp/sriov/tests/metricsExporter.go Outdated
…reation

Create all SriovNetworkNodePolicy resources before creating SriovNetwork
resources. The previous interleaved loop (policy1 → network1 → NAD wait →
policy2) introduced a ~2s gap that caused the SR-IOV daemon to process
policies in separate reconcile generations. The first generation reported
"Succeeded" with only one device plugin resource registered, and
WaitForSriovStable returned prematurely. The DPDK server pod then failed
with a 2-minute timeout waiting for the missing VF resource.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
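
The reordering described in the commit message above amounts to two passes over the resources instead of one interleaved loop. A minimal sketch, assuming a hypothetical planCreation helper that only records the order of operations rather than calling the cluster:

```go
package main

import "fmt"

// step records one creation action; kind is "policy" or "network".
type step struct {
	kind string
	name string
}

// planCreation returns the creation order used after the fix: every
// SR-IOV node policy first, then every SR-IOV network. Creating policies
// back to back is intended to let the daemon pick them up together.
func planCreation(resources []string) []step {
	steps := make([]step, 0, 2*len(resources))

	for _, name := range resources {
		steps = append(steps, step{kind: "policy", name: name})
	}

	for _, name := range resources {
		steps = append(steps, step{kind: "network", name: name})
	}

	return steps
}

func main() {
	for _, s := range planCreation([]string{"client", "server"}) {
		fmt.Printf("create %s for %s\n", s.kind, s.name)
	}
}
```

The resource names "client" and "server" are placeholders; the real test derives its names from the metrics test resources.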
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
tests/ocp/sriov/tests/metricsExporter.go (1)

313-313: ⚠️ Potential issue | 🔴 Critical

Fix invalid field access causing build failure.

The field is named policy, not sriovPolicy. This breaks the build as confirmed by pipeline failures.

Proposed fix
-	if serverResources.sriovPolicy.Definition.Spec.DeviceType == "vfio-pci" {
+	if serverResources.policy.Definition.Spec.DeviceType == "vfio-pci" {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ocp/sriov/tests/metricsExporter.go` at line 313, the code incorrectly
accesses serverResources.sriovPolicy but the struct field is named policy;
update the conditional to use serverResources.policy.Definition.Spec.DeviceType
instead of serverResources.sriovPolicy.Definition.Spec.DeviceType so the build
can compile; locate the check in metricsExporter.go (the if that compares
DeviceType to "vfio-pci") and replace the field reference accordingly, then run
tests/build to verify.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2ead5ff7-4dde-4730-b4f0-97d8fcace1d5

📥 Commits

Reviewing files that changed from the base of the PR and between 054e4e1 and b498490.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/metricsExporter.go

Replace hardcoded numVfs=5 with SriovOcpConfig.GetVFNum() which reads
from the ECO_OCP_SRIOV_VF_NUM environment variable. The hardcoded value
mismatched Mellanox CX6-DX firmware NUM_OF_VFS=6, causing mstconfig to
attempt a firmware change that requires a cold PCIe boot. Since warm
reboot cannot apply the change, the daemon retried indefinitely,
exceeding the 35-minute WaitForSriovStable timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
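
To make the configuration path in the commit message above concrete, here is a hedged sketch of reading a VF count from ECO_OCP_SRIOV_VF_NUM. It does not reproduce SriovOcpConfig.GetVFNum(), whose actual validation rules are not shown in this PR; the helper name vfNumFrom and its error handling are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// vfNumEnvVar is the variable named in the commit message.
const vfNumEnvVar = "ECO_OCP_SRIOV_VF_NUM"

// vfNumFrom parses the VF count using the supplied lookup function, which
// has the same shape as os.LookupEnv so tests can inject fake values.
// Rejecting unset, non-numeric, and non-positive values is this sketch's
// choice, not necessarily what the real configuration code does.
func vfNumFrom(lookup func(string) (string, bool)) (int, error) {
	raw, ok := lookup(vfNumEnvVar)
	if !ok {
		return 0, fmt.Errorf("%s is not set", vfNumEnvVar)
	}

	vfNum, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s=%q is not an integer: %w", vfNumEnvVar, raw, err)
	}

	if vfNum <= 0 {
		return 0, fmt.Errorf("%s must be positive, got %d", vfNumEnvVar, vfNum)
	}

	return vfNum, nil
}

func main() {
	vfNum, err := vfNumFrom(os.LookupEnv)
	if err != nil {
		fmt.Println("error:", err)

		return
	}

	fmt.Println("VF count:", vfNum)
}
```

Setting the variable to match the firmware's NUM_OF_VFS avoids asking for a VF count that would require a firmware change.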
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/ocp/sriov/tests/exposemtu.go (1)

89-114: ⚠️ Potential issue | 🟠 Major

Guard required: this case assumes at least 4 VFs.

With dynamic vfNum, this test can now fail on valid clusters configured with fewer than 4 VFs because it hardcodes VF ranges #0-1 and #2-3. Add an explicit precondition and skip early when vfNum < 4.

💡 Suggested fix
 It("netdev 2 Policies with different MTU", reportxml.ID("73788"), func() {
+	if vfNum < 4 {
+		Skip(fmt.Sprintf("Skipping test - requires at least 4 VFs, configured: %d", vfNum))
+	}
+
 	By("Creating 2 SR-IOV policies with 5000 and 9000 MTU for the same interface")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ocp/sriov/tests/exposemtu.go` around lines 89-114, this test assumes
at least 4 VFs but uses the dynamic variable vfNum, so add an explicit
precondition at the start of the It block to skip the test when vfNum < 4;
locate the It block (the test titled "netdev 2 Policies with different MTU") and
insert a guard like `if vfNum < 4 { Skip("requires at least 4 VFs") }` before
creating policies so the subsequent NewPolicyBuilder calls
(sriov.NewPolicyBuilder(...).WithDevType(...).WithMTU(...).Create()) won't run
on clusters with fewer than 4 VFs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eefe19bf-5b23-4741-bccf-d6e6141e9a03

📥 Commits

Reviewing files that changed from the base of the PR and between b498490 and 7a0e1fe.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/exposemtu.go

…heck

Use the correct struct field name 'policy' instead of 'sriovPolicy' on
the metricsTestResource struct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/ocp/sriov/tests/metricsExporter.go (1)

500-525: Redundant NAD-wait after the reorder.

Lines 507-510 already call sriovenv.WaitForNADCreation per network inside the creation loop, and then lines 517-525 run a second Eventually … nad.Pull loop over the same networks. After this PR's restructuring the two blocks do the same thing; you can drop one to simplify the flow. Keeping WaitForNADCreation (the shared helper) is usually preferable.

♻️ Proposed cleanup
 	err := sriovoperator.WaitForSriovAndMCPStable(APIClient, tsparams.MCOWaitTimeout, tsparams.DefaultStableDuration,
 		SriovOcpConfig.MCPLabel, SriovOcpConfig.OcpSriovOperatorNamespace)
 	Expect(err).ToNot(HaveOccurred(), "Failed cluster is not stable before creating test resources")
 
-	By("Wait for NAD Creation")
-
-	for _, res := range []metricsTestResource{cRes, sRes} {
-		Eventually(func() error {
-			_, err = nad.Pull(APIClient, res.network.Object.Name, tsparams.TestNamespaceName)
-
-			return err
-		}, 10*time.Second, 1*time.Second).Should(BeNil(), "Failed to pull NAD created by SriovNetwork")
-	}
-
 	By(fmt.Sprintf("Creating %s Pod", cRes.pod.Definition.Name))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ocp/sriov/tests/metricsExporter.go` around lines 500-525, the test
contains a redundant NAD-wait: you already call sriovenv.WaitForNADCreation for
each network inside the creation loop (metricsTestResource items cRes and sRes),
so remove the subsequent "By(\"Wait for NAD Creation\")" block that loops over
the same resources and calls Eventually with nad.Pull; keep the existing
sriovenv.WaitForNADCreation calls and delete the later Eventually/ nad.Pull loop
to avoid duplicate waits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9c676a19-eac1-4b19-97c1-7626dd00ee11

📥 Commits

Reviewing files that changed from the base of the PR and between 7a0e1fe and 1b54f2a.

📒 Files selected for processing (1)
  • tests/ocp/sriov/tests/metricsExporter.go
