Skip to content

system-tests: support IPv6 single stack#1335

Open
laurayang842 wants to merge 5 commits intorh-ecosystem-edge:mainfrom
laurayang842:ipv6
Open

system-tests: support IPv6 single stack#1335
laurayang842 wants to merge 5 commits intorh-ecosystem-edge:mainfrom
laurayang842:ipv6

Conversation

@laurayang842
Copy link
Copy Markdown
Contributor

@laurayang842 laurayang842 commented Apr 16, 2026

  1. Updated tests to support IPv6 single stack.
  2. Updated hard-reboot.go and nmi-redfish.go to address the issue that master nodes reboot fast enough that the NotReady state is not observed. The issue is often observed on Dell R660 servers. The tests now check for a boot ID change as proof of reboot.

Summary by CodeRabbit

  • Tests
    • Connectivity checks now run only for the configured IPv4/IPv6 families (single- or dual-stack); service tests set service IP-family policy accordingly
    • IPVLAN, MACVLAN and egress tests iterate over configured address targets, skip empty addresses and log choices
    • Reboot/NMI tests detect recovery by boot-ID change (accepts fast reboots) in addition to NotReady observation
    • Dual-stack-capable test server and stricter pod-level bond IP/subnet validation added

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

Tests were changed to detect and exercise configured IP families (IPv4/IPv6) dynamically across egress, service, VLAN/MACVLAN, SR‑IOV/pod‑bond, and whereabouts statefulset/service code. The monitoring server was switched to an IPv6 dual‑stack bind. Soft reboot/NMI flows now use BootID change to detect reboot progress.

Changes

Cohort / File(s) Summary
Egress IP & Service
tests/system-tests/rdscore/internal/rdscorecommon/egress-ip.go, tests/system-tests/rdscore/internal/rdscorecommon/egress-service.go
Egress IP checks build a list of configured IP families and iterate across them (fail if none). Egress service tests set Service.spec.ipFamily from configured remote IPv4/IPv6, add preconditions requiring at least one remote IP, and run curl/source-IP verification only for configured families (IPv4 vs bracketed IPv6).
VLAN / MACVLAN Connectivity
tests/system-tests/rdscore/internal/rdscorecommon/ipvlan-validation.go, tests/system-tests/rdscore/internal/rdscorecommon/macvlan-validation.go
Replaced hardcoded IPv4/IPv6 connectivity calls with loops over configured target addresses; assert at least one family configured per deploy set, skip empty addresses (logged), and invoke connectivity checks per non-empty address while preserving prior source/target label pairing.
SR‑IOV / Pod‑level Bonding
tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go
Enforces at least one IP family configured and requires matching subnet masks; builds IPRequest list dynamically for configured families. Traffic target selection now prefers IPv4 then IPv6 and fails if neither is configured.
Hard / Soft Reboot & NMI
tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go, tests/system-tests/rdscore/internal/rdscorecommon/nmi-redfish.go
Pre-reboot capture records NodeInfo.BootID and logs it. Post-reset/wait logic now uses waitForBootIDChange(..., 25*time.Minute) instead of relying solely on NotReady transitions; Ready polling remains.
Whereabouts StatefulSet / Service IP family
tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go
Added determineIPFamilyPolicy to parse NAD spec.config IPAM ranges and detect IPv4/IPv6 presence; headless Service creation now uses detected ipFamilies and ipFamilyPolicy instead of hardcoded dual‑stack.
Monitoring remote‑write server
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go
Embedded Python test server now implements a DualStackTCPServer using AF_INET6 with IPV6_V6ONLY=0 and binds to the IPv6 unspecified address so the server accepts dual‑stack connections.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'system-tests: support IPv6 single stack' accurately reflects the main objective of the pull request, which is to update system tests to support IPv6 single-stack environments alongside the existing IPv4 support.
Docstring Coverage ✅ Passed Docstring coverage is 94.29% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@laurayang842 laurayang842 marked this pull request as ready for review April 16, 2026 02:30
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (5)
tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go (1)

243-244: IPv6 regex may miss addresses starting with ::.

The IPv6 regex pattern requires the address to start with hex digits, so it won't match compressed IPv6 addresses that begin with :: (e.g., ::1/128, ::ffff:192.168.1.1/96). For typical NAD configurations using ULA or link-local addresses, this should be sufficient, but consider extending the pattern if :: prefixed addresses are expected.

♻️ Suggested extended pattern
-	hasIPv6 := regexp.MustCompile(`[0-9a-fA-F]{1,4}(:[0-9a-fA-F]{0,4}){2,}/\d+`).MatchString(config)
+	hasIPv6 := regexp.MustCompile(`([0-9a-fA-F]{0,4}:){2,}[0-9a-fA-F]{0,4}/\d+`).MatchString(config)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go`
around lines 243 - 244, The IPv6 regex assigned to hasIPv6 misses compressed
addresses starting with "::"; update the validation so hasIPv6 recognizes
leading-compressed forms (e.g., ::1/128, ::ffff:192.168.1.1/96) by replacing the
current regex with one that allows an initial "::" compression variant or,
better, switch to parsing using the standard library (e.g., use
net.ParseCIDR/net.ParseIP on the substring) to robustly detect IPv6 CIDRs;
change the logic around the hasIPv6 variable accordingly so both compressed and
full IPv6 addresses are accepted.
tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go (1)

1363-1390: Failover traffic now covers only one family in dual-stack runs.

When both addresses are configured, this always exercises the long-lived failover flow over IPv4. That means the critical “traffic survives VF failover” path is no longer validated for IPv6 in dual-stack environments. Consider running this step once per configured family, or parameterizing the preferred family per spec, so the new IPv6 support is covered during failover too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go`
around lines 1363 - 1390, The test currently picks a single target IP via the
switch (RDSCoreConfig.PodLevelBondDeploymentTwoIPv4 /
PodLevelBondDeploymentTwoIPv6) so dual-stack runs only exercise IPv4 failover;
change the logic in the test around target selection and the call to
generateTCPTraffic so it runs once per configured family instead of choosing
one: iterate over the configured addresses (check
RDSCoreConfig.PodLevelBondDeploymentTwoIPv4 and PodLevelBondDeploymentTwoIPv6),
for each non-empty address set targetIP, log and call
generateTCPTraffic(clientPodObj, targetIP, RDSCoreConfig.PodLevelBondPort, "10",
"5") and assert Expect(err).ToNot(HaveOccurred(...)) per-family; alternatively
add a parameter to control preferred family and run the step for both when both
are present (use the same generateTCPTraffic and logging patterns).
tests/system-tests/rdscore/internal/rdscorecommon/egress-service.go (1)

385-410: Default case creates IPv4 service when no addresses are configured.

When neither remoteTargetIP nor remoteTargetIPv6 is set, the default case still creates an IPv4 SingleStack service. While downstream connectivity tests will fail with a clear assertion, consider failing fast here to provide clearer error messaging.

♻️ Optional: Add explicit validation before IP family switch
 	hasIPv4 := remoteTargetIP != ""
 	hasIPv6 := remoteTargetIPv6 != ""

+	Expect(hasIPv4 || hasIPv6).To(BeTrue(),
+		"At least one remote target IP (IPv4 or IPv6) must be configured")
+
 	switch {
 	case hasIPv4 && hasIPv6:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/egress-service.go` around
lines 385 - 410, The switch's default branch creates an IPv4 SingleStack service
even when neither remoteTargetIP nor remoteTargetIPv6 is provided; update the
logic to validate inputs before the switch (check remoteTargetIP and
remoteTargetIPv6) and fail fast with a clear error/expectation when both are
empty instead of calling svcBuilder.WithIPFamily; locate the decision using the
variables remoteTargetIP and remoteTargetIPv6 and the call sites to
svcBuilder.WithIPFamily (and the switch that sets
corev1.IPFamilyPolicyRequireDualStack / corev1.IPFamilyPolicySingleStack) and
replace the default path with an early return or explicit test failure that
logs/returns a descriptive message about missing target addresses.
tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go (1)

493-499: Consider explicitly enabling dual-stack socket behavior.

The DualStackTCPServer class sets address_family = socket.AF_INET6 and binds to "::", which relies on OS defaults for accepting IPv4 connections via IPv4-mapped IPv6 addresses. On systems with net.ipv6.bindv6only=1 or certain container configurations, this may reject IPv4 connections.

For reliable dual-stack support, explicitly set the IPV6_V6ONLY socket option to False:

♻️ Proposed fix for explicit dual-stack support
 import socket

 class DualStackTCPServer(socketserver.TCPServer):
     address_family = socket.AF_INET6
+
+    def server_bind(self):
+        self.socket.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
+        super().server_bind()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go`
around lines 493 - 499, The DualStackTCPServer currently sets address_family =
socket.AF_INET6 and binds to ("::") but doesn’t explicitly disable IPV6_V6ONLY;
update DualStackTCPServer (the class definition) to override server_bind (or
before binding) and call self.socket.setsockopt(socket.IPPROTO_IPV6,
socket.IPV6_V6ONLY, 0) prior to calling the base server_bind so the socket
accepts IPv4-mapped connections; ensure this logic runs before binding the
("::", PORT) socket used with RemoteWriteHandler.
tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go (1)

304-329: Consider adding a brief stabilization delay or explicit Ready verification when boot ID changes.

The conditional logic is sound—if we detected boot ID change without observing NotReady, the node should already be Ready. However, there's an implicit assumption that the node is fully operational at that moment.

This deviates from the SPK implementation (tests/system-tests/spk/internal/spkcommon/reboot.go:277-300) which unconditionally waits for the Ready state. While the deviation is intentional per the PR description (handling Dell R660 fast reboots), consider:

  1. Adding a brief time.Sleep (e.g., 10-15 seconds) after the boot ID change detection to ensure node stability before uncordoning.
  2. Or, explicitly verifying the node is Ready when boot ID changed (even without the full Eventually timeout).

This would provide an additional safety margin against edge cases where the node reports Ready momentarily but hasn't fully stabilized.

♻️ Optional: Add explicit Ready verification for boot-ID-only detection
     if nodeWentNotReady {
         By(fmt.Sprintf("Checking node %q got into Ready", _node.Definition.Name))
 
         Eventually(func(ctx SpecContext) bool {
             // ... existing Ready wait logic
         }).WithTimeout(25*time.Minute).WithPolling(15*time.Second).WithContext(ctx).Should(BeTrue(),
             "Node hasn't reached Ready state")
+    } else {
+        // Boot ID changed without NotReady observation - node rebooted fast
+        // Brief delay to ensure stability before uncordoning
+        klog.V(rdscoreparams.RDSCoreLogLevel).Infof(
+            "Node %q rebooted fast (boot ID changed), allowing brief stabilization", _node.Definition.Name)
+        time.Sleep(15 * time.Second)
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go` around
lines 304 - 329, The block guarded by nodeWentNotReady currently assumes the
node is stable when a boot-ID change was seen; add a short stabilization step
before proceeding (either a small time.Sleep of ~10-15s or an explicit quick
Ready check) to avoid flapping: locate the branch using nodeWentNotReady and
_node.Definition.Name (the Eventually that pulls currentNode via nodes.Pull and
checks condition.Type == rdscoreparams.ConditionTypeReadyString) and insert a
brief pause or an immediate fast verification loop that asserts condition.Status
== rdscoreparams.ConstantTrueString before uncordoning/continuing so the node
has a small margin to stabilize after boot-ID change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/system-tests/rdscore/internal/rdscorecommon/ipvlan-validation.go`:
- Around line 306-344: Add validation (like in
VerifyIPVLANConnectivityBetweenDifferentNodes) to ensure at least one address is
configured before iterating: check RDSCoreConfig.IPVlanDeploy3TargetAddress or
RDSCoreConfig.IPVlanDeploy3TargetAddressIPv6 and if both are empty log an error
(or fail the test) with a clear message and skip/abort the Deploy3 loop; do the
same for RDSCoreConfig.IPVlanDeploy4TargetAddress and
RDSCoreConfig.IPVlanDeploy4TargetAddressIPv6 before the Deploy4 loop so
verifySRIOVConnectivity is only called when a target address exists.
- Around line 161-199: The loop blocks for IPVlan groups lack a check that at
least one target address is configured; add the same validation used in
macvlan-validation.go: for the first group ensure
RDSCoreConfig.IPVlanDeploy1TargetAddress or
RDSCoreConfig.IPVlanDeploy1TargetAddressIPv6 is non-empty and for the second
group ensure RDSCoreConfig.IPVlanDeploy2TargetAddress or
RDSCoreConfig.IPVlanDeploy2TargetAddressIPv6 is non-empty, and if both are empty
fail the test immediately (use the test failure helper in this test suite, e.g.,
t.Fatalf or framework.Failf) before iterating and calling
verifySRIOVConnectivity.

In `@tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go`:
- Around line 265-277: The current validation only checks that an IP address
requires a subnet mask (bondInfIPv4 -> bondInfSubMaskIPv4 and bondInfIPv6 ->
bondInfSubMaskIPv6) but does not reject the inverse (a subnet mask provided
without a corresponding address); update the validation to be symmetric by also
returning an error when bondInfSubMaskIPv4 != "" && bondInfIPv4 == "" and when
bondInfSubMaskIPv6 != "" && bondInfIPv6 == "" so you fail fast on malformed
configs and surface bad test input; keep the existing log messages
(klog.V(100).Infof) and use analogous error text for the added checks.

---

Nitpick comments:
In `@tests/system-tests/rdscore/internal/rdscorecommon/egress-service.go`:
- Around line 385-410: The switch's default branch creates an IPv4 SingleStack
service even when neither remoteTargetIP nor remoteTargetIPv6 is provided;
update the logic to validate inputs before the switch (check remoteTargetIP and
remoteTargetIPv6) and fail fast with a clear error/expectation when both are
empty instead of calling svcBuilder.WithIPFamily; locate the decision using the
variables remoteTargetIP and remoteTargetIPv6 and the call sites to
svcBuilder.WithIPFamily (and the switch that sets
corev1.IPFamilyPolicyRequireDualStack / corev1.IPFamilyPolicySingleStack) and
replace the default path with an early return or explicit test failure that
logs/returns a descriptive message about missing target addresses.

In `@tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go`:
- Around line 304-329: The block guarded by nodeWentNotReady currently assumes
the node is stable when a boot-ID change was seen; add a short stabilization
step before proceeding (either a small time.Sleep of ~10-15s or an explicit
quick Ready check) to avoid flapping: locate the branch using nodeWentNotReady
and _node.Definition.Name (the Eventually that pulls currentNode via nodes.Pull
and checks condition.Type == rdscoreparams.ConditionTypeReadyString) and insert
a brief pause or an immediate fast verification loop that asserts
condition.Status == rdscoreparams.ConstantTrueString before
uncordoning/continuing so the node has a small margin to stabilize after boot-ID
change.

In
`@tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go`:
- Around line 493-499: The DualStackTCPServer currently sets address_family =
socket.AF_INET6 and binds to ("::") but doesn’t explicitly disable IPV6_V6ONLY;
update DualStackTCPServer (the class definition) to override server_bind (or
before binding) and call self.socket.setsockopt(socket.IPPROTO_IPV6,
socket.IPV6_V6ONLY, 0) prior to calling the base server_bind so the socket
accepts IPv4-mapped connections; ensure this logic runs before binding the
("::", PORT) socket used with RemoteWriteHandler.

In `@tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go`:
- Around line 1363-1390: The test currently picks a single target IP via the
switch (RDSCoreConfig.PodLevelBondDeploymentTwoIPv4 /
PodLevelBondDeploymentTwoIPv6) so dual-stack runs only exercise IPv4 failover;
change the logic in the test around target selection and the call to
generateTCPTraffic so it runs once per configured family instead of choosing
one: iterate over the configured addresses (check
RDSCoreConfig.PodLevelBondDeploymentTwoIPv4 and PodLevelBondDeploymentTwoIPv6),
for each non-empty address set targetIP, log and call
generateTCPTraffic(clientPodObj, targetIP, RDSCoreConfig.PodLevelBondPort, "10",
"5") and assert Expect(err).ToNot(HaveOccurred(...)) per-family; alternatively
add a parameter to control preferred family and run the step for both when both
are present (use the same generateTCPTraffic and logging patterns).

In
`@tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go`:
- Around line 243-244: The IPv6 regex assigned to hasIPv6 misses compressed
addresses starting with "::"; update the validation so hasIPv6 recognizes
leading-compressed forms (e.g., ::1/128, ::ffff:192.168.1.1/96) by replacing the
current regex with one that allows an initial "::" compression variant or,
better, switch to parsing using the standard library (e.g., use
net.ParseCIDR/net.ParseIP on the substring) to robustly detect IPv6 CIDRs;
change the logic around the hasIPv6 variable accordingly so both compressed and
full IPv6 addresses are accepted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 52ac1979-e292-4969-af4b-b5767d0a9a5a

📥 Commits

Reviewing files that changed from the base of the PR and between 576df61 and 6621e20.

📒 Files selected for processing (8)
  • tests/system-tests/rdscore/internal/rdscorecommon/egress-ip.go
  • tests/system-tests/rdscore/internal/rdscorecommon/egress-service.go
  • tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go
  • tests/system-tests/rdscore/internal/rdscorecommon/ipvlan-validation.go
  • tests/system-tests/rdscore/internal/rdscorecommon/macvlan-validation.go
  • tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go
  • tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go
  • tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go

Comment thread tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go (1)

1373-1412: Move targetIP selection (and its guard) out of the goroutine.

Determining targetIP inside the goroutine forces reliance on GinkgoRecover to surface the "neither IPv4 nor IPv6 configured" failure, and the return at line 1387 is unreachable because Fail(...) panics. In practice the default branch is also effectively dead — the preceding verifyConnectivity() call at line 1343 would have already asserted at least one server IP via lines 1049‑1050 — so this guard only obscures control flow.

Resolving targetIP before launching the goroutine makes the failure mode explicit on the main test goroutine and removes the dead return.

♻️ Suggested refactor
+	// Determine target IP based on configuration priority: IPv4 first, then IPv6
+	var targetIP string
+
+	switch {
+	case RDSCoreConfig.PodLevelBondDeploymentTwoIPv4 != "":
+		targetIP = RDSCoreConfig.PodLevelBondDeploymentTwoIPv4
+	case RDSCoreConfig.PodLevelBondDeploymentTwoIPv6 != "":
+		targetIP = RDSCoreConfig.PodLevelBondDeploymentTwoIPv6
+	default:
+		Fail("Neither IPv4 nor IPv6 server address is configured for the server deployment")
+	}
+
 	go func() {
 		defer GinkgoRecover()
 
-		// Determine target IP based on configuration priority: IPv4 first, then IPv6
-		var targetIP string
-
-		switch {
-		case RDSCoreConfig.PodLevelBondDeploymentTwoIPv4 != "":
-			targetIP = RDSCoreConfig.PodLevelBondDeploymentTwoIPv4
-		case RDSCoreConfig.PodLevelBondDeploymentTwoIPv6 != "":
-			targetIP = RDSCoreConfig.PodLevelBondDeploymentTwoIPv6
-		default:
-			Fail("Neither IPv4 nor IPv6 server address is configured for the server deployment")
-
-			return
-		}
-
 		By(fmt.Sprintf("Send data from the client container to the server address %s", targetIP))
 
 		klog.V(100).Infof("Using server address %s for TCP traffic generation", targetIP)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go`
around lines 1373 - 1412, The goroutine currently selects targetIP and guards
with Fail/return inside the anonymous func (which relies on GinkgoRecover and
contains a dead return), so move the selection and guard for targetIP out of the
go func and perform it on the main test goroutine before launching the
goroutine; compute targetIP using the same switch (checking
RDSCoreConfig.PodLevelBondDeploymentTwoIPv4 then PodLevelBondDeploymentTwoIPv6),
call Fail(...) on the main goroutine if neither is set, then pass the resolved
targetIP into the goroutine and keep the rest of the logic (including calls to
generateTCPTraffic and scanClientPodTrafficOutput) unchanged.
tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go (2)

245-246: Regexes run against raw JSON — note false-positive surface.

Both patterns are applied to the entire spec.config string (JSON), not only to ipRanges/subnet values. The IPv4 pattern \d+\.\d+\.\d+\.\d+/\d+ could match any dotted-quad-with-slash that happens to appear elsewhere in the config (e.g., a route string, a comment-like field, or a gateway entry embedded in another plugin section). For a more robust detection, consider unmarshalling the config JSON and inspecting ipam.ipRanges[*].range / ipam.range directly. Low priority given current NAD shapes are controlled, but worth tightening if NADs diversify.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go`
around lines 245 - 246, The current detection uses regexes on the raw
spec.config string (variables hasIPv4/hasIPv6) which can yield false positives;
instead unmarshal spec.config JSON and inspect the IPAM fields directly (check
ipam.ipRanges[*].range and ipam.range/subnet entries) to determine presence of
IPv4/IPv6 ranges, then set hasIPv4/hasIPv6 accordingly; update the code around
the MatchString usage to parse with encoding/json, traverse to the ipam
structure, and only apply CIDR detection to those specific fields.

245-258: Consider failing fast when no IP family is detected.

Falling through to IPv4 as the default when neither regex matches can silently mask a malformed/unexpected NAD spec.config (e.g., a NAD that doesn't embed CIDRs in the top-level config string, such as when ipRanges live under a different key/structure), leading to a mis-configured Service downstream. Consider returning an error/Fail when both hasIPv4 and hasIPv6 are false so the test surfaces the misconfiguration instead of defaulting silently.

Proposed change
 	switch {
 	case hasIPv4 && hasIPv6:
 		return []corev1.IPFamily{corev1.IPv4Protocol, corev1.IPv6Protocol}, corev1.IPFamilyPolicyRequireDualStack
 	case hasIPv6:
 		return []corev1.IPFamily{corev1.IPv6Protocol}, corev1.IPFamilyPolicySingleStack
-	default:
+	case hasIPv4:
 		return []corev1.IPFamily{corev1.IPv4Protocol}, corev1.IPFamilyPolicySingleStack
+	default:
+		Fail(fmt.Sprintf("NAD %q spec.config contains no detectable IPv4 or IPv6 CIDR", nadName))
+		return nil, ""
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go`
around lines 245 - 258, The code currently defaults to IPv4 when neither hasIPv4
nor hasIPv6 is true, which can hide malformed NAD config; change the logic so
when both hasIPv4 and hasIPv6 are false the test fails fast instead of returning
IPv4: update the switch/default branch that checks hasIPv4/hasIPv6 to return an
error or call test fail (e.g., t.Fatalf or require.Fail) with a clear message
including nadName and config, or change the function signature to return ([],
policy, error) and propagate the error—modify the branches around the
hasIPv4/hasIPv6 checks and the default case accordingly.
tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go (1)

301-302: Include node name in the failure message.

When this assertion fails across many nodes in a suite run, "Node hasn't reached notReady state and boot ID hasn't changed" makes triage harder than it needs to be. Interpolate the node name (as the Ready-wait message should as well).

📝 Suggested tweak
 		}).WithTimeout(25*time.Minute).WithPolling(15*time.Second).WithContext(ctx).Should(BeTrue(),
-			"Node hasn't reached notReady state and boot ID hasn't changed")
+			fmt.Sprintf("Node %q hasn't reached notReady state and boot ID hasn't changed",
+				_node.Definition.Name))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go` around
lines 301 - 302, The assertion call that ends with
.WithTimeout(...).WithPolling(...).WithContext(ctx).Should(BeTrue(), "Node
hasn't reached notReady state and boot ID hasn't changed") should include the
node identifier to aid triage; update the failure message passed to
Should(BeTrue, ...) to interpolate the node name (e.g. use node.Name or
nodeName) so it reads something like "Node <nodeName> hasn't reached notReady
state and boot ID hasn't changed"; keep the change localized to the Should(...)
call used alongside WithTimeout/WithPolling/WithContext so the Ready-wait
messages remain consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go`:
- Around line 304-335: The current fast-reboot branch skips the Ready wait
(nodeWentNotReady false path) and only sleeps 15s before uncordoning; change so
we always wait for the node to reach Ready before uncordoning. Replace the else
branch (the 15s sleep) with the same Eventually readiness check used in the
nodeWentNotReady true branch (or extract that check into a helper like
waitForNodeReady(name, ctx) and call it from both branches), referencing
currentNode.Object.Status.Conditions and rdscoreparams.ConditionTypeReadyString
/ ConstantTrueString to detect readiness; keep the existing logging around
nodeWentNotReady for informational purposes if desired.

---

Nitpick comments:
In `@tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go`:
- Around line 301-302: The assertion call that ends with
.WithTimeout(...).WithPolling(...).WithContext(ctx).Should(BeTrue(), "Node
hasn't reached notReady state and boot ID hasn't changed") should include the
node identifier to aid triage; update the failure message passed to
Should(BeTrue, ...) to interpolate the node name (e.g. use node.Name or
nodeName) so it reads something like "Node <nodeName> hasn't reached notReady
state and boot ID hasn't changed"; keep the change localized to the Should(...)
call used alongside WithTimeout/WithPolling/WithContext so the Ready-wait
messages remain consistent.

In `@tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go`:
- Around line 1373-1412: The goroutine currently selects targetIP and guards
with Fail/return inside the anonymous func (which relies on GinkgoRecover and
contains a dead return), so move the selection and guard for targetIP out of the
go func and perform it on the main test goroutine before launching the
goroutine; compute targetIP using the same switch (checking
RDSCoreConfig.PodLevelBondDeploymentTwoIPv4 then PodLevelBondDeploymentTwoIPv6),
call Fail(...) on the main goroutine if neither is set, then pass the resolved
targetIP into the goroutine and keep the rest of the logic (including calls to
generateTCPTraffic and scanClientPodTrafficOutput) unchanged.

In
`@tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go`:
- Around line 245-246: The current detection uses regexes on the raw spec.config
string (variables hasIPv4/hasIPv6) which can yield false positives; instead
unmarshal spec.config JSON and inspect the IPAM fields directly (check
ipam.ipRanges[*].range and ipam.range/subnet entries) to determine presence of
IPv4/IPv6 ranges, then set hasIPv4/hasIPv6 accordingly; update the code around
the MatchString usage to parse with encoding/json, traverse to the ipam
structure, and only apply CIDR detection to those specific fields.
- Around line 245-258: The code currently defaults to IPv4 when neither hasIPv4
nor hasIPv6 is true, which can hide malformed NAD config; change the logic so
when both hasIPv4 and hasIPv6 are false the test fails fast instead of returning
IPv4: update the switch/default branch that checks hasIPv4/hasIPv6 to return an
error or call test fail (e.g., t.Fatalf or require.Fail) with a clear message
including nadName and config, or change the function signature to return ([],
policy, error) and propagate the error—modify the branches around the
hasIPv4/hasIPv6 checks and the default case accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8100d46d-b8a5-4346-bca7-0075d3922202

📥 Commits

Reviewing files that changed from the base of the PR and between 6621e20 and c273fa5.

📒 Files selected for processing (6)
  • tests/system-tests/rdscore/internal/rdscorecommon/egress-service.go
  • tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go
  • tests/system-tests/rdscore/internal/rdscorecommon/ipvlan-validation.go
  • tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go
  • tests/system-tests/rdscore/internal/rdscorecommon/sriov-pod-level-bond.go
  • tests/system-tests/rdscore/internal/rdscorecommon/whereabouts-statefulset.go
✅ Files skipped from review due to trivial changes (1)
  • tests/system-tests/rdscore/internal/rdscorecommon/egress-service.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/system-tests/rdscore/internal/rdscorecommon/monitoring-config-validation.go
  • tests/system-tests/rdscore/internal/rdscorecommon/ipvlan-validation.go

Comment thread tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/system-tests/rdscore/internal/rdscorecommon/nmi-redfish.go (1)

113-142: Consider tightening the post-BootID-change Ready wait.

Once waitForBootIDChange returns (up to 25 min), the node has demonstrably rebooted, so the subsequent Eventually for Ready state (lines 120-142) effectively only needs to cover kubelet re-registration time. The current 25-minute timeout on both stages yields a worst-case wall time of ~50 minutes per node, which may mask real regressions and slow CI. A shorter Ready timeout (e.g., 10 minutes) after BootID change would be more diagnostic without sacrificing reliability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/system-tests/rdscore/internal/rdscorecommon/nmi-redfish.go` around
lines 113 - 142, After waitForBootIDChange(...) returns, shorten the subsequent
Eventually timeout that waits for the node Ready condition from 25*time.Minute
to a smaller value (suggest 10*time.Minute) so we only cover kubelet
re-registration; update the WithTimeout(...) call on the Eventually block that
checks currentNode.Object.Status.Conditions (the block referencing
node.Definition.Name, rdscoreparams.ConditionTypeReadyString and
rdscoreparams.ConstantTrueString) to use 10*time.Minute (or a configurable
constant) instead of 25*time.Minute.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/system-tests/rdscore/internal/rdscorecommon/nmi-redfish.go`:
- Around line 113-142: After waitForBootIDChange(...) returns, shorten the
subsequent Eventually timeout that waits for the node Ready condition from
25*time.Minute to a smaller value (suggest 10*time.Minute) so we only cover
kubelet re-registration; update the WithTimeout(...) call on the Eventually
block that checks currentNode.Object.Status.Conditions (the block referencing
node.Definition.Name, rdscoreparams.ConditionTypeReadyString and
rdscoreparams.ConstantTrueString) to use 10*time.Minute (or a configurable
constant) instead of 25*time.Minute.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bc5c7789-a6d7-4437-96ba-168d059aef08

📥 Commits

Reviewing files that changed from the base of the PR and between bf1b447 and 089bf09.

📒 Files selected for processing (2)
  • tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go
  • tests/system-tests/rdscore/internal/rdscorecommon/nmi-redfish.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/system-tests/rdscore/internal/rdscorecommon/hard-reboot.go

Copy link
Copy Markdown
Collaborator

@yprokule yprokule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split this into multiple PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants