Skip to content

fix: Reject reserved IP addresses on interface update#223

Open
fabiendupont wants to merge 1 commit intoNVIDIA:mainfrom
fabiendupont:fix/ipam-reject-reserved-ip-addresses
Open

fix: Reject reserved IP addresses on interface update#223
fabiendupont wants to merge 1 commit intoNVIDIA:mainfrom
fabiendupont:fix/ipam-reject-reserved-ip-addresses

Conversation

@fabiendupont
Copy link
Collaborator

Summary

  • Add ValidateIPAddresses utility that checks IPs against a CIDR prefix, rejecting network address, broadcast address (IPv4), gateway address, out-of-prefix IPs, and unparseable strings
  • Add GetInterfaceCIDRAndGateway helper to extract prefix info from an interface's Subnet or VpcPrefix relation
  • Validate IPs in UpdateInstancesInDB before persisting them — when reserved IPs are detected, the interface is set to Error status and invalid addresses are not stored
  • Self-healing: the interface recovers automatically on the next inventory cycle once the Site Controller reports valid IPs

Context

The Site Controller allocated a broadcast address to an instance within a VPC Prefix, which was persisted without validation, causing network issues. The embedded IPAM library already blocks network/broadcast on AcquireIP(), but the interface IP write path from inventory bypassed that check entirely.

Test plan

  • 12 unit tests for ValidateIPAddresses covering IPv4/IPv6, network/broadcast/gateway rejection, out-of-prefix, invalid strings, small prefixes, empty lists, invalid CIDR fallback
  • 3 unit tests for GetInterfaceCIDRAndGateway covering Subnet, VpcPrefix, and nil cases
  • Integration test: instance19 receives broadcast IP (192.172.0.255) for a 192.172.0.0/24 VPC Prefix interface — verifies interface is set to Error status with no IPs stored

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings March 10, 2026 13:01
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a security/correctness issue where the Site Controller could assign reserved IP addresses (network address, broadcast address, or gateway) to instances in VPC Prefix interfaces, and those invalid IPs would be persisted to the database without validation. It adds a ValidateIPAddresses utility and a GetInterfaceCIDRAndGateway helper in the IPAM package, then calls them during the UpdateInstancesInDB inventory processing loop.

Changes:

  • New ValidateIPAddresses function in the IPAM package that rejects network, broadcast, gateway, out-of-prefix, and unparseable addresses
  • New GetInterfaceCIDRAndGateway function that extracts the prefix CIDR and gateway from a Subnet or VpcPrefix relation
  • Integration into UpdateInstancesInDB — interfaces with rejected IPs are set to Error status and IPs are not persisted

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
db/pkg/db/ipam/ipam.go Adds ValidateIPAddresses and GetInterfaceCIDRAndGateway utilities
db/pkg/db/ipam/ipam_test.go Adds unit tests for both new functions
workflow/pkg/activity/instance/instance.go Integrates IP validation into the inventory update path
workflow/pkg/activity/instance/instance_test.go Adds integration test for the broadcast IP rejection scenario

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +983 to +992
t.Run("vpc prefix interface", func(t *testing.T) {
ifc := &cdbm.Interface{
VpcPrefix: &cdbm.VpcPrefix{
Prefix: "10.0.0.0",
PrefixLength: 16,
},
}
cidr, gw := GetInterfaceCIDRAndGateway(ifc)
assert.Equal(t, "10.0.0.0/16", cidr)
assert.Equal(t, "", gw)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit test for GetInterfaceCIDRAndGateway with a VPC Prefix interface uses Prefix: "10.0.0.0" (a bare IP address), but in production the VpcPrefix.Prefix field contains CIDR notation like "10.0.0.0/16" (set from childPrefix.Cidr in vpcprefix.go:263). The test should use the actual production format ("10.0.0.0/16") to properly validate this code path, and the expected result should be adjusted accordingly.

Copilot uses AI. Check for mistakes.
Comment on lines +307 to +311
if ifc.Subnet != nil && ifc.Subnet.IPv4Prefix != nil {
cidr = fmt.Sprintf("%s/%d", *ifc.Subnet.IPv4Prefix, ifc.Subnet.PrefixLength)
if ifc.Subnet.IPv4Gateway != nil {
gateway = *ifc.Subnet.IPv4Gateway
}
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GetInterfaceCIDRAndGateway function only extracts the CIDR from the Subnet's IPv4Prefix field and does not consider IPv6Prefix. This means subnet interfaces with only an IPv6 prefix will have CIDR returned as empty string, causing the IP validation in UpdateInstancesInDB to be silently skipped. If IPv6 subnets are in use or planned, this should be addressed to prevent reserved IPv6 addresses from being stored.

Copilot uses AI. Check for mistakes.
Comment on lines +312 to +313
} else if ifc.VpcPrefix != nil {
cidr = fmt.Sprintf("%s/%d", ifc.VpcPrefix.Prefix, ifc.VpcPrefix.PrefixLength)
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GetInterfaceCIDRAndGateway function constructs the CIDR for VpcPrefix interfaces using fmt.Sprintf("%s/%d", ifc.VpcPrefix.Prefix, ifc.VpcPrefix.PrefixLength). However, in production, VpcPrefix.Prefix is set to childPrefix.Cidr (from the IPAM library), which already includes the prefix length (e.g., "192.172.0.0/24"). This causes the CIDR to be constructed as "192.172.0.0/24/24" — an invalid CIDR that netip.ParsePrefix cannot parse.

As a result, ValidateIPAddresses will silently pass through all IPs without any validation whenever the interface has a VpcPrefix. The reserved IP rejection logic is therefore completely bypassed for VPC Prefix interfaces in production — which is exactly the scenario this PR is meant to fix.

The fix should either:

  1. Parse VpcPrefix.Prefix using netip.ParsePrefix and use only the masked address (stripping the existing prefix length if present), or
  2. Only use ifc.VpcPrefix.PrefixLength for the suffix if ifc.VpcPrefix.Prefix doesn't already contain a /, or
  3. Use fmt.Sprintf("%s/%d", strings.Split(ifc.VpcPrefix.Prefix, "/")[0], ifc.VpcPrefix.PrefixLength).

Note: This also makes the GetInterfaceCIDRAndGateway unit test misleading — it uses a bare IP address "10.0.0.0" as Prefix (not CIDR notation), which doesn't reflect the actual format stored in the database by the production code path.

Copilot uses AI. Check for mistakes.
@fabiendupont fabiendupont force-pushed the fix/ipam-reject-reserved-ip-addresses branch from 7ab150a to 0ff8f42 Compare March 10, 2026 14:47
@ajf
Copy link
Collaborator

ajf commented Mar 10, 2026

@thossain-nv @bcavnvidia Correct me if I'm wrong - The VPC prefix in this case (i.e a /28) when an instance is deployed is allocated a /31 out of the /28 so it should be impossible to allocate a broadcast address, since the underlying subnet isn't controllable and is always a /31

…terface update

The instance inventory workflow was persisting IP addresses from the Site
Controller without validation. This allowed reserved addresses (network,
broadcast, gateway) to be assigned to interfaces, causing network issues.

Add ValidateIPAddresses to check IPs against the interface's CIDR prefix.
When reserved IPs are detected, the interface is set to Error status and
invalid addresses are not persisted. The interface recovers automatically
on the next inventory cycle once the Site Controller reports valid IPs.

GetInterfaceCIDRAndGateway extracts the CIDR from Subnet (IPv4 and IPv6)
or VpcPrefix, handling the case where VpcPrefix.Prefix already contains
the prefix length from the IPAM library.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Copilot AI review requested due to automatic review settings March 10, 2026 16:52
@fabiendupont fabiendupont force-pushed the fix/ipam-reject-reserved-ip-addresses branch from 0ff8f42 to a17629e Compare March 10, 2026 16:52
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1069 to +1070
// Do not persist invalid IP addresses
ipAddresses = nil
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code sets ipAddresses = nil at line 1070 to prevent persisting invalid IP addresses, then passes it as IpAddresses: ipAddresses (i.e., nil) to interfaceDAO.Update. However, in db/pkg/db/model/interface.go (around line 422), the update logic only writes the ip_addresses column when input.IpAddresses != nil. Therefore, when a previously valid interface later receives a reserved IP, any IP addresses already stored in the database will remain — only the status changes to Error. This contradicts the comment // Do not persist invalid IP addresses and the test assertion's intent.

To actually clear the stored IPs, the update logic needs a way to explicitly set ip_addresses to an empty/null value. One approach is to use an empty slice ([]string{}) instead of nil and update the DAO to treat an empty (non-nil) slice as a clear operation, or add a separate ClearIPAddresses bool field to InterfaceUpdateInput. As-is, the self-healing behavior on the next cycle may overwrite the stale IPs with a valid set anyway, but the current cycle leaves stale IPs in an Error-status interface.

Suggested change
// Do not persist invalid IP addresses
ipAddresses = nil
// Do not persist invalid IP addresses: clear any stored IPs
ipAddresses = []string{}

Copilot uses AI. Check for mistakes.
rejectedIfc, serr := ifcDAO.GetByID(ctx, nil, ifc.ID, nil)
assert.Nil(t, serr)
assert.Equal(t, cdbm.InterfaceStatusError, rejectedIfc.Status, "interface with reserved IP should be in Error status")
assert.Nil(t, rejectedIfc.IPAddresses, "interface with reserved IP should not have IP addresses stored")
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration test for the rejected-IP scenario creates the interface ifcvpc_reserved_ip with no pre-existing IP addresses, so the assertion assert.Nil(t, rejectedIfc.IPAddresses, "interface with reserved IP should not have IP addresses stored") will pass even with the bug described above. The test does not verify the actual semantics of clearing pre-existing IP addresses (e.g., an interface that had valid IPs in one cycle and then receives a reserved IP in the next). Consider adding a test scenario where the interface already has stored IPs before the reserved-IP inventory arrives, to properly verify the clearing behavior.

Suggested change
assert.Nil(t, rejectedIfc.IPAddresses, "interface with reserved IP should not have IP addresses stored")
// IPAddresses must be absent: allow both nil and empty, but fail if any IPs remain
assert.Empty(t, rejectedIfc.IPAddresses, "interface with reserved IP should not have IP addresses stored")

Copilot uses AI. Check for mistakes.
@thossain-nv
Copy link
Contributor

Thanks @fabiendupont @bcavnvidia will you please take a look as to how we might be ending up with invalid IP addresses?

@bcavnvidia
Copy link
Contributor

The VPC prefix in this case (i.e a /28) when an instance is deployed is allocated a /31 out of the /28 so it should be impossible to allocate a broadcast address, since the underlying subnet isn't controllable and is always a /31

@ajf Is correct. A VPC prefix is really just a container for a bunch of /31s. All of them are valid for use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants