Skip to content

CLB 3rd PR: ServiceGateway Azure operations layer#10549

Open
georgeedward2000 wants to merge 18 commits into
kubernetes-sigs:masterfrom
georgeedward2000:cp2-azure-ops
Open

CLB 3rd PR: ServiceGateway Azure operations layer#10549
georgeedward2000 wants to merge 18 commits into
kubernetes-sigs:masterfrom
georgeedward2000:cp2-azure-ops

Conversation

@georgeedward2000

Copy link
Copy Markdown
Contributor

Draft - stacked PR. Based on #10068 (CLB 2nd PR,DiffTracker core) and #10543 (azclient v0.20.9 bump), neither yet in master. Please review only the top commit CP2: ServiceGatewayAzure operations layer (pkg/consts +pkg/provider/difftracker/*). The vendored azclient churn and DiffTracker-core files belong to the base PRs and will disappear once they merge and this branch is rebased onto master.

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds the ServiceGateway Azure operations layer for Container/Service Load Balancer: Public IP, Load Balancer, and NAT Gateway CRUD plus ServiceGateway service and address-location updates, along with the DTO types and mappers that translate DiffTracker sync output into Azure SDK requests.

Which issue(s) this PR fixes:

Part of #10428

Special notes for your reviewer:

Stacked on #10068 and #10543. Will be rebased onto master (dropping the base commits) once those merge, leaving only the files of this operations layer.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-kind labels Jun 17, 2026
@github-actions github-actions Bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 17, 2026
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 17, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Hi @georgeedward2000. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 17, 2026
@k8s-ci-robot k8s-ci-robot requested review from Liunardy and nilo19 June 17, 2026 12:24
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: georgeedward2000
Once this PR has been reviewed and has the lgtm label, please assign liunardy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nilo19

nilo19 commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Part of #10428

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 19, 2026
@@ -0,0 +1,478 @@
// Package difftracker provides state tracking and synchronization between Kubernetes

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add license text

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added the Apache license header to azure_operations.go, and also to resource_helpers.go which was missing it too.

"strings"

"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v9"
v1 "k8s.io/api/core/v1"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we separate k8s.* and others?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, based on the other files, we don’t separate them. Our goimports config sets local-prefixes to sigs.k8s.io/cloud-provider-azure, so there are three groups: stdlib, third party (github.com and k8s.io together), and our local module. This matches azure_loadbalancer.go and the rest of pkg/provider, so the current grouping is already correct and lint clean.


// createOrUpdatePIP creates or updates a public IP address
func (dt *DiffTracker) createOrUpdatePIP(ctx context.Context, pipResourceGroup string, pip *armnetwork.PublicIPAddress) error {
_, err := dt.createOrUpdatePIPWithResponse(ctx, pipResourceGroup, pip)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we ever need the response?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the inbound path (in the engine PR) uses the response to read the allocated PIP IP and patch it into the Service status. The thin createOrUpdatePIP wrapper is used for the outbound path, which doesn’t require the IP. Both behaviors are intentional.

klog.Infof("createOrUpdatePIPWithResponse(%s): start", pipName)

response, err := dt.networkClientFactory.GetPublicIPAddressClient().CreateOrUpdate(ctx, pipResourceGroup, pipName, *pip)
klog.V(10).Infof("PublicIPAddressClient.CreateOrUpdate(%s, %s): end", pipResourceGroup, pipName)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which log level we should have here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, the levels here are inconsistent. V(10) is effectively never printed, while the matching start log is at Infof (always on), so the pair is mismatched. I’ll standardize the per‑SDK call begin/end traces at V(4) to align with the rest of the package, keep failures at Warningf/Errorf, and reserve Infof/V(2) only for meaningful state changes.

}
klog.Infof("createOrUpdatePIPWithResponse(%s): start", pipName)

response, err := dt.networkClientFactory.GetPublicIPAddressClient().CreateOrUpdate(ctx, pipResourceGroup, pipName, *pip)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just making sure - it's ok to call this from multiple goroutines?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot said:

Yes. The function keeps no shared mutable state: it only reads networkClientFactory, which is set once at construction, plus local variables. The Azure SDK clients are azcore-based and safe for concurrent use. On top of that, ServiceUpdater (engine PR) serializes per service (via the activeOperations map under a mutex), and each service has its own PIP name (uid-pip), so two goroutines never call CreateOrUpdate on the same PIP. Concurrency is also capped at 10 (currently) by the semaphore.

}

// newIgnoreCaseSetFromSlice creates an IgnoreCaseSet from a slice of strings
func newIgnoreCaseSetFromSlice(items []string) *utilsets.IgnoreCaseSet {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do utilsets.NewString(items...) instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Replaced the helper with utilsets.NewString(serviceUID) at all call sites, since NewString is variadic, and removed newIgnoreCaseSetFromSlice. I also dropped its two tests, as NewString is already covered in pkg/util/sets.

case intstr.String:
klog.Warningf("ExtractInboundConfigFromService: named targetPort %q is not supported for service %s/%s; rejecting",
port.TargetPort.StrVal, service.Namespace, service.Name)
return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this return nil or separate err? not sure how this will be used

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to not return nil. Returning nil here would silently drop the port and break the service. Instead, I fall back to  backendPort = port.Port  (same default as when TargetPort is unset) and log a warning. We can switch to a hard error later if we decide named ports shouldn't be allowed.


// GetServicesToSync handles the synchronization of services between K8s and NRP
func GetServicesToSync(k8sServices, nrpServices *utilsets.IgnoreCaseSet) SyncServicesReturnType {
klog.V(2).Infof("GetServicesToSync: K8s services (%d): %v", k8sServices.Len(), k8sServices.UnsortedList())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably call this only once if needed, iiuc each invocation enumerates it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is part of the base DiffTracker PR #10068, not the ServiceGateway Azure ops layer in this PR, so the fix should land there. The concern is valid though: the UnsortedList args are evaluated before the klog call regardless of verbosity, so the enumeration runs even when v=2 is off. The fix is to wrap these logs in if klog.V(2).Enabled() so we only enumerate when the line will actually print. I will track it against #10068.

Removals: utilsets.NewString(),
}

for _, service := range k8sServices.UnsortedList() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use sets functions instead of those loops?

Addresses: make(map[string]Address),
}
} else {
locationData := findLocationData(result, location)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this helper really saving much here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is part of the base DiffTracker PR #10068, not the ServiceGateway ops layer in this PR, so this should be addressed there. On the point itself, I agree: the helper is a one-line map lookup used in a single place, so it doesn’t add much value. I would inline it as loc, ok := result.Locations[location] and use ok directly. As a side note, it currently returns the address of the range copy rather than the map entry, so a comma‑ok lookup is also cleaner and safer. I will track this against #10068.

Introduces the difftracker package with core K8s/NRP state tracking,
diff computation, and state mutation logic.
Includes comprehensive test coverage (25+ test functions).
- Added unit test `TestUpdateK8sPodRemoveUsesStoredIdentity` to verify correct decrementing of outbound identity reference counts when removing pods.
- Updated `removePod` method to return the stored PublicOutboundIdentity of the removed pod for accurate reference counting.
- Refactored `updateK8sPodLocked` to decrement the reference count using the authoritative identity from the pod rather than the input.
- Improved logging verbosity in `GetServicesToSync` and `DeepEqual` methods for better traceability during debugging.
- Introduced lock-free methods for synchronization operations to enhance performance and reduce contention.
- Enhanced enum string representations to handle out-of-range values gracefully.
- Simplified `Equals` method in `IgnoreCaseSet` to improve efficiency by removing redundant checks.
…dd set operations

- Updated logging statements in UpdateNRPLoadBalancers and UpdateNRPNATGateways to remove unnecessary newline characters.
- Simplified the GetServicesToSync function by utilizing set operations for additions and removals.
- Enhanced equality checks in DiffTracker, SyncDiffTrackerReturnType, and LocationData to cover more edge cases.
- Introduced Difference method in IgnoreCaseSet for set difference operations.
- Added unit tests for new functionality and edge cases in LocationData and DiffTracker equality checks.
- Removed unused helper functions and improved code readability.
DeepEqual() had no production callers; GetSyncOperations uses the lock-free
deepEqualLocked() body under a single held lock. Tests now call
deepEqualLocked() directly.
The full-update branch builds a fresh address map, so guarding the delete with
!isFullUpdate only skipped a no-op while reading as an inconsistency. Restore the
unconditional delete: empty-ServiceRef addresses are always removed, and an
all-empty location is dropped by the trailing cleanup.
Make the pod Remove path identity-aware: clear only the matching outbound
identity and delete the pod only once it has no inbound or outbound identities
left. Previously removePod deleted the whole pod record, silently dropping the
inbound (LoadBalancer) backing of a pod that was both an egress pod and an LB
backend when its egress was removed/changed.

Fix the whole-node removal path in getSyncLocationsAddressesLocked: when a node
is gone from K8s but still present in NRP, enumerate each NRP address with an
empty ServiceRef instead of emitting an empty Addresses map. An empty Addresses
map under PartialUpdate is a no-op on the Service Gateway, so the stale addresses
were never removed while the local model dropped the location, leaving NRP and
the SGW permanently out of sync.

Also make the PublicOutboundIdentity comparison in DiffTracker.Equals
case-insensitive for consistency with the rest of the package.

Adds regression tests for both fixes and updates TestUpdateK8sPod to pass the
egress identity on Remove (matching how the engine drives it).
Emitting an empty Addresses map under PartialUpdate already deletes the whole
location on the Service Gateway (a null/empty addresses array deletes the
location). Enumerating each address with an empty ServiceRef instead deletes the
addresses but can leave an empty location container behind, which is strictly
worse than the location-level deletion. Revert getSyncLocationsAddressesLocked's
gone-node branch to emit an empty Addresses map and update the test accordingly.

The egress-remove inbound-loss fix and the EqualFold change from the previous
commit are kept.
The bare "ready" collides with the conventional Service readiness meaning (a
Service is "ready" once its external IP is provisioned and serving traffic).
isServiceReadyToSync names the actual intent - whether the service is ready to
be synced to the Service Gateway - without overloading "ready".
Adds the Azure operations + resource-builder layer for Container LB:
- azure_operations.go: PIP/LB/NAT Gateway CRUD + ServiceGateway service/location updates
- resource_helpers.go: inbound/outbound resource builders, config extraction
- dto_mappers.go: K8s updates -> ServiceGateway ServicesDataDTO mapper
- types.go: service config + ServiceGateway DTO types (deferred from CP1)
- consts.go: NatGatewayIDTemplate
Adapted to public armnetwork/v9 SDK. Tests ported from the engine branch.
Address review findings in the Azure operations layer:
- Idempotent deletes (treat 404 as success); delete LB by UID without a
  cluster-wide Service list.
- Service request mapping: no empty PublicNatGatewayID on removals, error on
  unknown ServiceType, allow empty FullUpdate, warn on unknown update action.
- LB rules: only set EnableTCPReset on TCP, case-insensitive protocol with
  rejection of unsupported ones, frontend/backend port range validation, apply
  configured idle timeout.
- Reject named targetPort instead of silently misrouting.
- Preserve dual-stack/hostname ingress when programming the LB IP and retry on
  conflict.
- Make NAT gateway disassociation idempotent and independent per side.
- Nil-guard createOrUpdatePIPWithResponse; return "" for missing ID segments.
@kubernetes-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: georgeedward2000
Once this PR has been reviewed and has the lgtm label, please assign liunardy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@georgeedward2000 georgeedward2000 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed comments

@@ -0,0 +1,478 @@
// Package difftracker provides state tracking and synchronization between Kubernetes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added the Apache license header to azure_operations.go, and also to resource_helpers.go which was missing it too.

"strings"

"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v9"
v1 "k8s.io/api/core/v1"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, based on the other files, we don’t separate them. Our goimports config sets local-prefixes to sigs.k8s.io/cloud-provider-azure, so there are three groups: stdlib, third party (github.com and k8s.io together), and our local module. This matches azure_loadbalancer.go and the rest of pkg/provider, so the current grouping is already correct and lint clean.


// createOrUpdatePIP creates or updates a public IP address
func (dt *DiffTracker) createOrUpdatePIP(ctx context.Context, pipResourceGroup string, pip *armnetwork.PublicIPAddress) error {
_, err := dt.createOrUpdatePIPWithResponse(ctx, pipResourceGroup, pip)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the inbound path (in the engine PR) uses the response to read the allocated PIP IP and patch it into the Service status. The thin createOrUpdatePIP wrapper is used for the outbound path, which doesn’t require the IP. Both behaviors are intentional.

}
klog.Infof("createOrUpdatePIPWithResponse(%s): start", pipName)

response, err := dt.networkClientFactory.GetPublicIPAddressClient().CreateOrUpdate(ctx, pipResourceGroup, pipName, *pip)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot said:

Yes. The function keeps no shared mutable state: it only reads networkClientFactory, which is set once at construction, plus local variables. The Azure SDK clients are azcore-based and safe for concurrent use. On top of that, ServiceUpdater (engine PR) serializes per service (via the activeOperations map under a mutex), and each service has its own PIP name (uid-pip), so two goroutines never call CreateOrUpdate on the same PIP. Concurrency is also capped at 10 (currently) by the semaphore.

klog.Infof("createOrUpdatePIPWithResponse(%s): start", pipName)

response, err := dt.networkClientFactory.GetPublicIPAddressClient().CreateOrUpdate(ctx, pipResourceGroup, pipName, *pip)
klog.V(10).Infof("PublicIPAddressClient.CreateOrUpdate(%s, %s): end", pipResourceGroup, pipName)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, the levels here are inconsistent. V(10) is effectively never printed, while the matching start log is at Infof (always on), so the pair is mismatched. I’ll standardize the per‑SDK call begin/end traces at V(4) to align with the rest of the package, keep failures at Warningf/Errorf, and reserve Infof/V(2) only for meaningful state changes.


// Build LB rules and probes from config
var lbRules []*armnetwork.LoadBalancingRule
var probes []*armnetwork.Probe

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Removed the probes variable entirely since it was always nil, and the Probes field now defaults to nil. I kept a short comment explaining that PodIP backend pools do not support health probes, which is why we do not create any.

ruleName, frontendPort.Port, backendPort, frontendPort.Protocol, serviceUID)
}
} else {
klog.V(2).Infof("buildInboundServiceResources: no port configuration provided for service %s, creating LB without rules", serviceUID)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defensive now. After the named targetPort fix, a service with ports always yields frontend ports, and a LoadBalancer service always has at least one, so we no longer build a rule-less LB for real services. If this path were ever hit and ports were added later, UpdateService would diff the config and updateInboundService would rebuild the LB with rules, so it still reconciles correctly.

}

// newIgnoreCaseSetFromSlice creates an IgnoreCaseSet from a slice of strings
func newIgnoreCaseSetFromSlice(items []string) *utilsets.IgnoreCaseSet {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Replaced the helper with utilsets.NewString(serviceUID) at all call sites, since NewString is variadic, and removed newIgnoreCaseSetFromSlice. I also dropped its two tests, as NewString is already covered in pkg/util/sets.


// GetServicesToSync handles the synchronization of services between K8s and NRP
func GetServicesToSync(k8sServices, nrpServices *utilsets.IgnoreCaseSet) SyncServicesReturnType {
klog.V(2).Infof("GetServicesToSync: K8s services (%d): %v", k8sServices.Len(), k8sServices.UnsortedList())

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is part of the base DiffTracker PR #10068, not the ServiceGateway Azure ops layer in this PR, so the fix should land there. The concern is valid though: the UnsortedList args are evaluated before the klog call regardless of verbosity, so the enumeration runs even when v=2 is off. The fix is to wrap these logs in if klog.V(2).Enabled() so we only enumerate when the line will actually print. I will track it against #10068.

Addresses: make(map[string]Address),
}
} else {
locationData := findLocationData(result, location)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is part of the base DiffTracker PR #10068, not the ServiceGateway ops layer in this PR, so this should be addressed there. On the point itself, I agree: the helper is a one-line map lookup used in a single place, so it doesn’t add much value. I would inline it as loc, ok := result.Locations[location] and use ok directly. As a side note, it currently returns the address of the range copy rather than the map entry, so a comma‑ok lookup is also cleaner and safer. I will track this against #10068.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants