CLB 3rd PR: ServiceGateway Azure operations layer#10549
CLB 3rd PR: ServiceGateway Azure operations layer#10549georgeedward2000 wants to merge 18 commits into
Conversation
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Hi @georgeedward2000. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: georgeedward2000 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Part of #10428 |
e565025 to
e92dcf9
Compare
| @@ -0,0 +1,478 @@ | |||
| // Package difftracker provides state tracking and synchronization between Kubernetes | |||
There was a problem hiding this comment.
Done. Added the Apache license header to azure_operations.go, and also to resource_helpers.go which was missing it too.
| "strings" | ||
|
|
||
| "github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v9" | ||
| v1 "k8s.io/api/core/v1" |
There was a problem hiding this comment.
Well, based on the other files, we don’t separate them. Our goimports config sets local-prefixes to sigs.k8s.io/cloud-provider-azure, so there are three groups: stdlib, third party (github.com and k8s.io together), and our local module. This matches azure_loadbalancer.go and the rest of pkg/provider, so the current grouping is already correct and lint clean.
|
|
||
| // createOrUpdatePIP creates or updates a public IP address | ||
| func (dt *DiffTracker) createOrUpdatePIP(ctx context.Context, pipResourceGroup string, pip *armnetwork.PublicIPAddress) error { | ||
| _, err := dt.createOrUpdatePIPWithResponse(ctx, pipResourceGroup, pip) |
There was a problem hiding this comment.
Yes, the inbound path (in the engine PR) uses the response to read the allocated PIP IP and patch it into the Service status. The thin createOrUpdatePIP wrapper is used for the outbound path, which doesn’t require the IP. Both behaviors are intentional.
| klog.Infof("createOrUpdatePIPWithResponse(%s): start", pipName) | ||
|
|
||
| response, err := dt.networkClientFactory.GetPublicIPAddressClient().CreateOrUpdate(ctx, pipResourceGroup, pipName, *pip) | ||
| klog.V(10).Infof("PublicIPAddressClient.CreateOrUpdate(%s, %s): end", pipResourceGroup, pipName) |
There was a problem hiding this comment.
which log level we should have here?
There was a problem hiding this comment.
Good catch, the levels here are inconsistent. V(10) is effectively never printed, while the matching start log is at Infof (always on), so the pair is mismatched. I’ll standardize the per‑SDK call begin/end traces at V(4) to align with the rest of the package, keep failures at Warningf/Errorf, and reserve Infof/V(2) only for meaningful state changes.
| } | ||
| klog.Infof("createOrUpdatePIPWithResponse(%s): start", pipName) | ||
|
|
||
| response, err := dt.networkClientFactory.GetPublicIPAddressClient().CreateOrUpdate(ctx, pipResourceGroup, pipName, *pip) |
There was a problem hiding this comment.
just making sure - it's ok to call this from multiple goroutines?
There was a problem hiding this comment.
Copilot said:
Yes. The function keeps no shared mutable state: it only reads networkClientFactory, which is set once at construction, plus local variables. The Azure SDK clients are azcore-based and safe for concurrent use. On top of that, ServiceUpdater (engine PR) serializes per service (via the activeOperations map under a mutex), and each service has its own PIP name (uid-pip), so two goroutines never call CreateOrUpdate on the same PIP. Concurrency is also capped at 10 (currently) by the semaphore.
| } | ||
|
|
||
| // newIgnoreCaseSetFromSlice creates an IgnoreCaseSet from a slice of strings | ||
| func newIgnoreCaseSetFromSlice(items []string) *utilsets.IgnoreCaseSet { |
There was a problem hiding this comment.
can we do utilsets.NewString(items...) instead?
There was a problem hiding this comment.
Done. Replaced the helper with utilsets.NewString(serviceUID) at all call sites, since NewString is variadic, and removed newIgnoreCaseSetFromSlice. I also dropped its two tests, as NewString is already covered in pkg/util/sets.
| case intstr.String: | ||
| klog.Warningf("ExtractInboundConfigFromService: named targetPort %q is not supported for service %s/%s; rejecting", | ||
| port.TargetPort.StrVal, service.Namespace, service.Name) | ||
| return nil |
There was a problem hiding this comment.
should this return nil or separate err? not sure how this will be used
There was a problem hiding this comment.
Changed it to not return nil. Returning nil here would silently drop the port and break the service. Instead, I fall back to backendPort = port.Port (same default as when TargetPort is unset) and log a warning. We can switch to a hard error later if we decide named ports shouldn't be allowed.
|
|
||
| // GetServicesToSync handles the synchronization of services between K8s and NRP | ||
| func GetServicesToSync(k8sServices, nrpServices *utilsets.IgnoreCaseSet) SyncServicesReturnType { | ||
| klog.V(2).Infof("GetServicesToSync: K8s services (%d): %v", k8sServices.Len(), k8sServices.UnsortedList()) |
There was a problem hiding this comment.
we should probably call this only once if needed, iiuc each invocation enumerates it
There was a problem hiding this comment.
This file is part of the base DiffTracker PR #10068, not the ServiceGateway Azure ops layer in this PR, so the fix should land there. The concern is valid though: the UnsortedList args are evaluated before the klog call regardless of verbosity, so the enumeration runs even when v=2 is off. The fix is to wrap these logs in if klog.V(2).Enabled() so we only enumerate when the line will actually print. I will track it against #10068.
| Removals: utilsets.NewString(), | ||
| } | ||
|
|
||
| for _, service := range k8sServices.UnsortedList() { |
There was a problem hiding this comment.
can we use sets functions instead of those loops?
| Addresses: make(map[string]Address), | ||
| } | ||
| } else { | ||
| locationData := findLocationData(result, location) |
There was a problem hiding this comment.
is this helper really saving much here?
There was a problem hiding this comment.
This file is part of the base DiffTracker PR #10068, not the ServiceGateway ops layer in this PR, so this should be addressed there. On the point itself, I agree: the helper is a one-line map lookup used in a single place, so it doesn’t add much value. I would inline it as loc, ok := result.Locations[location] and use ok directly. As a side note, it currently returns the address of the range copy rather than the map entry, so a comma‑ok lookup is also cleaner and safer. I will track this against #10068.
Introduces the difftracker package with core K8s/NRP state tracking, diff computation, and state mutation logic. Includes comprehensive test coverage (25+ test functions).
- Added unit test `TestUpdateK8sPodRemoveUsesStoredIdentity` to verify correct decrementing of outbound identity reference counts when removing pods. - Updated `removePod` method to return the stored PublicOutboundIdentity of the removed pod for accurate reference counting. - Refactored `updateK8sPodLocked` to decrement the reference count using the authoritative identity from the pod rather than the input. - Improved logging verbosity in `GetServicesToSync` and `DeepEqual` methods for better traceability during debugging. - Introduced lock-free methods for synchronization operations to enhance performance and reduce contention. - Enhanced enum string representations to handle out-of-range values gracefully. - Simplified `Equals` method in `IgnoreCaseSet` to improve efficiency by removing redundant checks.
…UpdateAction, and SyncStatus
…dd set operations - Updated logging statements in UpdateNRPLoadBalancers and UpdateNRPNATGateways to remove unnecessary newline characters. - Simplified the GetServicesToSync function by utilizing set operations for additions and removals. - Enhanced equality checks in DiffTracker, SyncDiffTrackerReturnType, and LocationData to cover more edge cases. - Introduced Difference method in IgnoreCaseSet for set difference operations. - Added unit tests for new functionality and edge cases in LocationData and DiffTracker equality checks. - Removed unused helper functions and improved code readability.
DeepEqual() had no production callers; GetSyncOperations uses the lock-free deepEqualLocked() body under a single held lock. Tests now call deepEqualLocked() directly.
The full-update branch builds a fresh address map, so guarding the delete with !isFullUpdate only skipped a no-op while reading as an inconsistency. Restore the unconditional delete: empty-ServiceRef addresses are always removed, and an all-empty location is dropped by the trailing cleanup.
Make the pod Remove path identity-aware: clear only the matching outbound identity and delete the pod only once it has no inbound or outbound identities left. Previously removePod deleted the whole pod record, silently dropping the inbound (LoadBalancer) backing of a pod that was both an egress pod and an LB backend when its egress was removed/changed. Fix the whole-node removal path in getSyncLocationsAddressesLocked: when a node is gone from K8s but still present in NRP, enumerate each NRP address with an empty ServiceRef instead of emitting an empty Addresses map. An empty Addresses map under PartialUpdate is a no-op on the Service Gateway, so the stale addresses were never removed while the local model dropped the location, leaving NRP and the SGW permanently out of sync. Also make the PublicOutboundIdentity comparison in DiffTracker.Equals case-insensitive for consistency with the rest of the package. Adds regression tests for both fixes and updates TestUpdateK8sPod to pass the egress identity on Remove (matching how the engine drives it).
Emitting an empty Addresses map under PartialUpdate already deletes the whole location on the Service Gateway (a null/empty addresses array deletes the location). Enumerating each address with an empty ServiceRef instead deletes the addresses but can leave an empty location container behind, which is strictly worse than the location-level deletion. Revert getSyncLocationsAddressesLocked's gone-node branch to emit an empty Addresses map and update the test accordingly. The egress-remove inbound-loss fix and the EqualFold change from the previous commit are kept.
The bare "ready" collides with the conventional Service readiness meaning (a Service is "ready" once its external IP is provisioned and serving traffic). isServiceReadyToSync names the actual intent - whether the service is ready to be synced to the Service Gateway - without overloading "ready".
Adds the Azure operations + resource-builder layer for Container LB: - azure_operations.go: PIP/LB/NAT Gateway CRUD + ServiceGateway service/location updates - resource_helpers.go: inbound/outbound resource builders, config extraction - dto_mappers.go: K8s updates -> ServiceGateway ServicesDataDTO mapper - types.go: service config + ServiceGateway DTO types (deferred from CP1) - consts.go: NatGatewayIDTemplate Adapted to public armnetwork/v9 SDK. Tests ported from the engine branch.
Address review findings in the Azure operations layer: - Idempotent deletes (treat 404 as success); delete LB by UID without a cluster-wide Service list. - Service request mapping: no empty PublicNatGatewayID on removals, error on unknown ServiceType, allow empty FullUpdate, warn on unknown update action. - LB rules: only set EnableTCPReset on TCP, case-insensitive protocol with rejection of unsupported ones, frontend/backend port range validation, apply configured idle timeout. - Reject named targetPort instead of silently misrouting. - Preserve dual-stack/hostname ingress when programming the LB IP and retry on conflict. - Make NAT gateway disassociation idempotent and independent per side. - Nil-guard createOrUpdatePIPWithResponse; return "" for missing ID segments.
a8a736b to
e603285
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: georgeedward2000 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
georgeedward2000
left a comment
There was a problem hiding this comment.
addressed comments
| @@ -0,0 +1,478 @@ | |||
| // Package difftracker provides state tracking and synchronization between Kubernetes | |||
There was a problem hiding this comment.
Done. Added the Apache license header to azure_operations.go, and also to resource_helpers.go which was missing it too.
| "strings" | ||
|
|
||
| "github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v9" | ||
| v1 "k8s.io/api/core/v1" |
There was a problem hiding this comment.
Well, based on the other files, we don’t separate them. Our goimports config sets local-prefixes to sigs.k8s.io/cloud-provider-azure, so there are three groups: stdlib, third party (github.com and k8s.io together), and our local module. This matches azure_loadbalancer.go and the rest of pkg/provider, so the current grouping is already correct and lint clean.
|
|
||
| // createOrUpdatePIP creates or updates a public IP address | ||
| func (dt *DiffTracker) createOrUpdatePIP(ctx context.Context, pipResourceGroup string, pip *armnetwork.PublicIPAddress) error { | ||
| _, err := dt.createOrUpdatePIPWithResponse(ctx, pipResourceGroup, pip) |
There was a problem hiding this comment.
Yes, the inbound path (in the engine PR) uses the response to read the allocated PIP IP and patch it into the Service status. The thin createOrUpdatePIP wrapper is used for the outbound path, which doesn’t require the IP. Both behaviors are intentional.
| } | ||
| klog.Infof("createOrUpdatePIPWithResponse(%s): start", pipName) | ||
|
|
||
| response, err := dt.networkClientFactory.GetPublicIPAddressClient().CreateOrUpdate(ctx, pipResourceGroup, pipName, *pip) |
There was a problem hiding this comment.
Copilot said:
Yes. The function keeps no shared mutable state: it only reads networkClientFactory, which is set once at construction, plus local variables. The Azure SDK clients are azcore-based and safe for concurrent use. On top of that, ServiceUpdater (engine PR) serializes per service (via the activeOperations map under a mutex), and each service has its own PIP name (uid-pip), so two goroutines never call CreateOrUpdate on the same PIP. Concurrency is also capped at 10 (currently) by the semaphore.
| klog.Infof("createOrUpdatePIPWithResponse(%s): start", pipName) | ||
|
|
||
| response, err := dt.networkClientFactory.GetPublicIPAddressClient().CreateOrUpdate(ctx, pipResourceGroup, pipName, *pip) | ||
| klog.V(10).Infof("PublicIPAddressClient.CreateOrUpdate(%s, %s): end", pipResourceGroup, pipName) |
There was a problem hiding this comment.
Good catch, the levels here are inconsistent. V(10) is effectively never printed, while the matching start log is at Infof (always on), so the pair is mismatched. I’ll standardize the per‑SDK call begin/end traces at V(4) to align with the rest of the package, keep failures at Warningf/Errorf, and reserve Infof/V(2) only for meaningful state changes.
|
|
||
| // Build LB rules and probes from config | ||
| var lbRules []*armnetwork.LoadBalancingRule | ||
| var probes []*armnetwork.Probe |
There was a problem hiding this comment.
Good point. Removed the probes variable entirely since it was always nil, and the Probes field now defaults to nil. I kept a short comment explaining that PodIP backend pools do not support health probes, which is why we do not create any.
| ruleName, frontendPort.Port, backendPort, frontendPort.Protocol, serviceUID) | ||
| } | ||
| } else { | ||
| klog.V(2).Infof("buildInboundServiceResources: no port configuration provided for service %s, creating LB without rules", serviceUID) |
There was a problem hiding this comment.
This is defensive now. After the named targetPort fix, a service with ports always yields frontend ports, and a LoadBalancer service always has at least one, so we no longer build a rule-less LB for real services. If this path were ever hit and ports were added later, UpdateService would diff the config and updateInboundService would rebuild the LB with rules, so it still reconciles correctly.
| } | ||
|
|
||
| // newIgnoreCaseSetFromSlice creates an IgnoreCaseSet from a slice of strings | ||
| func newIgnoreCaseSetFromSlice(items []string) *utilsets.IgnoreCaseSet { |
There was a problem hiding this comment.
Done. Replaced the helper with utilsets.NewString(serviceUID) at all call sites, since NewString is variadic, and removed newIgnoreCaseSetFromSlice. I also dropped its two tests, as NewString is already covered in pkg/util/sets.
|
|
||
| // GetServicesToSync handles the synchronization of services between K8s and NRP | ||
| func GetServicesToSync(k8sServices, nrpServices *utilsets.IgnoreCaseSet) SyncServicesReturnType { | ||
| klog.V(2).Infof("GetServicesToSync: K8s services (%d): %v", k8sServices.Len(), k8sServices.UnsortedList()) |
There was a problem hiding this comment.
This file is part of the base DiffTracker PR #10068, not the ServiceGateway Azure ops layer in this PR, so the fix should land there. The concern is valid though: the UnsortedList args are evaluated before the klog call regardless of verbosity, so the enumeration runs even when v=2 is off. The fix is to wrap these logs in if klog.V(2).Enabled() so we only enumerate when the line will actually print. I will track it against #10068.
| Addresses: make(map[string]Address), | ||
| } | ||
| } else { | ||
| locationData := findLocationData(result, location) |
There was a problem hiding this comment.
This file is part of the base DiffTracker PR #10068, not the ServiceGateway ops layer in this PR, so this should be addressed there. On the point itself, I agree: the helper is a one-line map lookup used in a single place, so it doesn’t add much value. I would inline it as loc, ok := result.Locations[location] and use ok directly. As a side note, it currently returns the address of the range copy rather than the map entry, so a comma‑ok lookup is also cleaner and safer. I will track this against #10068.
…prove logging verbosity, and update tests for backend port validation
…service gateway removal DTO
…corresponding test
Draft - stacked PR. Based on #10068 (CLB 2nd PR,DiffTracker core) and #10543 (azclient v0.20.9 bump), neither yet in master. Please review only the top commit
CP2: ServiceGatewayAzure operations layer(pkg/consts+pkg/provider/difftracker/*). The vendoredazclientchurn and DiffTracker-core files belong to the base PRs and will disappear once they merge and this branch is rebased onto master.What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds the ServiceGateway Azure operations layer for Container/Service Load Balancer: Public IP, Load Balancer, and NAT Gateway CRUD plus ServiceGateway service and address-location updates, along with the DTO types and mappers that translate DiffTracker sync output into Azure SDK requests.
Which issue(s) this PR fixes:
Part of #10428
Special notes for your reviewer:
Stacked on #10068 and #10543. Will be rebased onto master (dropping the base commits) once those merge, leaving only the files of this operations layer.
Does this PR introduce a user-facing change?