perf: reduce K8s API calls during DNS server startup #81
Merged
shess-macu merged 6 commits into cyclops-k8s:main on Apr 21, 2026
Conversation
Optimize the DNS server initialization to eliminate redundant Kubernetes API calls that were causing health check timeouts and crash loops.

Changes:
- Add GetAllHostInformationAsync() to IKubernetesCache, ICache, KubernetesApiCache, and MemoryCache. This performs a single ListAsync<V1HostnameCache>() and returns all hosts as a dictionary.
- Refactor DefaultDnsResolver.ResyncAsync() to use the bulk fetch instead of calling GetHostInformationAsync() per hostname. Each per-hostname call was doing a full list-all-then-filter, so 227 hostnames meant 228 API calls. Now it is 1 call.
- Update OnHostChangedAsyncDelegate to accept optional Host data so callers can pass pre-fetched entity information.
- Update K8sHostnameCacheController.ReconcileAsync() to build Host data from the already-available V1HostnameCache entity and pass it through the queue, eliminating 227 API re-fetch calls at startup.
- Fully qualify Host references to avoid ambiguity with Microsoft.Extensions.Hosting.Host.
- Update tests to match the new delegate signature.
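The bulk-fetch change can be sketched roughly as follows. The interface shape, the `Models.Host` namespace, and the exact parameter names are assumptions based on the description above, not the actual code:

```csharp
// Sketch only: signatures inferred from the PR description.
public interface IKubernetesCache
{
    // Existing per-hostname lookup (previously did a full list-all per call).
    Task<Models.Host?> GetHostInformationAsync(string hostname, CancellationToken ct = default);

    // New bulk method: one ListAsync<V1HostnameCache>() covering every host.
    Task<IReadOnlyDictionary<string, Models.Host>> GetAllHostInformationAsync(CancellationToken ct = default);
}

// DefaultDnsResolver.ResyncAsync after the refactor (simplified):
public async Task ResyncAsync(CancellationToken ct)
{
    // One API list call instead of 1 + N.
    var allHosts = await _cache.GetAllHostInformationAsync(ct);
    foreach (var (hostname, host) in allHosts)
    {
        // Pre-loaded Host data is passed down so no per-host API call is needed.
        await RefreshHostInformationAsync(hostname, host, ct);
    }
}
```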
The liveness probe was calling the Kubernetes API (GetAsync<V1Namespace>) every 10 seconds to verify connectivity. With a 1-second probe timeout on operator/orchestrator pods, any K8s API latency spike caused the probe to time out, killing the pod. Liveness probes should only verify the process is alive, not check external dependencies. Changed LivenessAsync to simply return 200 OK. The readiness probe retains the one-time K8s API check (cached after first success via the static _ready flag). Removed unused _dateTimeProvider and _lastUp fields from HealthzController.
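A minimal sketch of the resulting controller, assuming ASP.NET Core attribute routing and a client exposing the `GetAsync<V1Namespace>` call mentioned above; the routes and parameter names are illustrative:

```csharp
[ApiController]
public class HealthzController : ControllerBase
{
    // Cached after the first successful Kubernetes API check.
    private static bool _ready;

    // Liveness: process-alive only; no external dependency can make it time out.
    [HttpGet("/healthz/liveness")]
    public IActionResult LivenessAsync() => Ok();

    // Readiness: one-time Kubernetes API connectivity check, then cached.
    [HttpGet("/healthz/readiness")]
    public async Task<IActionResult> ReadinessAsync([FromServices] IKubernetesClient client)
    {
        if (!_ready)
        {
            // Throws (and so returns non-200) if the API server is unreachable.
            await client.GetAsync<V1Namespace>("default");
            _ready = true;
        }
        return Ok();
    }
}
```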
Kubernetes strips exclusiveMinimum: false from stored CRDs since it is the JSON Schema default value. This caused ArgoCD to show a perpetual diff. Removed from both cyclops and vecc GSLB CRD definitions. The minimum: 0.0 constraint is unaffected.
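For illustration, the stored schema after the change might look like this. The `weight` field and its path are hypothetical; only the `minimum`/`exclusiveMinimum` detail comes from the PR:

```yaml
# exclusiveMinimum: false is the JSON Schema default, so the API server
# strips it from stored CRDs; keeping it in Git caused a perpetual ArgoCD diff.
weight:
  type: number
  minimum: 0.0              # retained constraint
  # exclusiveMinimum: false   <- removed from the manifest
```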
jondrusek-macu approved these changes on Apr 20, 2026
nbarber-macu approved these changes on Apr 20, 2026
Problem
DNS server pods were crash-looping due to health check timeouts. During startup, the DNS resolver was making ~455 Kubernetes API list calls per boot, each listing all 227 HostnameCache CRDs. With crash-restart loops this escalated to 7k+ calls, saturating the API client and preventing health probes from responding in time.
Root cause

Two code paths each did a full ListAsync<V1HostnameCache>() per hostname:
- ResyncAsync() (init + every 30s) called GetHostnamesAsync() (1 list-all), then for each of 227 hostnames called GetHostInformationAsync() → GetOrCreateHostnameCache() → another list-all = 228 calls
- K8sHostnameCacheController.ReconcileAsync() processed all 227 HostnameCache CRDs at startup, each triggering RefreshHostInformationAsync() → the same list-all chain = 227 more calls
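The old per-hostname chain can be sketched like this; the method body is an assumption, while the call counts come from the description above:

```csharp
// Per boot, under the old code:
//   ResyncAsync: 1 (GetHostnamesAsync) + 227 (one per hostname) = 228 lists
//   Reconciler:  227 (one per reconciled CRD)                   = 227 lists
//   Total:       ~455 list calls, each returning all 227 CRDs.
private async Task<V1HostnameCache?> GetOrCreateHostnameCache(string hostname)
{
    // The N+1 culprit: a full list-all on every call, filtered client-side.
    var all = await _client.ListAsync<V1HostnameCache>();
    return all.FirstOrDefault(c => c.Metadata.Name == hostname);
}
```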
Changes

Bulk fetch in ResyncAsync (228 calls → 1)
- Added GetAllHostInformationAsync() to IKubernetesCache, ICache, KubernetesApiCache, and MemoryCache
- Performs a single ListAsync<V1HostnameCache>() and returns all hosts as a dictionary
- ResyncAsync() now uses this bulk method and passes pre-loaded Host data into RefreshHostInformationAsync()

Reconciler passes entity data (227 calls → 0)
- Updated OnHostChangedAsyncDelegate to accept an optional Host? hostInformation
- K8sHostnameCacheController.ReconcileAsync() now builds the Host from the already-available entity and passes it through the queue
- DefaultDnsResolver.RefreshHostInformationAsync() skips the API call when pre-loaded data is provided
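The delegate and reconciler changes might look roughly like this; the parameter order, the `MapToHost` helper, and the queue method name are assumptions:

```csharp
// Updated delegate: callers may pass pre-fetched entity data.
public delegate Task OnHostChangedAsyncDelegate(
    string hostname,
    Models.Host? hostInformation,   // null → resolver fetches as before
    CancellationToken ct);

// K8sHostnameCacheController.ReconcileAsync (simplified):
public async Task ReconcileAsync(V1HostnameCache entity, CancellationToken ct)
{
    // Build Host from the entity we already hold; no API re-fetch.
    Models.Host host = MapToHost(entity);   // hypothetical mapping helper
    await _queue.EnqueueHostChangedAsync(entity.Metadata.Name, host, ct); // assumed queue API
}
```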
Bug fix

Fully qualified Host type references to resolve ambiguity with Microsoft.Extensions.Hosting.Host
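The ambiguity arises because `using Microsoft.Extensions.Hosting;` brings another `Host` type into scope; fully qualifying the domain type resolves it (the project namespace below is hypothetical):

```csharp
using Microsoft.Extensions.Hosting;   // brings Microsoft.Extensions.Hosting.Host into scope

// Ambiguous: which Host?
// Host? host = await cache.GetHostInformationAsync(hostname);

// Unambiguous (project namespace assumed for illustration):
MyProject.Models.Host? host = await cache.GetHostInformationAsync(hostname);
```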
Impact

- ResyncAsync (init + every 30s): 228 list calls → 1
- Reconcile at startup: 227 calls → 0
Files changed (10)
- Services/IKubernetesCache.cs: added GetAllHostInformationAsync()
- Services/ICache.cs: added GetAllHostInformationAsync()
- Services/Default/KubernetesApiCache.cs: implemented bulk fetch
- Services/Default/DefaultCache.cs: passthrough
- Services/Default/DefaultDnsResolver.cs: refactored ResyncAsync and RefreshHostInformationAsync
- Services/IQueue.cs: updated delegate signature
- Services/Default/KubernetesQueue.cs: updated default delegate
- Controllers/K8sHostnameCacheController.cs: pass entity data on reconcile
- K8sHostnameCacheControllerTests.cs, KubernetesQueueTests.cs: updated delegate lambdas