perf: reduce K8s API calls during DNS server startup by shess-macu · Pull Request #81 · cyclops-k8s/cyclops-multicluster

shess-macu · 2026-04-20T02:55:48Z

Problem

DNS server pods were crash-looping due to health check timeouts. During startup, the DNS resolver was making ~455 Kubernetes API list calls per boot, each listing all 227 HostnameCache CRDs. With crash-restart loops this escalated to 7k+ calls, saturating the API client and preventing health probes from responding in time.

Root cause

Two code paths each did a full ListAsync<V1HostnameCache>() per hostname:

ResyncAsync() (init + every 30s): called GetHostnamesAsync() (1 list-all), then for each of the 227 hostnames called GetHostInformationAsync() → GetOrCreateHostnameCache() → another list-all, for 1 + 227 = 228 calls
KubeOps reconciler: fired for all 227 HostnameCache CRDs at startup, each triggering RefreshHostInformationAsync() → the same list-all chain, for 227 more calls

Changes

Bulk fetch in `ResyncAsync` (228 calls → 1)

Added GetAllHostInformationAsync() to IKubernetesCache, ICache, KubernetesApiCache, and MemoryCache
Performs a single ListAsync<V1HostnameCache>() and returns all hosts as a dictionary
ResyncAsync() now uses this bulk method and passes pre-loaded Host data into RefreshHostInformationAsync()
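
A minimal sketch of the call-count difference described above, not the project's actual code. `HostInfo`, `CountingApi`, `CacheSketch`, and `ResyncSketch` are hypothetical stand-ins; only the method names `GetHostnamesAsync`, `GetHostInformationAsync`, and `GetAllHostInformationAsync` come from this PR, and their real signatures may differ. The fake API simply counts list-all calls in place of `ListAsync<V1HostnameCache>()`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for the project's Host type; only what this sketch needs.
public sealed record HostInfo(string Hostname, string[] Addresses);

// Fake API client that counts list-all calls instead of talking to Kubernetes.
public sealed class CountingApi
{
    private readonly List<HostInfo> _stored;
    public int ListCalls { get; private set; }

    public CountingApi(IEnumerable<HostInfo> stored) => _stored = stored.ToList();

    // Stands in for ListAsync<V1HostnameCache>(): every call returns every resource.
    public Task<IReadOnlyList<HostInfo>> ListAllAsync()
    {
        ListCalls++;
        return Task.FromResult<IReadOnlyList<HostInfo>>(_stored);
    }
}

public sealed class CacheSketch
{
    private readonly CountingApi _api;
    public CacheSketch(CountingApi api) => _api = api;

    // Old shape: list everything, keep only the names.
    public async Task<string[]> GetHostnamesAsync() =>
        (await _api.ListAllAsync()).Select(h => h.Hostname).ToArray();

    // Old shape: list everything again, then filter down to one hostname.
    public async Task<HostInfo?> GetHostInformationAsync(string hostname) =>
        (await _api.ListAllAsync()).FirstOrDefault(h => h.Hostname == hostname);

    // New shape: one list call, indexed by hostname.
    public async Task<Dictionary<string, HostInfo>> GetAllHostInformationAsync() =>
        (await _api.ListAllAsync()).ToDictionary(h => h.Hostname);
}

public static class ResyncSketch
{
    // Per-hostname resync: 1 call for the names plus 1 call per hostname.
    public static async Task<int> ResyncPerHostAsync(CacheSketch cache)
    {
        var refreshed = 0;
        foreach (var name in await cache.GetHostnamesAsync())
        {
            if (await cache.GetHostInformationAsync(name) is not null) refreshed++;
        }
        return refreshed;
    }

    // Bulk resync: 1 call in total, regardless of the number of hostnames.
    public static async Task<int> ResyncBulkAsync(CacheSketch cache)
    {
        var all = await cache.GetAllHostInformationAsync();
        return all.Count;
    }
}
```

With 227 fake entries, `ResyncPerHostAsync` leaves `ListCalls` at 228 and `ResyncBulkAsync` leaves it at 1, which matches the numbers in this section.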

Reconciler passes entity data (227 calls → 0)

Updated OnHostChangedAsyncDelegate to accept optional Host? hostInformation
K8sHostnameCacheController.ReconcileAsync() now builds the Host from the already-available entity and passes it through the queue
DefaultDnsResolver.RefreshHostInformationAsync() skips the API call when pre-loaded data is provided
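
The optional-argument pattern above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `HostInfo`, `OnHostChangedSketch`, and `ResolverSketch` are hypothetical names, and the real `OnHostChangedAsyncDelegate` and `RefreshHostInformationAsync()` signatures are not shown on this page. The point is only that a caller holding the entity can hand it over, and the handler falls back to a fetch when nothing was passed.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the project's Host type.
public sealed record HostInfo(string Hostname, string[] Addresses);

// Assumed shape of the updated delegate: the second argument is optional.
public delegate Task OnHostChangedSketch(string hostname, HostInfo? hostInformation = null);

public sealed class ResolverSketch
{
    // Hypothetical fetch function standing in for the Kubernetes-backed cache lookup.
    private readonly Func<string, Task<HostInfo?>> _fetchFromApi;

    public int ApiFetches { get; private set; }
    public HostInfo? LastApplied { get; private set; }

    public ResolverSketch(Func<string, Task<HostInfo?>> fetchFromApi) =>
        _fetchFromApi = fetchFromApi;

    // Uses the caller's data when present; fetches only when it is missing.
    public async Task RefreshHostInformationAsync(string hostname, HostInfo? preloaded = null)
    {
        if (preloaded is null)
        {
            ApiFetches++;
            preloaded = await _fetchFromApi(hostname);
        }
        LastApplied = preloaded;
    }
}
```

Keeping the parameter optional means existing callers that have no entity in hand keep working unchanged, while the reconciler path avoids the fetch entirely.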

Bug fix

Fully qualified Host type references to resolve ambiguity with Microsoft.Extensions.Hosting.Host

Impact

| Phase | Before | After |
| --- | --- | --- |
| `ResyncAsync` (init + every 30s) | 228 list-all calls | 1 list-all call |
| KubeOps reconciler (227 events) | 227 list-all calls | 0 API calls |
| Per startup total | ~455 list-all calls | ~1 list-all call |

Files changed (10)

Services/IKubernetesCache.cs — added GetAllHostInformationAsync()
Services/ICache.cs — added GetAllHostInformationAsync()
Services/Default/KubernetesApiCache.cs — implemented bulk fetch
Services/Default/DefaultCache.cs — passthrough
Services/Default/DefaultDnsResolver.cs — refactored ResyncAsync and RefreshHostInformationAsync
Services/IQueue.cs — updated delegate signature
Services/Default/KubernetesQueue.cs — updated default delegate
Controllers/K8sHostnameCacheController.cs — pass entity data on reconcile
Tests: K8sHostnameCacheControllerTests.cs, KubernetesQueueTests.cs — updated delegate lambdas

Optimize the DNS server initialization to eliminate redundant Kubernetes API calls that were causing health check timeouts and crash loops.

Changes:
- Add GetAllHostInformationAsync() to IKubernetesCache, ICache, KubernetesApiCache, and MemoryCache. This performs a single ListAsync<V1HostnameCache>() and returns all hosts as a dictionary.
- Refactor DefaultDnsResolver.ResyncAsync() to use the bulk fetch instead of calling GetHostInformationAsync() per hostname. Each per-hostname call was doing a full list-all-then-filter, so 227 hostnames meant 228 API calls. Now it is 1 call.
- Update OnHostChangedAsyncDelegate to accept optional Host data so callers can pass pre-fetched entity information.
- Update K8sHostnameCacheController.ReconcileAsync() to build Host data from the already-available V1HostnameCache entity and pass it through the queue, eliminating 227 API re-fetch calls at startup.
- Fully qualify Host references to avoid ambiguity with Microsoft.Extensions.Hosting.Host.
- Update tests to match new delegate signature.

The liveness probe was calling the Kubernetes API (GetAsync<V1Namespace>) every 10 seconds to verify connectivity. With a 1-second probe timeout on operator/orchestrator pods, any K8s API latency spike caused the probe to time out, killing the pod. Liveness probes should only verify the process is alive, not check external dependencies. Changed LivenessAsync to simply return 200 OK. The readiness probe retains the one-time K8s API check (cached after first success via the static _ready flag). Removed unused _dateTimeProvider and _lastUp fields from HealthzController.
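
A rough sketch of the probe split described above, written as a plain class rather than the real `HealthzController`. `HealthSketch`, the injected `checkApi` function, and the 503 status on failure are assumptions for illustration; the commit message only states that liveness returns 200 OK and that readiness caches its first successful check in a static `_ready` flag (an instance field is used here to keep the sketch easy to test).

```csharp
using System;
using System.Threading.Tasks;

public sealed class HealthSketch
{
    // Hypothetical dependency check; the PR mentions GetAsync<V1Namespace> in this role.
    private readonly Func<Task> _checkApi;

    // The PR describes a static flag; an instance field is an assumption of this sketch.
    private bool _ready;

    public int ApiChecks { get; private set; }

    public HealthSketch(Func<Task> checkApi) => _checkApi = checkApi;

    // Liveness: the process is running, so answer 200 without any external call.
    public Task<int> LivenessAsync() => Task.FromResult(200);

    // Readiness: check the dependency until it succeeds once, then serve the cached result.
    public async Task<int> ReadinessAsync()
    {
        if (_ready) return 200;
        try
        {
            ApiChecks++;
            await _checkApi();
            _ready = true;
            return 200;
        }
        catch (Exception)
        {
            // Assumed failure status; the real controller's response is not shown here.
            return 503;
        }
    }
}
```

Because `LivenessAsync` never touches the dependency, an API latency spike can delay readiness but cannot cause the kubelet to restart the pod.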

Kubernetes strips exclusiveMinimum: false from stored CRDs since it is the JSON Schema default value. This caused ArgoCD to show a perpetual diff. Removed from both cyclops and vecc GSLB CRD definitions. The minimum: 0.0 constraint is unaffected.

spencerhess1 added 3 commits April 19, 2026 20:43

jondrusek-macu reviewed Apr 20, 2026

Comment thread src/Cyclops.MultiCluster/Services/Default/KubernetesApiCache.cs

jondrusek-macu reviewed Apr 20, 2026

Comment thread src/Cyclops.MultiCluster/Services/ICache.cs Outdated

jondrusek-macu reviewed Apr 20, 2026

Comment thread src/Cyclops.MultiCluster/Services/ICache.cs Outdated

updated based off pr comments

0603ede

jondrusek-macu approved these changes Apr 20, 2026

nbarber-macu reviewed Apr 20, 2026

Comment thread src/Cyclops.MultiCluster/Controllers/HealthzController.cs

updated liveness probe comment

1701792

nbarber-macu reviewed Apr 20, 2026

Comment thread src/Cyclops.MultiCluster/Controllers/HealthzController.cs Outdated

fixed comment

2236b0f

nbarber-macu approved these changes Apr 20, 2026

shess-macu merged commit 658b513 into cyclops-k8s:main Apr 21, 2026
2 checks passed

shess-macu deleted the shess/reducecalls branch April 21, 2026 22:44

shess-macu merged 6 commits into cyclops-k8s:main from shess-macu:shess/reducecalls
