Skip to content

perf: reduce K8s API calls during DNS server startup#81

Merged
shess-macu merged 6 commits intocyclops-k8s:mainfrom
shess-macu:shess/reducecalls
Apr 21, 2026
Merged

perf: reduce K8s API calls during DNS server startup#81
shess-macu merged 6 commits intocyclops-k8s:mainfrom
shess-macu:shess/reducecalls

Conversation

@shess-macu
Copy link
Copy Markdown
Contributor

Problem

DNS server pods were crash-looping due to health check timeouts. During startup, the DNS resolver was making ~455 Kubernetes API list calls per boot, each listing all 227 HostnameCache CRDs. With crash-restart loops this escalated to 7k+ calls, saturating the API client and preventing health probes from responding in time.

Root cause

Two code paths each did a full ListAsync<V1HostnameCache>() per hostname:

  1. ResyncAsync() (init + every 30s) — called GetHostnamesAsync() (1 list-all), then for each of 227 hostnames called GetHostInformationAsync()GetOrCreateHostnameCache() → another list-all = 228 calls
  2. KubeOps reconciler — fired for all 227 HostnameCache CRDs at startup, each triggering RefreshHostInformationAsync() → same list-all chain = 227 more calls

Changes

Bulk fetch in ResyncAsync (228 calls → 1)

  • Added GetAllHostInformationAsync() to IKubernetesCache, ICache, KubernetesApiCache, and MemoryCache
  • Performs a single ListAsync<V1HostnameCache>() and returns all hosts as a dictionary
  • ResyncAsync() now uses this bulk method and passes pre-loaded Host data into RefreshHostInformationAsync()

Reconciler passes entity data (227 calls → 0)

  • Updated OnHostChangedAsyncDelegate to accept optional Host? hostInformation
  • K8sHostnameCacheController.ReconcileAsync() now builds the Host from the already-available entity and passes it through the queue
  • DefaultDnsResolver.RefreshHostInformationAsync() skips the API call when pre-loaded data is provided

Bug fix

  • Fully qualified Host type references to resolve ambiguity with Microsoft.Extensions.Hosting.Host

Impact

Phase Before After
ResyncAsync (init + every 30s) 228 list-all calls 1 list-all call
KubeOps reconciler (227 events) 227 list-all calls 0 API calls
Per startup total ~455 list-all calls ~1 list-all call

Files changed (10)

  • Services/IKubernetesCache.cs — added GetAllHostInformationAsync()
  • Services/ICache.cs — added GetAllHostInformationAsync()
  • Services/Default/KubernetesApiCache.cs — implemented bulk fetch
  • Services/Default/DefaultCache.cs — passthrough
  • Services/Default/DefaultDnsResolver.cs — refactored ResyncAsync and RefreshHostInformationAsync
  • Services/IQueue.cs — updated delegate signature
  • Services/Default/KubernetesQueue.cs — updated default delegate
  • Controllers/K8sHostnameCacheController.cs — pass entity data on reconcile
  • Tests: K8sHostnameCacheControllerTests.cs, KubernetesQueueTests.cs — updated delegate lambdas

Optimize the DNS server initialization to eliminate redundant Kubernetes
API calls that were causing health check timeouts and crash loops.

Changes:
- Add GetAllHostInformationAsync() to IKubernetesCache, ICache,
  KubernetesApiCache, and MemoryCache. This performs a single
  ListAsync<V1HostnameCache>() and returns all hosts as a dictionary.
- Refactor DefaultDnsResolver.ResyncAsync() to use the bulk fetch
  instead of calling GetHostInformationAsync() per hostname. Each
  per-hostname call was doing a full list-all-then-filter, so 227
  hostnames meant 228 API calls. Now it is 1 call.
- Update OnHostChangedAsyncDelegate to accept optional Host data so
  callers can pass pre-fetched entity information.
- Update K8sHostnameCacheController.ReconcileAsync() to build Host
  data from the already-available V1HostnameCache entity and pass it
  through the queue, eliminating 227 API re-fetch calls at startup.
- Fully qualify Host references to avoid ambiguity with
  Microsoft.Extensions.Hosting.Host.
- Update tests to match new delegate signature.
The liveness probe was calling the Kubernetes API (GetAsync<V1Namespace>)
every 10 seconds to verify connectivity. With a 1-second probe timeout
on operator/orchestrator pods, any K8s API latency spike caused the
probe to time out, killing the pod.

Liveness probes should only verify the process is alive, not check
external dependencies. Changed LivenessAsync to simply return 200 OK.

The readiness probe retains the one-time K8s API check (cached after
first success via the static _ready flag).

Removed unused _dateTimeProvider and _lastUp fields from HealthzController.
Kubernetes strips exclusiveMinimum: false from stored CRDs since it is
the JSON Schema default value. This caused ArgoCD to show a perpetual
diff. Removed from both cyclops and vecc GSLB CRD definitions.

The minimum: 0.0 constraint is unaffected.
Comment thread src/Cyclops.MultiCluster/Services/Default/KubernetesApiCache.cs
Comment thread src/Cyclops.MultiCluster/Services/ICache.cs Outdated
Comment thread src/Cyclops.MultiCluster/Services/ICache.cs Outdated
Comment thread src/Cyclops.MultiCluster/Controllers/HealthzController.cs
Comment thread src/Cyclops.MultiCluster/Controllers/HealthzController.cs Outdated
@shess-macu shess-macu merged commit 658b513 into cyclops-k8s:main Apr 21, 2026
2 checks passed
@shess-macu shess-macu deleted the shess/reducecalls branch April 21, 2026 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants