fix(source): skip sources that fail to initialize instead of crashing#6185
fix(source): skip sources that fail to initialize instead of crashing#6185nicholasklem wants to merge 1 commit intokubernetes-sigs:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @nicholasklem! |
|
Hi @nicholasklem. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
When external-dns is configured with sources whose CRDs are not installed (e.g., gateway-grpcroute on a cluster without Gateway API), it crashes on startup with a 60-second informer cache sync timeout. This changes ByNames() to log a warning and skip sources that fail to initialize, rather than treating all init errors as fatal. Unknown source names (ErrSourceNotFound) still hard-fail immediately. If no sources initialize successfully, an error is returned. This is the minimal change: ~12 lines in ByNames() with no new types, no new packages, and no hardcoded source classification.
7b722eb to
2314d9f
Compare
|
Thanks for PR. This is very controversial. I don’t think external-dns should conditionally tolerate user misconfiguration. Failing fast and explicitly on invalid or unsupported sources is preferable to silently skipping them. Otherwise we risk masking real configuration issues and making failures harder to diagnose. The issue with istio could be not just CRD is missing, but RBAC or some addmission webhook configuration on the cluster. If someone will try to execute unsupported source, external dns must crash. We are not going to tolerate missconfiguration. Here are concrete practices users should follow to avoid these failures in production:
Net: production safety should come from clear configs, preflight validation, and environment parity, not from external-dns trying to guess what the user meant. |
Summary
When external-dns is configured with sources whose CRDs are not installed (e.g.,
--source=gateway-grpcrouteon a cluster without Gateway API), it crashes on startup with a 60-second informer cache sync timeout followed bylog.Fatal.This changes
ByNames()to log a warning and skip sources that fail to initialize, rather than treating all init errors as fatal.ErrSourceNotFound) → hard fail (catches typos)No new types, no new packages, no hardcoded source classification. ~12 lines of production code.
Motivation
This is a common problem when deploying external-dns via Helm/ArgoCD across clusters with heterogeneous CRD installations. The Helm chart configures multiple source types, but not every cluster has every CRD installed. Today, external-dns crashloops on those clusters instead of running with the sources that work.
Addresses issue 4768, related to issue 3845.
Example
Tested on a real GKE cluster with Gateway API installed but no Istio CRDs:
Before this change, that warning would have been
level=fataland external-dns would have exited.Known trade-offs
Informer timeout latency. When a source depends on a missing CRD, the informer cache sync blocks for up to 60 seconds before timing out. The skip happens after that timeout, not immediately. This is an upstream informer behavior, not something we can easily change here.
Silent degradation. If an operator explicitly configures
--source=istio-gatewayexpecting it to work, the source is skipped with only a warning log. They won't discover the problem until they notice missing DNS records. Possible future improvements:--required-sourcesflag for strict mode on specific sourcesThese are follow-ups — the current behavior (warn and skip) is strictly better than the current behavior (crash and lose all sources).
Test plan
TestOptionalSourceSkipped— service + gateway, gateway fails → 1 source returnedTestAllOptionalSourcesSkipped— only gateways, all fail → errorTestSingleSourceFailsReturnsError— single source fails alone → errorTestMixedSourcesPartialInit— service + istio, istio fails → 1 sourceTestOptionalSourceSkippedLogsWarning— warning logged with source nameTestAllSourceTypesHandled— every known type handled by BuildWithConfigTestUnknownSourceStillFails— unknown name → ErrSourceNotFoundTestEmptySourcesList— empty config → errorByNamessuite tests pass unchanged