fix(source): skip sources that fail to initialize instead of crashing #6185

Open
nicholasklem wants to merge 1 commit into kubernetes-sigs:master from nicholasklem:fix/graceful-source-init
@nicholasklem nicholasklem commented Feb 10, 2026

Summary

When external-dns is configured with sources whose CRDs are not installed (e.g., --source=gateway-grpcroute on a cluster without Gateway API), it crashes on startup with a 60-second informer cache sync timeout followed by log.Fatal.

This changes ByNames() to log a warning and skip sources that fail to initialize, rather than treating all init errors as fatal.

  • Source init fails → warn and skip
  • Unknown source name (ErrSourceNotFound) → hard fail (catches typos)
  • Zero sources initialized → hard fail (catches total misconfiguration)

No new types, no new packages, no hardcoded source classification. ~12 lines of production code.

Motivation

This is a common problem when deploying external-dns via Helm/ArgoCD across clusters with heterogeneous CRD installations. The Helm chart configures multiple source types, but not every cluster has every CRD installed. Today, external-dns crashloops on those clusters instead of running with the sources that work.

Addresses issue 4768, related to issue 3845.

Example

Tested on a real GKE cluster with Gateway API installed but no Istio CRDs:

$ external-dns --source=service --source=istio-gateway --once --dry-run ...

level=info  msg="Created Kubernetes client https://..."
level=warning msg="Skipping source \"istio-gateway\" due to initialization error: failed to sync *v1beta1.Gateway after 1m0s: context deadline exceeded"

Before this change, that warning would have been level=fatal and external-dns would have exited.

Known trade-offs

Informer timeout latency. When a source depends on a missing CRD, the informer cache sync blocks for up to 60 seconds before timing out. The skip happens after that timeout, not immediately. This is an upstream informer behavior, not something we can easily change here.

Silent degradation. If an operator explicitly configures --source=istio-gateway expecting it to work, the source is skipped with only a warning log. They won't discover the problem until they notice missing DNS records. Possible future improvements:

  • Startup summary log listing which sources initialized vs skipped
  • --required-sources flag for strict mode on specific sources
  • Faster CRD detection via API discovery instead of waiting for informer timeout

These are follow-ups; the new behavior (warn and skip) is strictly better than the current behavior (crash and lose all sources).
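
For discussion, a strict mode along the lines of the proposed --required-sources flag could look roughly like the sketch below. The flag does not exist today, and `checkRequired` is a hypothetical helper, not part of this PR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// checkRequired returns an error naming every required source that is
// absent from the list of successfully initialized sources.
// Hypothetical helper for a possible --required-sources follow-up.
func checkRequired(required, initialized []string) error {
	ok := make(map[string]struct{}, len(initialized))
	for _, name := range initialized {
		ok[name] = struct{}{}
	}
	var missing []string
	for _, name := range required {
		if _, found := ok[name]; !found {
			missing = append(missing, name)
		}
	}
	if len(missing) == 0 {
		return nil
	}
	sort.Strings(missing) // stable, readable error message
	return fmt.Errorf("required sources failed to initialize: %s", strings.Join(missing, ", "))
}
```

Operators who need fail-fast semantics for specific sources could then opt in per source, while other sources stay best-effort.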

Test plan

  • TestOptionalSourceSkipped — service + gateway, gateway fails → 1 source returned
  • TestAllOptionalSourcesSkipped — only gateways, all fail → error
  • TestSingleSourceFailsReturnsError — single source fails alone → error
  • TestMixedSourcesPartialInit — service + istio, istio fails → 1 source
  • TestOptionalSourceSkippedLogsWarning — warning logged with source name
  • TestAllSourceTypesHandled — every known type handled by BuildWithConfig
  • TestUnknownSourceStillFails — unknown name → ErrSourceNotFound
  • TestEmptySourcesList — empty config → error
  • All existing ByNames suite tests pass unchanged
go test ./source/... -v -count=1   # all pass
go vet ./source/...                 # clean (pre-existing warning in gateway_httproute_test.go only)

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. labels Feb 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ivankatliarchuk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @nicholasklem!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 10, 2026
@k8s-ci-robot
Contributor

Hi @nicholasklem. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 10, 2026
When external-dns is configured with sources whose CRDs are not
installed (e.g., gateway-grpcroute on a cluster without Gateway API),
it crashes on startup with a 60-second informer cache sync timeout.

This changes ByNames() to log a warning and skip sources that fail to
initialize, rather than treating all init errors as fatal. Unknown
source names (ErrSourceNotFound) still hard-fail immediately. If no
sources initialize successfully, an error is returned.

This is the minimal change: ~12 lines in ByNames() with no new types,
no new packages, and no hardcoded source classification.
@nicholasklem nicholasklem force-pushed the fix/graceful-source-init branch from 7b722eb to 2314d9f Compare February 10, 2026 07:56
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Feb 10, 2026
@nicholasklem nicholasklem changed the title fix(source): skip sources that fail to initialize instead of crashing [WIP] fix(source): skip sources that fail to initialize instead of crashing Feb 10, 2026
@nicholasklem nicholasklem changed the title [WIP] fix(source): skip sources that fail to initialize instead of crashing fix(source): skip sources that fail to initialize instead of crashing Feb 10, 2026
@nicholasklem nicholasklem marked this pull request as ready for review February 10, 2026 08:15
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2026
@k8s-ci-robot k8s-ci-robot requested a review from vflaux February 10, 2026 08:15
@ivankatliarchuk
Member

Thanks for the PR.

This is very controversial. I don’t think external-dns should conditionally tolerate user misconfiguration. Failing fast and explicitly on invalid or unsupported sources is preferable to silently skipping them. Otherwise we risk masking real configuration issues and making failures harder to diagnose.

The issue with Istio might not be just a missing CRD; it could also be RBAC or an admission webhook configuration on the cluster.

If someone tries to run an unsupported source, external-dns must crash. We are not going to tolerate misconfiguration.

Here are concrete practices users should follow to avoid these failures in production:

  • Avoid relying on "best-effort" behavior
  • Validate configuration early
  • Explicitly scope enabled sources
  • Install CRDs before enabling dependent sources
  • Fail fast in CI
  • Use minimal RBAC and API access checks

Net: production safety should come from clear configs, preflight validation, and environment parity, not from external-dns trying to guess what the user meant.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. source

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants