
Potential memory leak in v0.20.0 on linux/arm64 with webhook provider #5965

@lexfrei

Description

I'm observing non-deterministic memory growth in external-dns v0.20.0 on linux/arm64. The external-dns container's memory increases from ~14Mi to ~90Mi (roughly a 6x increase) during initialization and stays elevated until the pod is restarted.

I observed this issue several times initially, but have been unable to reproduce it since. The non-deterministic nature suggests a possible race condition or timing-dependent issue.

Environment

External-DNS:

  • Version: v0.20.0
  • Platform: linux/arm64
  • Image: registry.k8s.io/external-dns/external-dns:v0.20.0

Configuration (as reported by the running pod):

Sources: [gateway-httproute service]
Interval: 1m0s
MinEventSyncInterval: 5s
Policy: sync
Registry: txt
TXTOwnerID: unifi
Provider: webhook
ProviderCacheTime: 0s
WebhookProviderURL: http://localhost:8888
WebhookProviderReadTimeout: 5s
WebhookProviderWriteTimeout: 10s
AnnotationPrefix: internal-dns/
LogLevel: info
LogFormat: json
MetricsAddress: :7979
DomainFilter: []
ManagedDNSRecordTypes: [A AAAA CNAME]

Kubernetes:

  • Deployment with 2 containers (external-dns + webhook provider)
  • Webhook provider memory: stable 33-34Mi (NOT affected)
  • DNS records managed: ~10 A records

Expected Behavior

External-DNS memory should remain stable at around 14-18Mi, just as the webhook provider container in the same pod stays consistently at 33-34Mi.

Actual Behavior

Normal state (most of the time):

  • external-dns: 14-18Mi
  • Total pod: 48-52Mi

Problem state (observed several times, cannot reproduce now):

  • external-dns: 90Mi (6x increase!)
  • Total pod: 124Mi
  • Memory stayed elevated until manual pod restart

Important details:

  • All DNS records were already "up to date" - no changes were being made
  • No record manipulations occurred during the high memory state
  • Logs showed only normal operation messages (see below)

Reproducibility

Cannot reliably reproduce:

  • Observed the issue several times on fresh pod starts
  • After manual restarts, the issue sometimes reproduced and sometimes didn't
  • Ran 10 consecutive pod restarts as a test - all showed normal memory (14-18Mi)
  • Problem has not recurred since initial observations

This non-deterministic behavior suggests a race condition or state-dependent issue.

Logs

Logs were completely clean during both normal and high-memory states. No errors, warnings, or unusual messages:

{"level":"info","msg":"All records are already up to date"}

Repeated every minute. No webhook errors, API errors, retries, or any indication of problems.

The clean logs are particularly notable because:

  1. No record changes were happening
  2. No errors to trigger retries or buffering
  3. External-DNS reported normal operation while using 6x its usual memory

Investigation Performed

  1. Webhook provider: Memory stable at 33-34Mi in all cases, logs clean
  2. Configuration: ProviderCacheTime: 0s means no webhook response caching
  3. Go memstats (when operating normally at 14Mi; read from the :7979 metrics endpoint, see the sketch after this list):
    • go_memstats_alloc_bytes: 6.2MB
    • go_memstats_heap_inuse_bytes: 9.6MB
    • go_memstats_stack_inuse_bytes: 1.1MB
  4. Restart behavior: Problem cleared immediately on pod restart
  5. Logs: Clean in both states - no errors or warnings at any point
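
For reference, the memstats above come from the Prometheus endpoint on :7979. Below is a minimal sketch of how I plan to track the same gauges over time if the problem recurs. It assumes the metrics port is reachable locally (e.g. via kubectl port-forward, adjusting the deployment name) and that the standard Go collector metrics are exposed, which the figures above suggest:

// memwatch.go: poll the external-dns Prometheus endpoint and log Go memory
// gauges over time, to characterize the growth if it recurs.
// Assumes the metrics port is reachable locally, e.g. via
// kubectl port-forward deploy/external-dns 7979:7979 (adjust names as needed).
package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
    "time"
)

// Gauges of interest. Comparing heap_inuse against sys/released helps tell
// live-object growth (e.g. informer caches) apart from memory the Go runtime
// holds but has not yet returned to the OS.
var watched = []string{
    "go_goroutines",
    "go_memstats_alloc_bytes",
    "go_memstats_heap_inuse_bytes",
    "go_memstats_heap_released_bytes",
    "go_memstats_sys_bytes",
}

// scrape fetches the metrics page and picks out the watched gauges.
func scrape(url string) (map[string]string, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    out := map[string]string{}
    sc := bufio.NewScanner(resp.Body)
    for sc.Scan() {
        line := sc.Text()
        for _, name := range watched {
            // Unlabelled gauge lines look like: "go_memstats_heap_inuse_bytes 9.6e+06"
            if strings.HasPrefix(line, name+" ") {
                out[name] = strings.TrimPrefix(line, name+" ")
            }
        }
    }
    return out, sc.Err()
}

func main() {
    const url = "http://localhost:7979/metrics"
    for {
        vals, err := scrape(url)
        if err != nil {
            fmt.Println("scrape error:", err)
        } else {
            fmt.Println(time.Now().Format(time.RFC3339), vals)
        }
        time.Sleep(30 * time.Second)
    }
}

Running this in both the normal and the problem state should show whether the extra ~75Mi is live Go heap or memory the runtime simply hasn't returned to the OS.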

Hypothesis

Since the webhook provider memory remains stable and there's no caching (ProviderCacheTime: 0s), the issue appears to be in external-dns internal components, possibly:

  • Kubernetes informers (gateway-httproute, service, pods, nodes, namespaces, endpointslices)
  • Platform-specific issue (linux/arm64)
  • Race condition during initialization
  • Regression from v0.19.0 (which had significant memory improvements)

Questions

  1. Are there known issues with v0.20.0 on arm64?
  2. Have others reported similar memory behavior with v0.20.0?
  3. Any known race conditions in informer initialization that could cause this?

Additional Context

  • I can provide heap dumps and goroutine dumps if the problem reproduces (a sketch of how I would capture them follows this list)
  • Willing to test patches or provide additional diagnostics
  • Problem is not critical (pod still functional, restart resolves it)
  • Unable to reproduce on demand, so cannot test downgrade scenarios
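
As a sketch of how I would capture those dumps: the snippet below fetches heap and goroutine profiles over HTTP using the standard net/http/pprof URL layout. This assumes the binary registers pprof handlers on a reachable port, which I have not verified for the stock external-dns image; if it does not, I would need a debug build or pointers to the supported way of profiling it.

// pprofdump.go: fetch heap and goroutine profiles over HTTP.
// ASSUMPTION: this only works if the target binary registers net/http/pprof
// handlers on a reachable port; I have not confirmed that the stock
// external-dns image does.
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// dump saves one pprof profile (standard net/http/pprof URL layout) to a file.
func dump(base, profile, out string) error {
    resp, err := http.Get(base + "/debug/pprof/" + profile)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("GET %s: %s", profile, resp.Status)
    }
    f, err := os.Create(out)
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = io.Copy(f, resp.Body)
    return err
}

func main() {
    base := "http://localhost:7979" // assumed port-forwarded; adjust if pprof lives elsewhere
    for profile, out := range map[string]string{
        "heap":      "heap.pprof",
        "goroutine": "goroutine.pprof",
    } {
        if err := dump(base, profile, out); err != nil {
            fmt.Println(err)
        }
    }
}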

I'm filing this issue despite being unable to reproduce it at the moment, in case others encounter the same behavior.
