
coredns OOMKilled when using this plugin #14

@raffaelespazzoli

Hello, I am trying to use this plugin, but my CoreDNS pods get OOMKilled. I am probably misconfiguring it, possibly creating a loop. I'd like someone to review my config and help me troubleshoot.

I have three clusters, each with a modified CoreDNS config. Here is the config from one of them (cluster1) as an example:

    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        rewrite name substring cluster.cluster1 cluster.local
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

    cluster.cluster2:53 {
        rewrite name substring cluster.cluster2 cluster.local

        forward . ${cluster2_coredns_ip}:53 {
            expire 10s
            policy round_robin
        }
        cache 10
    }

    cluster.cluster3:53 {
        rewrite name substring cluster.cluster3 cluster.local

        forward . ${cluster3_coredns_ip}:53 {
            expire 10s
            policy round_robin
        }
        cache 10
    }

    cluster.all:53 {
      gathersrv cluster.all. {
          cluster.cluster1. c1-
          cluster.cluster2. c2-
          cluster.cluster3. c3-
      }
      forward . 127.0.0.1:53
    } 

So cluster.local is the local cluster. Each cluster.cluster[1..3] name is rewritten to cluster.local and handled by the corresponding cluster's CoreDNS: locally for cluster1, and through forward for cluster2 and cluster3. Finally, cluster.all should gather SRV records from all of the clusters.
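The intended name flow can be sketched in a few lines of Python. This is only an illustrative model, not code from gathersrv or CoreDNS: the helper names `fan_out` and `rewrite_substring` are hypothetical, and the assumption that gathersrv swaps the `cluster.all.` suffix for each configured cluster zone before passing the query on is taken from the config above rather than verified against the plugin.

```python
# Illustrative model only; not code from gathersrv or CoreDNS.
DISTRIBUTED_ZONE = "cluster.all."
# Cluster zones and hostname prefixes, copied from the gathersrv block.
CLUSTERS = {
    "cluster.cluster1.": "c1-",
    "cluster.cluster2.": "c2-",
    "cluster.cluster3.": "c3-",
}


def fan_out(qname: str) -> dict[str, str]:
    """Map a cluster.all name to one query name per cluster zone.

    Assumed behavior: the distributed suffix is replaced by each
    configured cluster zone.
    """
    if not qname.endswith("."):
        qname += "."  # normalize to a fully qualified name
    if not qname.endswith("." + DISTRIBUTED_ZONE):
        raise ValueError(f"{qname} is not under {DISTRIBUTED_ZONE}")
    stem = qname[: -len(DISTRIBUTED_ZONE)]
    return {zone: stem + zone for zone in CLUSTERS}


def rewrite_substring(qname: str, old: str, new: str) -> str:
    """Model of `rewrite name substring OLD NEW` as a plain replacement."""
    return qname.replace(old, new)


if __name__ == "__main__":
    query = "_peers._tcp.etcd-headless.h2.svc.cluster.all"
    for zone, name in fan_out(query).items():
        cluster = zone.rstrip(".")
        print(name, "->", rewrite_substring(name, cluster, "cluster.local"))
```

Under these assumptions, each of the three per-cluster names ends up as the same `cluster.local` name, answered by a different cluster's CoreDNS.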

Pointing at the cluster1 CoreDNS IP, I can resolve _peers._tcp.etcd-headless.h2.svc.cluster.local:

dig @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.local

; <<>> DiG 9.18.24 <<>> @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27603
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: f47ee6beb5803a4b (echoed)
;; QUESTION SECTION:
;_peers._tcp.etcd-headless.h2.svc.cluster.local.	IN SRV

;; ANSWER SECTION:
_peers._tcp.etcd-headless.h2.svc.cluster.local.	30 IN SRV 0 100 2379 etcd-headless.h2.svc.cluster.local.

;; ADDITIONAL SECTION:
etcd-headless.h2.svc.cluster.local. 30 IN A	10.96.0.42

;; Query time: 5 msec
;; SERVER: 10.89.0.225#53(10.89.0.225) (UDP)
;; WHEN: Tue Apr 02 12:48:12 EDT 2024
;; MSG SIZE  rcvd: 237

I can also resolve _peers._tcp.etcd-headless.h2.svc.cluster.cluster1:

dig @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.cluster1

; <<>> DiG 9.18.24 <<>> @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.cluster1
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3306
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: e8816b2d226c93c1 (echoed)
;; QUESTION SECTION:
;_peers._tcp.etcd-headless.h2.svc.cluster.cluster1. IN SRV

;; ANSWER SECTION:
_peers._tcp.etcd-headless.h2.svc.cluster.local.	30 IN SRV 0 100 2379 etcd-headless.h2.svc.cluster.local.

;; ADDITIONAL SECTION:
etcd-headless.h2.svc.cluster.local. 30 IN A	10.96.0.42

;; Query time: 2 msec
;; SERVER: 10.89.0.225#53(10.89.0.225) (UDP)
;; WHEN: Tue Apr 02 12:49:10 EDT 2024
;; MSG SIZE  rcvd: 240

This returns the same answer as the cluster.local query, as it should.
I can also try with cluster2:

dig @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.cluster2

; <<>> DiG 9.18.24 <<>> @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.cluster2
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42009
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: d5ce2f5ab20abd3b (echoed)
;; QUESTION SECTION:
;_peers._tcp.etcd-headless.h2.svc.cluster.cluster2. IN SRV

;; ANSWER SECTION:
_peers._tcp.etcd-headless.h2.svc.cluster.local.	10 IN SRV 0 100 2379 etcd-headless.h2.svc.cluster.local.

;; ADDITIONAL SECTION:
etcd-headless.h2.svc.cluster.local. 10 IN A	10.96.1.114

;; Query time: 9 msec
;; SERVER: 10.89.0.225#53(10.89.0.225) (UDP)
;; WHEN: Tue Apr 02 12:50:49 EDT 2024
;; MSG SIZE  rcvd: 240

This still works, but it resolves to a different IP.
However, if I try cluster.all:

dig @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.all
;; communications error to 10.89.0.225#53: timed out

I get a timeout, and the CoreDNS pod gets OOMKilled.
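One way to sanity-check the loop suspicion is to model which server block each sub-query would reach after `forward . 127.0.0.1:53`. The sketch below assumes two things: that CoreDNS hands a query to the server block with the most specific matching zone, and that gathersrv has already replaced `cluster.all.` with a cluster zone before the query is forwarded (not verified here). `pick_block` is a hypothetical helper, and the zone list is copied from the Corefile above.

```python
# Illustrative routing model; not CoreDNS code.
# Zones of the four server blocks in the Corefile above.
SERVER_BLOCKS = [".", "cluster.cluster2.", "cluster.cluster3.", "cluster.all."]


def pick_block(qname: str, zones: list[str] = SERVER_BLOCKS) -> str:
    """Return the most specific zone that is a suffix of qname."""
    if not qname.endswith("."):
        qname += "."  # normalize to a fully qualified name

    def matches(zone: str) -> bool:
        # The root zone matches everything; other zones must match
        # on a label boundary.
        return zone == "." or qname == zone or qname.endswith("." + zone)

    # All matching zones are suffixes of the same name, so the longest
    # string is also the most specific zone.
    return max((z for z in zones if matches(z)), key=len)


if __name__ == "__main__":
    stem = "_peers._tcp.etcd-headless.h2.svc."
    for zone in ("cluster.cluster1.", "cluster.cluster2.", "cluster.cluster3."):
        print(stem + zone, "-> server block", pick_block(stem + zone))
    # The original name, if it were forwarded unchanged:
    print(stem + "cluster.all.", "-> server block", pick_block(stem + "cluster.all."))
```

This only checks the routing implied by the Corefile under the stated assumptions. It does not establish what actually causes the timeout or the OOMKilled.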
