CoreDNS OOMKilled when using this plugin

Hello, I am trying to use this plugin, but my CoreDNS pods get OOMKilled. I am probably misconfiguring it, possibly creating a loop. I'd like someone to review my config and help me troubleshoot.

I have three clusters, each with a modified CoreDNS config. Here is the config from one of them as an example:
```
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        rewrite name substring cluster.cluster1 cluster.local
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
    }

    cluster.cluster2:53 {
        rewrite name substring cluster.cluster2 cluster.local

        forward . ${cluster2_coredns_ip}:53 {
            expire 10s
            policy round_robin
        }
        cache 10
    }

    cluster.cluster3:53 {
        rewrite name substring cluster.cluster3 cluster.local

        forward . ${cluster3_coredns_ip}:53 {
            expire 10s
            policy round_robin
        }
        cache 10
    }

    cluster.all:53 {
      gathersrv cluster.all. {
          cluster.cluster1. c1-
          cluster.cluster2. c2-
          cluster.cluster3. c3-
      }
      forward . 127.0.0.1:53
    } 
```
So `cluster.local` is the local cluster, and each `cluster.cluster[1..3]` name is rewritten to `cluster.local` and forwarded to the corresponding CoreDNS. Finally, `cluster.all` should gather SRV records from all of the clusters.
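To make the intended name handling concrete, here is a minimal Python sketch of it. It is only a model of the description above, not CoreDNS or gathersrv code: the `fan_out` and `rewrite` helpers are hypothetical names, the fan-out into one query per cluster domain is an assumption about how the gathering is meant to work, and the `c1-`/`c2-`/`c3-` hostname prefixes are left out.

```python
# Simplified model of the name handling described above; NOT CoreDNS code.

# Zone handled by the gathersrv block and the cluster domains listed inside it.
GATHER_ZONE = "cluster.all."
CLUSTER_DOMAINS = ["cluster.cluster1.", "cluster.cluster2.", "cluster.cluster3."]

# `rewrite name substring OLD NEW` rules taken from the Corefile.
REWRITES = {
    "cluster.cluster1": "cluster.local",
    "cluster.cluster2": "cluster.local",
    "cluster.cluster3": "cluster.local",
}


def fan_out(qname):
    """Return one query name per cluster for a name in the gathered zone.

    Assumption: the gathered zone suffix is swapped for each cluster domain.
    """
    if not qname.endswith(GATHER_ZONE):
        return []
    stem = qname[: -len(GATHER_ZONE)]
    return [stem + domain for domain in CLUSTER_DOMAINS]


def rewrite(qname):
    """Apply the first matching substring rewrite rule, if any."""
    for old, new in REWRITES.items():
        if old in qname:
            return qname.replace(old, new)
    return qname


if __name__ == "__main__":
    query = "_peers._tcp.etcd-headless.h2.svc.cluster.all."
    for sub_query in fan_out(query):
        print(sub_query, "->", rewrite(sub_query))
```

In this model every per-cluster name ends up as the same `cluster.local` name, so the clusters differ only in which CoreDNS instance answers it.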

Pointing at the cluster1 CoreDNS IP, I can resolve `_peers._tcp.etcd-headless.h2.svc.cluster.local`:

```
 dig @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.local

; <<>> DiG 9.18.24 <<>> @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.local
; (1 server found)
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27603
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: f47ee6beb5803a4b (echoed)
;; QUESTION SECTION:
;_peers._tcp.etcd-headless.h2.svc.cluster.local.	IN SRV

;; ANSWER SECTION:
_peers._tcp.etcd-headless.h2.svc.cluster.local.	30 IN SRV 0 100 2379 etcd-headless.h2.svc.cluster.local.

;; ADDITIONAL SECTION:
etcd-headless.h2.svc.cluster.local. 30 IN A	10.96.0.42

;; Query time: 5 msec
;; SERVER: 10.89.0.225#53(10.89.0.225) (UDP)
;; WHEN: Tue Apr 02 12:48:12 EDT 2024
;; MSG SIZE  rcvd: 237
```
I can also resolve `_peers._tcp.etcd-headless.h2.svc.cluster.cluster1`:
```
dig @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.cluster1

; <<>> DiG 9.18.24 <<>> @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.cluster1
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3306
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: e8816b2d226c93c1 (echoed)
;; QUESTION SECTION:
;_peers._tcp.etcd-headless.h2.svc.cluster.cluster1. IN SRV

;; ANSWER SECTION:
_peers._tcp.etcd-headless.h2.svc.cluster.local.	30 IN SRV 0 100 2379 etcd-headless.h2.svc.cluster.local.

;; ADDITIONAL SECTION:
etcd-headless.h2.svc.cluster.local. 30 IN A	10.96.0.42

;; Query time: 2 msec
;; SERVER: 10.89.0.225#53(10.89.0.225) (UDP)
;; WHEN: Tue Apr 02 12:49:10 EDT 2024
;; MSG SIZE  rcvd: 240
```
Both queries return the same response, as expected.
I can also try with cluster2:
```
dig @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.cluster2

; <<>> DiG 9.18.24 <<>> @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.cluster2
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42009
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: d5ce2f5ab20abd3b (echoed)
;; QUESTION SECTION:
;_peers._tcp.etcd-headless.h2.svc.cluster.cluster2. IN SRV

;; ANSWER SECTION:
_peers._tcp.etcd-headless.h2.svc.cluster.local.	10 IN SRV 0 100 2379 etcd-headless.h2.svc.cluster.local.

;; ADDITIONAL SECTION:
etcd-headless.h2.svc.cluster.local. 10 IN A	10.96.1.114

;; Query time: 9 msec
;; SERVER: 10.89.0.225#53(10.89.0.225) (UDP)
;; WHEN: Tue Apr 02 12:50:49 EDT 2024
;; MSG SIZE  rcvd: 240
```
This still works, but it resolves to a different IP.
However, if I try `cluster.all`:
```
dig @10.89.0.225 -t SRV _peers._tcp.etcd-headless.h2.svc.cluster.all
;; communications error to 10.89.0.225#53: timed out
```
I get a timeout, and the CoreDNS pod is OOMKilled.
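One way to reason about the suspected loop is to trace which server block each query name would land in. The sketch below is a rough model that assumes CoreDNS selects the server block with the longest zone matching the query name; `pick_zone` is a hypothetical helper, and the model says nothing about what gathersrv actually sends to `127.0.0.1:53`.

```python
# Rough model of server-block selection for the Corefile above; NOT CoreDNS code.

# Zones taken from the server-block keys (port stripped, trailing dot added).
ZONES = [".", "cluster.cluster2.", "cluster.cluster3.", "cluster.all."]


def pick_zone(qname, zones=ZONES):
    """Return the longest zone that matches qname on a label boundary."""
    qname = qname.lower()
    if not qname.endswith("."):
        qname += "."
    best = None
    for zone in zones:
        matches = zone == "." or qname == zone or qname.endswith("." + zone)
        if matches and (best is None or len(zone) > len(best)):
            best = zone
    return best


if __name__ == "__main__":
    stem = "_peers._tcp.etcd-headless.h2.svc."
    for suffix in ("cluster.local", "cluster.cluster1", "cluster.cluster2",
                   "cluster.cluster3", "cluster.all"):
        print(suffix, "is served by block", pick_zone(stem + suffix))
```

Under this assumption, `cluster.cluster1` names have no dedicated block and fall through to `.`, while `cluster.cluster2`, `cluster.cluster3` and `cluster.all` names each hit their own block.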

