Bug: SDS never delivers client certificate secret when it is exclusively referenced by a SecurityPolicy extAuth Backend #8616

@gkoppura-github

Description

When an EG Backend resource's spec.tls.clientCertificateRef secret is only referenced from a SecurityPolicy extAuth (or gRPC auth) backendRef, Envoy subscribes to the secret via SDS but the EG controller never pushes the secret payload. The secret is permanently stuck in dynamic_warming_secrets with version_info: "uninitialized", causing every outbound TLS handshake to the auth backend to fail with cx_connect_fail.

Expected: The client certificate secret is included in the xDS snapshot and Envoy can complete the mTLS handshake to the ext auth backend.

Actual: Envoy reports sds.<namespace>/<secret-name>.init_fetch_timeout: 1 and update_success: 0. The secret never leaves the WARMING state, and all connections to the ext auth backend fail.

The bug is not a race condition. The WARMING state persists indefinitely (verified over 80+ seconds).

Root cause (code-level): The xDS translator has two separate code paths for adding client-cert secrets to the snapshot:

  1. processHTTPListenerXdsTranslation (translator.go) — walks route.Destination.Settings[*].TLS.ClientCertificates and calls buildXdsTLSCertSecret + tCtx.AddXdsResource for every route backend. ✅ Works.

  2. (*extAuth).patchResources (extauth.go) — calls createExtServiceXDSCluster which internally calls addXdsCluster. addXdsCluster pushes the CA cert (buildXdsUpstreamTLSCASecret) but never calls buildXdsTLSCertSecret for the client cert. ❌ Missing.

Envoy receives an SDS subscription config referencing the secret name (so the secret appears in dynamic_warming_secrets) but the controller never sends the actual key/cert payload.
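The gap between the two paths can be illustrated with a small toy model. This is only a sketch of the reasoning above, not Envoy Gateway code: the names Backend, Snapshot, add_cluster, translate_route_backend and translate_ext_auth_backend are hypothetical stand-ins, and the model assumes both paths share the same cluster-building step. The push_client_cert flag models the kind of fix proposed in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Hypothetical stand-in for a translated backend's TLS settings."""
    name: str
    ca_secret: str
    client_cert_secret: str


@dataclass
class Snapshot:
    """Toy xDS snapshot: secrets Envoy subscribes to vs. secrets actually pushed."""
    subscribed: set = field(default_factory=set)
    pushed: set = field(default_factory=set)

    def warming(self):
        # A secret with a subscription but no payload never leaves WARMING.
        return sorted(self.subscribed - self.pushed)


def add_cluster(snap, backend):
    """Models the shared cluster step: references both secrets, pushes only the CA."""
    snap.subscribed.update({backend.ca_secret, backend.client_cert_secret})
    snap.pushed.add(backend.ca_secret)


def translate_route_backend(snap, backend):
    """Models the HTTPRoute path: cluster plus an explicit client-cert push."""
    add_cluster(snap, backend)
    snap.pushed.add(backend.client_cert_secret)


def translate_ext_auth_backend(snap, backend, push_client_cert=False):
    """Models the extAuth path; push_client_cert=True models the proposed fix."""
    add_cluster(snap, backend)
    if push_client_cert:
        snap.pushed.add(backend.client_cert_secret)


def build(push_client_cert=False):
    """Translate the two backends from the repro config into one toy snapshot."""
    snap = Snapshot()
    translate_route_backend(
        snap, Backend("myapp-backend", "default/ca-bundle", "default/myapp-tls"))
    translate_ext_auth_backend(
        snap, Backend("authz-backend", "default/ca-bundle", "default/authz-tls"),
        push_client_cert)
    return snap


if __name__ == "__main__":
    print("current behavior, warming:", build().warming())
    print("with fix, warming:", build(push_client_cert=True).warming())
```

In this model a secret is stuck exactly when it is subscribed but never pushed, which matches the observed state of authz-tls.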


Repro Steps

Prerequisites

  • A Kubernetes cluster with Envoy Gateway v1.7.1 installed
  • Two backend services (myapp and authz) that require mutual TLS (i.e., they verify the client certificate presented by the gateway)
  • Three secrets in the same namespace:
    • myapp-tls — leaf TLS cert/key for the gateway to present to myapp
    • authz-tls — leaf TLS cert/key for the gateway to present to authz; this is the secret that will get stuck
    • ca-bundle — CA certificate used to verify both backend server certs

Minimal Reproducing Config

---
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: eg
  namespace: default
spec:
  gatewayClassName: eg
  listeners:
  - name: http
    port: 80
    protocol: HTTP
---
# Backend for the main application — its clientCertificateRef (myapp-tls) will
# be referenced by an HTTPRoute, so EG pushes it correctly. Acts as a control.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: myapp-backend
  namespace: default
spec:
  endpoints:
  - fqdn:
      hostname: myapp.default.svc.cluster.local
      port: 8443
  tls:
    caCertificateRefs:
    - name: ca-bundle
      group: ''
      kind: Secret
    clientCertificateRef:
      name: myapp-tls   # ← this secret IS pushed to SDS (routed via HTTPRoute)
      group: ''
      kind: Secret
    sni: myapp.default.svc.cluster.local
---
# Backend for the ext auth service — its clientCertificateRef (authz-tls) is
# ONLY referenced from a SecurityPolicy extAuth backendRef. This secret will
# be STUCK in SDS WARMING state.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: authz-backend
  namespace: default
spec:
  endpoints:
  - fqdn:
      hostname: authz.default.svc.cluster.local
      port: 8443
  tls:
    caCertificateRefs:
    - name: ca-bundle
      group: ''
      kind: Secret
    clientCertificateRef:
      name: authz-tls   # ← this secret is NEVER pushed to SDS (only extAuth ref)
      group: ''
      kind: Secret
    sni: authz.default.svc.cluster.local
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp
  namespace: default
spec:
  parentRefs:
  - name: eg
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: "/"
    backendRefs:
    - name: myapp-backend
      group: gateway.envoyproxy.io
      kind: Backend
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: myapp-auth
  namespace: default
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: myapp
  extAuth:
    http:
      backendRefs:
      - name: authz-backend
        namespace: default
        group: gateway.envoyproxy.io
        kind: Backend

Verification commands

# 1. Apply the config
kubectl apply -f repro.yaml

# 2. Wait for the Gateway to become Ready, then get the Envoy proxy pod name
ENVOY_POD=$(kubectl get pods -n envoy-gateway-system \
  -l 'app.kubernetes.io/component=proxy' \
  -o jsonpath='{.items[0].metadata.name}')

# 3. Port-forward the Envoy admin interface
kubectl port-forward -n envoy-gateway-system pod/$ENVOY_POD 19000:19000 &

# 4. Check SDS stats — authz-tls will show:
#      init_fetch_timeout: 1
#      update_success: 0
# while myapp-tls will show update_success: 1
curl -s http://localhost:19000/stats | grep "^sds\."

# 5. Check dynamic_warming_secrets in the config dump
curl -s http://localhost:19000/config_dump | \
  python3 -c "
import sys, json
d = json.load(sys.stdin)
for section in d.get('configs', []):
    if section.get('@type','').endswith('SecretsConfigDump'):
        warming = [(s['name'], s.get('version_info','')) for s in section.get('dynamic_warming_secrets',[])]
        print('WARMING:', warming)
        active  = [s['name'] for s in section.get('dynamic_active_secrets',[])]
        print('ACTIVE:', active)
"
# Expected output:
#   WARMING: [('default/authz-tls', 'uninitialized')]
#   ACTIVE: ['default/myapp-tls', ...]

# 6. To confirm it is not a race condition, wait 60 seconds and re-run step 4.
# update_success for authz-tls will remain 0.
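
The check in step 4 can also be scripted. The following is a minimal sketch, assuming the stats lines have the sds.<namespace>/<secret-name>.<counter>: <value> shape shown in this issue; parse_sds_stats and stuck_secrets are hypothetical helper names, and SAMPLE is copied from the stats reported under Logs.

```python
import re
from collections import defaultdict

# Matches lines such as "sds.default/authz-tls.update_success: 0".
SDS_LINE = re.compile(r"^sds\.(?P<secret>.+)\.(?P<stat>[a-z_]+):\s*(?P<value>\d+)\s*$")


def parse_sds_stats(text):
    """Return {secret_name: {counter_name: int}} for every sds.* line in text."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = SDS_LINE.match(line.strip())
        if match:
            stats[match["secret"]][match["stat"]] = int(match["value"])
    return dict(stats)


def stuck_secrets(stats):
    """Secrets whose initial fetch timed out and that never received an update."""
    return sorted(
        name
        for name, counters in stats.items()
        if counters.get("init_fetch_timeout", 0) > 0
        and counters.get("update_success", 0) == 0
    )


# Sample input taken from the Envoy stats quoted in this issue.
SAMPLE = """\
sds.default/authz-tls.init_fetch_timeout: 1
sds.default/authz-tls.update_attempt: 1
sds.default/authz-tls.update_success: 0
sds.default/myapp-tls.init_fetch_timeout: 0
sds.default/myapp-tls.update_success: 1
"""

if __name__ == "__main__":
    print("stuck:", stuck_secrets(parse_sds_stats(SAMPLE)))
```

To run it against a live proxy, the output of the curl command in step 4 could be passed to parse_sds_stats in place of SAMPLE.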

Confirming the broken connection

# The ext_authz cluster will show zero successful handshakes and many connect failures
curl -s http://localhost:19000/stats | grep -E "extauth.*ssl\.handshake|extauth.*cx_connect_fail"
# Expected:
#   cluster.securitypolicy/default/myapp-auth/extauth/0.ssl.handshake: 0
#   cluster.securitypolicy/default/myapp-auth/extauth/0.upstream_cx_connect_fail: <N>

Workaround

Point authz-backend.spec.tls.clientCertificateRef to any secret that is also used as a clientCertificateRef on a Backend referenced by an HTTPRoute. This causes the HTTPRoute code path to push the secret into the snapshot, and the ext auth cluster shares it via the SDS name lookup.

# Workaround: reuse myapp-tls (already pushed by the HTTPRoute path)
clientCertificateRef:
  name: myapp-tls   # works even though it's a different identity cert
  group: ''
  kind: Secret

Environment

Envoy Gateway version: v1.7.1 (docker.io/envoyproxy/gateway:v1.7.1)
Kubernetes version: v1.31 (LKE)
GatewayClass controller: gateway.envoyproxy.io/gatewayclass-controller
Affected resources: Backend + SecurityPolicy (extAuth HTTP and gRPC)
First affected version: Unknown; reproduced on v1.7.1, not tested on earlier versions

Logs

EG controller — the secret is reconciled without error but the snapshot push never includes the client cert:

{"level":"info","msg":"processing Secret authz-tls","namespace":"default","name":"authz-tls"}
{"level":"info","msg":"processing Secret authz-tls","namespace":"default","name":"authz-tls"}
# (repeats on every reconcile loop — no errors, but no snapshot push)

Envoy stats (after 60+ seconds of uptime):

sds.default/authz-tls.init_fetch_timeout: 1
sds.default/authz-tls.update_attempt:     1
sds.default/authz-tls.update_success:     0   ← permanently stuck

sds.default/myapp-tls.init_fetch_timeout: 0
sds.default/myapp-tls.update_success:     1   ← correctly delivered

Envoy config dump:

WARMING: [('default/authz-tls', 'uninitialized')]
ACTIVE:  ['default/myapp-tls', 'default/ca-bundle', ...]

Code Pointer

The fix should add client-cert snapshot injection in the extAuth cluster creation path, mirroring what processHTTPListenerXdsTranslation already does for route backends:

internal/xds/translator/translator.go — the working pattern (already there for HTTPRoute):

// add http route client certs
for _, route := range httpListener.Routes {
    if route.Destination != nil {
        for _, st := range route.Destination.Settings {
            if st.TLS != nil {
                for _, cert := range st.TLS.ClientCertificates {
                    secret := buildXdsTLSCertSecret(&cert)
                    tCtx.AddXdsResource(resourcev3.SecretType, secret)
                }
            }
        }
    }
}

Either patchResources (internal/xds/translator/extauth.go) or addXdsCluster (internal/xds/translator/utils.go) needs the equivalent loop over TLS.ClientCertificates on each DestinationSetting.

Labels: kind/bug (Something isn't working)
