Envoy rejects all requests with 503 on aggregate cluster CDS change #11498

Open
@codyl-stripe

Title: Envoy rejects all requests with 503 on aggregate cluster CDS change

Description:

We’re using Envoy 1.13.0.1 with an aggregate cluster and multiple “child” clusters.
The aggregate cluster has a basic config, which we initially supply via CDS, similar to the example in the current docs [0].

name: aggregate_cluster
connect_timeout: 0.25s
lb_policy: CLUSTER_PROVIDED
cluster_type:
  name: envoy.clusters.aggregate
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
    clusters:
    - a
    - b
    - c

We sometimes want to change the priority order within the aggregate cluster, for example to demote cluster “a” to the lowest priority. We make this change via CDS and expect Envoy to apply the following configuration:

name: aggregate_cluster
connect_timeout: 0.25s
lb_policy: CLUSTER_PROVIDED
cluster_type:
  name: envoy.clusters.aggregate
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
    clusters:
    - b
    - c
    - a
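
Since the two configs differ only in the order of the clusters list, the change amounts to a reordering. A minimal Python sketch of that step, assuming the config is held as a plain dict before being served over CDS; `demote_to_lowest` is a hypothetical helper for illustration, not part of Envoy or of any control-plane library:

```python
import copy


def demote_to_lowest(cluster_config, child):
    """Return a copy of an aggregate cluster config with `child` moved last.

    Hypothetical helper: the aggregate cluster takes priority from list
    order, so the last entry is the lowest priority.
    """
    updated = copy.deepcopy(cluster_config)  # leave the original untouched
    children = updated["cluster_type"]["typed_config"]["clusters"]
    children.remove(child)  # raises ValueError if `child` is not listed
    children.append(child)
    return updated


aggregate = {
    "name": "aggregate_cluster",
    "connect_timeout": "0.25s",
    "lb_policy": "CLUSTER_PROVIDED",
    "cluster_type": {
        "name": "envoy.clusters.aggregate",
        "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig",
            "clusters": ["a", "b", "c"],
        },
    },
}

updated = demote_to_lowest(aggregate, "a")
print(updated["cluster_type"]["typed_config"]["clusters"])  # ['b', 'c', 'a']
```

Every other field is carried over unchanged, which matches the update described here: same name, same cluster type, different child order.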

What actually happens is that Envoy accepts the update from the original aggregate cluster to the new one, but then starts returning 503s for all requests routed to the aggregate cluster.

curl -v -X POST http://localhost:17777/check
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 17777 (#0)
> POST /ping HTTP/1.1
> Host: localhost:17777
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-length: 19
< content-type: text/plain
< date: Thu, 04 Jun 2020 22:55:38 GMT
< server: envoy
<
* Connection #0 to host localhost left intact
no healthy upstream

Checking the relevant child cluster stats (with curl http://localhost:9901/clusters | grep health_flags), we see that all child clusters are still healthy. With health check debug logging enabled, we also see that the child clusters continue to pass their health checks:

[2020-06-04 23:01:42.026780] [2020-06-04 23:01:42.026][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28240] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.048145] [2020-06-04 23:01:42.048][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28234] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.059999] [2020-06-04 23:01:42.059][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28233] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.064204] [2020-06-04 23:01:42.064][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28237] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.071922] [2020-06-04 23:01:42.071][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28238] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.151415] [2020-06-04 23:01:42.151][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28247] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.157455] [2020-06-04 23:01:42.157][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28248] hc response=200 health_flags=healthy
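
The health_flags check can also be scripted rather than read by eye. A rough Python sketch, assuming the admin /clusters output contains lines of the form `<cluster>::<host>::health_flags::<flags>`; `unhealthy_hosts` is a hypothetical helper, and the sample text below is made up for illustration rather than taken from the affected instance:

```python
def unhealthy_hosts(clusters_text):
    """List (cluster, host, flags) for every host whose flags are not 'healthy'.

    Assumes one '<cluster>::<host>::health_flags::<flags>' line per host;
    lines without the health_flags marker are ignored.
    """
    marker = "::health_flags::"
    bad = []
    for line in clusters_text.splitlines():
        if marker not in line:
            continue
        prefix, flags = line.rsplit(marker, 1)
        cluster, host = prefix.split("::", 1)
        flags = flags.strip()
        if flags != "healthy":
            bad.append((cluster, host, flags))
    return bad


# Illustrative input only (not real output from the instance in this report).
sample = """\
a::10.0.0.1:8080::health_flags::healthy
a::10.0.0.1:8080::weight::1
b::10.0.0.2:8080::health_flags::/failed_active_hc
c::10.0.0.3:8080::health_flags::healthy
"""

print(unhealthy_hosts(sample))
```

In the state described in this report, a check like this would be expected to return an empty list for the child clusters, even though requests to the aggregate cluster fail.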

So we know that all child clusters still have healthy hosts, but Envoy keeps asserting that there are no healthy hosts when we use the aggregate cluster, and returns 503s.
Similarly, the connection pools continue to report no healthy hosts. Enabling upstream debug logging (with curl -X POST http://localhost:9901/logging?upstream=debug) produces:

[2020-06-04 22:58:47.102169] [2020-06-04 22:58:47.102][148145][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:1230] no healthy host for HTTP connection pool
[2020-06-04 22:58:47.302183] [2020-06-04 22:58:47.302][148145][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:1230] no healthy host for HTTP connection pool
[2020-06-04 22:58:47.502154] [2020-06-04 22:58:47.502][148145][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:1230] no healthy host for HTTP connection pool

Restoring the aggregate cluster to its previous state with another CDS update does not fix the problem. The only way we have found to get Envoy out of this state of always returning 503s is to restart the instance.

Relevant Links:
[0] https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/aggregate_cluster#example
Thanks!
