Envoy rejects all requests with 503 on aggregate cluster CDS change

_Description:_

We’re using Envoy 1.13.0.1 with an aggregate cluster and multiple “child” clusters.
Our aggregate cluster has a basic config, which we initially supply via CDS, similar to the example in the current docs [0]:
```yaml
name: aggregate_cluster
connect_timeout: 0.25s
lb_policy: CLUSTER_PROVIDED
cluster_type:
  name: envoy.clusters.aggregate
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
    clusters:
    - a
    - b
    - c
```

We sometimes want to change the priority order within the aggregate cluster, for example to demote cluster “a” to the lowest priority. We make this change via CDS and expect Envoy to apply the following configuration:
```yaml
name: aggregate_cluster
connect_timeout: 0.25s
lb_policy: CLUSTER_PROVIDED
cluster_type:
  name: envoy.clusters.aggregate
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
    clusters:
    - b
    - c
    - a
```
 
What actually happens is that Envoy accepts the update from the original aggregate cluster config to the new one, but then starts returning 503s for all requests to the aggregate cluster:
```shell
curl -v -X POST http://localhost:17777/ping
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 17777 (#0)
> POST /ping HTTP/1.1
> Host: localhost:17777
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-length: 19
< content-type: text/plain
< date: Thu, 04 Jun 2020 22:55:38 GMT
< server: envoy
<
* Connection #0 to host localhost left intact
no healthy upstream
```
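
To quantify how many requests fail rather than checking a single one, something like the following could be used. This is only a sketch: `count_statuses` is a hypothetical helper name, and the URL and request count are placeholders to adapt to your listener.

```shell
# Send N POST requests and print how many responses had each HTTP status code.
# count_statuses is a hypothetical helper, not part of Envoy or curl.
count_statuses() {
  url=$1
  n=${2:-20}   # number of requests to send (default 20)
  i=0
  while [ "$i" -lt "$n" ]; do
    # -o /dev/null discards the body; -w '%{http_code}' prints only the status.
    curl -s -o /dev/null -X POST -w '%{http_code}\n' "$url"
    i=$((i + 1))
  done | sort | uniq -c
}

# Usage (placeholder URL):
#   count_statuses http://localhost:17777/ping 50
```

In the failure state described here, every request would be counted under 503.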
 
Checking the child cluster stats (with `curl http://localhost:9901/clusters | grep health_flags`), we see that all child clusters are still healthy. Even with health check debug logs enabled, the child clusters continue to pass their health checks:
```
[2020-06-04 23:01:42.026780] [2020-06-04 23:01:42.026][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28240] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.048145] [2020-06-04 23:01:42.048][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28234] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.059999] [2020-06-04 23:01:42.059][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28233] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.064204] [2020-06-04 23:01:42.064][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28237] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.071922] [2020-06-04 23:01:42.071][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28238] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.151415] [2020-06-04 23:01:42.151][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28247] hc response=200 health_flags=healthy
[2020-06-04 23:01:42.157455] [2020-06-04 23:01:42.157][148123][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:264] [C28248] hc response=200 health_flags=healthy
``` 
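
For a per-cluster view instead of raw `grep` output, the admin response can be summarized as below. This is a minimal sketch, assuming the `/clusters` text output uses lines of the form `<cluster>::<ip:port>::health_flags::<flags>` and that hosts are IPv4 (an IPv6 address containing `::` would break the field split). `summarize_health` is a hypothetical helper name.

```shell
# Summarize healthy vs. total hosts per cluster from Envoy's /clusters output.
# Assumes lines shaped like: <cluster>::<ip:port>::health_flags::<flags>
summarize_health() {
  awk -F'::' '
    $3 == "health_flags" {
      total[$1]++
      if ($4 == "healthy") healthy[$1]++
    }
    END {
      for (c in total) printf "%s %d/%d healthy\n", c, healthy[c] + 0, total[c]
    }
  ' | sort
}

# Usage (admin port as used above):
#   curl -s http://localhost:9901/clusters | summarize_health
```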

So we know that all child clusters still have healthy hosts, yet Envoy continues to report no healthy hosts when we use the aggregate cluster, and returns 503s.
The connection pools likewise continue to report that they contain no healthy hosts. Enabling upstream debug logging (with `curl -X POST http://localhost:9901/logging?upstream=debug`) produces:
```
[2020-06-04 22:58:47.102169] [2020-06-04 22:58:47.102][148145][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:1230] no healthy host for HTTP connection pool
[2020-06-04 22:58:47.302183] [2020-06-04 22:58:47.302][148145][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:1230] no healthy host for HTTP connection pool
[2020-06-04 22:58:47.502154] [2020-06-04 22:58:47.502][148145][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:1230] no healthy host for HTTP connection pool
```

Restoring the aggregate cluster to its previous state via another CDS update does not fix the problem; the only way we have found to get Envoy out of this state of always returning 503s is to restart the instance.


Relevant Links:

[0] https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/aggregate_cluster#example

Thanks!


Envoy rejects all requests with 503 on aggregate cluster CDS change #11498
