Endpoint updates during initialisation can stall initialisation #16035

### Description
If an endpoint is removed from a cluster with active health checking enabled while the cluster is still initialising, initialisation can block for at least no_traffic_interval (or forever, if it was the only endpoint in the cluster).

### Repro steps
- Configure Envoy with an actively health-checked EDS cluster that has at least one endpoint.
- During cluster initialisation at startup, send an EDS update removing an endpoint from the cluster.

Repro cases attached: [envoy-gets-stuck.tar.gz](https://github.com/envoyproxy/envoy/files/6325171/envoy-gets-stuck.tar.gz)

The repro case uses endpoints that cause the health checks to time out, in order to extend the window in which the problem can be triggered. The issue can (and does) still happen with real endpoints.

#### Expected outcome:
Cluster initialisation finishes when the initial health checks for the remaining endpoints complete.

#### Actual outcome:
Cluster initialisation requires a second round of health checks to be scheduled, delaying initialisation for no_traffic_interval or, if there are no endpoints remaining, forever.

This happens because [ClusterImplBase's pending_initialize_health_checks_](https://github.com/envoyproxy/envoy/blob/main/source/common/upstream/upstream_impl.cc#L1087-L1104) is set to the number of hosts that were originally in the cluster. When an endpoint is removed, its health check is cancelled but the counter is not decremented. To reach 0 and complete initialisation, some of the remaining endpoints must execute their next scheduled health check.
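
The counter behaviour described above can be illustrated with a small standalone model. This is only a sketch, not Envoy code: `ToyCluster`, `onHealthCheckComplete`, `removeHost` and `initialized` are hypothetical names invented for the illustration, and the only thing borrowed from Envoy is the name of the counter. It assumes every completed health check decrements the counter while removing a host leaves it untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Hypothetical toy model of the behaviour described in this issue.
// It is NOT Envoy's ClusterImplBase; only the counter name is borrowed.
class ToyCluster {
public:
  explicit ToyCluster(std::set<std::string> hosts)
      : hosts_(std::move(hosts)),
        // The counter starts at the number of hosts originally in the cluster.
        pending_initialize_health_checks_(hosts_.size()) {}

  // A health check finished for a host that is still in the cluster.
  void onHealthCheckComplete(const std::string& host) {
    if (hosts_.count(host) != 0 && pending_initialize_health_checks_ > 0) {
      --pending_initialize_health_checks_;
    }
  }

  // Removal as described in the issue: the host (and its pending check)
  // goes away, but the counter is left unchanged.
  void removeHost(const std::string& host) { hosts_.erase(host); }

  bool initialized() const { return pending_initialize_health_checks_ == 0; }

private:
  std::set<std::string> hosts_;
  std::size_t pending_initialize_health_checks_;
};

// Counts how many health checks the remaining host "b" must complete before
// the toy cluster initialises, after "a" is removed mid-initialisation.
inline int checksNeededAfterRemoval() {
  ToyCluster cluster(std::set<std::string>{"a", "b"});
  cluster.removeHost("a");
  int checks = 0;
  while (!cluster.initialized()) {
    cluster.onHealthCheckComplete("b");
    ++checks;
  }
  return checks;
}
```

In this model the remaining host needs two completed checks rather than one, which corresponds to waiting for a second round of health checks. If the removed host was the only one, no check can ever complete and the counter never reaches 0.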
