Endpoint updates during initialisation can stall initialisation #16035

Open
@JonathanO

Description

If an endpoint is removed from a cluster with active health checking enabled while the cluster is still initialising, initialisation can block for at least no_traffic_interval (or forever, if it was the only endpoint in the cluster).

Repro steps

  • Configure envoy with an actively health-checked EDS cluster with >0 endpoints.
  • During cluster initialisation at startup, send an EDS update removing an endpoint from the cluster.

Repro cases attached: envoy-gets-stuck.tar.gz

The repro case uses endpoints that cause the health checks to time out, in order to extend the window in which the problem can be triggered. The issue can (and does) still happen with real endpoints.

Expected outcome:

Cluster initialisation finishes when the initial health checks for the remaining endpoints complete.

Actual outcome:

Cluster initialisation requires a second round of health checks to be scheduled, delaying initialisation for no_traffic_interval or, if there are no endpoints remaining, forever.

This happens because ClusterImplBase's pending_initialize_health_checks_ is set to the number of hosts that were originally in the cluster. When an endpoint is removed, its health check is cancelled but the counter is not decremented. For the counter to reach 0 and initialisation to complete, some of the remaining endpoints need to execute their next scheduled health check.
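
The counter behaviour described above can be illustrated with a small standalone model. This is only a sketch and not Envoy's code: the class ClusterInitModel, its methods, the helper initialisedAfterFirstRound and the decrement_on_removal switch are all hypothetical names made up for illustration. Only the counter name pending_initialize_health_checks_ comes from the description, and decrementing on removal is one possible fix, assumed here for comparison.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Toy model of the counter described in this issue. NOT Envoy's implementation:
// every name here except pending_initialize_health_checks_ is hypothetical.
class ClusterInitModel {
public:
  ClusterInitModel(const std::set<std::string>& hosts, bool decrement_on_removal)
      : hosts_(hosts), unchecked_hosts_(hosts),
        pending_initialize_health_checks_(hosts.size()),
        decrement_on_removal_(decrement_on_removal) {}

  // A health check for `host` finished (success, failure or timeout).
  // Any completion counts towards initialisation while the counter is > 0.
  void onHealthCheckComplete(const std::string& host) {
    if (hosts_.count(host) == 0) {
      return; // checks for removed hosts were cancelled and never report
    }
    unchecked_hosts_.erase(host);
    if (pending_initialize_health_checks_ > 0) {
      --pending_initialize_health_checks_;
    }
  }

  // An EDS update removed `host`; its in-flight health check is cancelled.
  void removeHost(const std::string& host) {
    hosts_.erase(host);
    const bool was_unchecked = unchecked_hosts_.erase(host) > 0;
    // Assumed fix: also account for the check that will now never complete.
    if (decrement_on_removal_ && was_unchecked &&
        pending_initialize_health_checks_ > 0) {
      --pending_initialize_health_checks_;
    }
  }

  bool initialized() const { return pending_initialize_health_checks_ == 0; }

private:
  std::set<std::string> hosts_;           // hosts currently in the cluster
  std::set<std::string> unchecked_hosts_; // hosts with no completed check yet
  std::size_t pending_initialize_health_checks_;
  bool decrement_on_removal_;
};

// Scenario from the report: two hosts, one removed before any check completes,
// then the remaining host finishes its first health check.
bool initialisedAfterFirstRound(bool decrement_on_removal) {
  ClusterInitModel cluster(std::set<std::string>{"a", "b"}, decrement_on_removal);
  cluster.removeHost("b");
  cluster.onHealthCheckComplete("a");
  return cluster.initialized();
}
```

In this model, initialisedAfterFirstRound(false) leaves the counter at 1 after the first round, so host "a" has to complete a second health check before initialisation finishes. With the assumed decrement on removal, the first round is enough. If the only host is removed and nothing is decremented, no further completion can ever arrive, which matches the "forever" case.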
