Skip to content

The requests may fail when on demand CDS returns clusters #20873

Open
@lambdai

Description

@lambdai

Description:

Upon OnDemand cluster returns an available cluster, the requests waiting for that cluster may fail due to hosts are not added to the cluster.

Detected by Test case TcpProxyOdcdsIntegrationTest, SingleTcpClient
https://github.com/envoyproxy/envoy/blob/main/test/integration/tcp_proxy_odcds_integration_test.cc#L130

Background
Currently a cluster is fully functional after cluster is warmed up and host members is propagated to worker thread.

The former enables obtain a ThreadLocalCluster by the name of the cluster.
The latter supports LB when a upstream connection is needed by a router.

Prior to on-demand CDS, the two phases are distinguished by the error details but not many users need to understand the concrete reason.

However, in on-demand CDS, the expectation is a little different. The downstream filter is expected to be waiting until the cluster is fully ready.

Root Cause

From main thread perspective, the first host member update and the resumption of the router filter are concurrent at work threads.
Chances are the router filter are resumed before the first member is delivered,
thus the first bunch of requests using on-demand CDS are failing because of "no healthy upstream host".

Proposal
I am considering adding another API to cluster manager, namely

NewHostCallback ThreadLocalCluster::waitForNewHost()

This new function can be deemed as an extended
ClusterDiscoveryCallbackHandlePtr requestOnDemandClusterDiscovery()
that addressed the issue.

The current requestOnDemandClusterDiscovery calls this new waitForNewHost()
and hide the details of first host update.

This API could be adopted even if the cluster is not on-demand. There are known cases that all the hosts are removed during the cluster update and retry policy is not helping

Alternatives
Consider the above unlucky sequences as a known failure and improve the each retry policy (of each protocol) to handle it.
Currently TcpProxy and HCM fail fast on this condition.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions