Behavior of Envoy proxy on Consul node entering maintenance mode #39052

Closed as not planned
@AntoineDuComptoirDesPharmacies

Description

Title: Behavior of Envoy proxy on Consul node entering maintenance mode

Description:
In our infrastructure, containers can receive a SIGTERM so that the machine they run on can be reclaimed (a Fargate Spot interruption).
In this situation, we put the node into maintenance mode in the Consul cluster using the command line:
consul maint -enable

However, at that exact moment, some of our users' requests receive a 503 HTTP status code, even though other nodes of the same service are running and could serve the request.
Our ideas about this behavior are the following:

Idea 1:
Since we shut down the app immediately after entering maintenance mode, if entering maintenance mode is asynchronous, we are killing the service too early (while it can still receive requests from outside).
Nothing in the Consul documentation says this command is asynchronous, but since we are in a cluster, we imagine the information may take some time to propagate?
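
To make Idea 1 concrete, here is a minimal Python sketch of a shutdown sequence that does not stop the app right after `consul maint -enable`. It polls the local agent's health endpoint until the node is no longer listed as passing, then waits an extra grace period. The service name, node name, agent address, delays, and the `stop_app` callback are all assumptions for illustration, not values from our setup, and the sketch does not prove that propagation is the actual cause of the 503s.

```python
import json
import subprocess
import time
import urllib.request

# Assumption: the local Consul agent listens on its default HTTP address.
CONSUL_HTTP = "http://127.0.0.1:8500"


def node_is_passing(service, node, fetch=None):
    """Return True if `node` is still listed as a passing instance of `service`."""
    url = f"{CONSUL_HTTP}/v1/health/service/{service}?passing=true"
    if fetch is None:
        with urllib.request.urlopen(url, timeout=2) as resp:
            entries = json.load(resp)
    else:
        # Injectable fetcher, used for testing without a running agent.
        entries = fetch(url)
    return any(entry["Node"]["Node"] == node for entry in entries)


def drain_then_stop(service, node, stop_app, grace_seconds=5.0, timeout=30.0,
                    run=subprocess.run, is_passing=node_is_passing,
                    sleep=time.sleep, clock=time.monotonic):
    """Enable maintenance mode, wait for it to be visible, then stop the app."""
    # Same command as in the issue: mark the whole node as under maintenance.
    run(["consul", "maint", "-enable"], check=True)

    # Wait until the health API no longer reports this node as passing,
    # or give up after `timeout` seconds.
    deadline = clock() + timeout
    while is_passing(service, node) and clock() < deadline:
        sleep(0.5)

    # Extra grace period (hypothetical value) so that remote sidecars have
    # a chance to pick up the change and in-flight requests can finish.
    sleep(grace_seconds)
    stop_app()
```

A SIGTERM handler could call, for example, `drain_then_stop("web", "node-1", stop_app=my_shutdown)`, where `"web"`, `"node-1"`, and `my_shutdown` are placeholders. The total wait must fit within the interruption notice window.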

Idea 2:
Even if entering maintenance mode is synchronous in the Consul cluster, maybe the Envoy sidecar still has the target IP cached and may forward some requests to it for a few seconds.
In this second case, we imagine we are getting the 503 described in the documentation for "x-envoy-overloaded": https://envoy_examples.storage.googleapis.com/envoy-v2-docs/docs/configuration/http_filters/router_filter.html#id13

Idea 3:
Both Idea 1 and Idea 2 at the same time?

In this situation, what would be the right thing to do?
Solution 1:
We were thinking about adding a retry policy in Envoy for the 503 HTTP status code, so that the request is resent to the other nodes (which are available).
However, it feels like a workaround for a more fundamental problem.
Moreover, how can we distinguish a 503 caused by maintenance mode from a 503 that is a genuine response from our REST API?
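
One possible way to express Solution 1 without retrying genuine 503 responses from the REST API is to retry only on connection failures. The Python sketch below builds a Consul `service-router` config entry that sets `NumRetries` and `RetryOnConnectFailure` and prints it as JSON. The service name `web`, the catch-all path prefix, and the retry count are assumptions for illustration. It also assumes the service uses an HTTP-based protocol in the mesh, and it would not cover requests that were already in flight when the app stopped.

```python
import json


def retry_router_entry(service, num_retries=2):
    """Build a service-router entry that retries only on connect failures."""
    return {
        "Kind": "service-router",
        "Name": service,  # hypothetical service name supplied by the caller
        "Routes": [
            {
                # Catch-all route: apply the retry settings to every path.
                "Match": {"HTTP": {"PathPrefix": "/"}},
                "Destination": {
                    "NumRetries": num_retries,
                    # Retry when the upstream connection cannot be
                    # established, instead of retrying on the 503 status
                    # code, which the application itself may also return.
                    "RetryOnConnectFailure": True,
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(retry_router_entry("web"), indent=2))
```

The printed JSON could be saved to a file and applied with `consul config write`. Whether a connect-failure retry is enough in practice depends on how the 503s are actually produced, which this issue has not established.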

Solution 2:
Maybe make the node leave the cluster before entering maintenance mode, using the CLI command "consul leave"?
But is this synchronous across the cluster? Should we wait a few seconds before entering maintenance mode?
Is there any point in entering maintenance mode after the node has left the cluster?
More importantly: once we leave the cluster, can the local Envoy sidecar still forward local requests to other services of the mesh?

Relevant Links:
https://developer.hashicorp.com/consul/commands/maint
#6930
https://developer.hashicorp.com/consul/commands/leave

Thanks in advance for your help in understanding this behavior.

Metadata

Labels: area/load balancing, question, stale
