Behavior of Envoy proxy on Consul node entering maintenance mode

*Title*: Behavior of Envoy proxy on Consul node entering maintenance mode

*Description*:
In our infrastructure, the containers can receive SIGTERM to free up the machine where it run (also known as Fargaite Spot Interruption).
In this situation, we set the node as "maintenance_mode" in consul cluster using the command line : 
`consul maint -enable`

However, at this exact moment, some of our user requests receive 503 HTTP status code, even if other nodes of the same service are currently running and can serve the request.
Our ideas about this behavior are the following : 

**Idea1 :** 
As we are shutting down the app straight away after entering maintenance mode, if entering maintenance mode is asynchronous, we are killing the service to early (as it can still receive requests from the outside).
Nothing in Consul documentation indicate asynchronous about this command but as we are in a cluster, we imagine that the information may take some times to propagate ?

**Idea 2 :** 
Even if entering maintenance mode is synchronous in consul cluster, maybe the Envoy sidecar still have the associated target IP in cache and may forward some request to this IP during few seconds.
In this second idea, we imagine that we are getting 503 according to documenation about "x-envoy-overloaded" : https://envoy_examples.storage.googleapis.com/envoy-v2-docs/docs/configuration/http_filters/router_filter.html#id13

**Idea 3:**
Both idea 1 and 2 in the same time ?

In this situation, what could be the good thing to do ?
**Solution 1 :**
We were thinking about adding a retry policy on Envoy about '503' http status code, that will result in sending the request to the other nodes (which are available).
However it seems a bit like a workaround to a more logical problem.
Moreover, how to distinguish 503 from 'maintenance_mode' and 503 that can be real answer from our REST API ?

**Solution 2;**
Maybe make the node leave the cluster prior to entering maintenance mode, using the CLI "consul leave" ?
But is this synchronous in the cluster ? Should we wait some seconds before entering maintenance mode ?
Is there any meaning entering maintenance mode after node leave the cluster ?
More important : When we leave a cluster, can the local envoy sidecar still forward local request to other services of the mesh ?

*Relevant Links*:
https://developer.hashicorp.com/consul/commands/maint
https://github.com/envoyproxy/envoy/issues/6930
https://developer.hashicorp.com/consul/commands/leave

Thanks in advance for your help in understanding this behavior.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Behavior of Envoy proxy on Consul node entering maintenance mode #39052

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Behavior of Envoy proxy on Consul node entering maintenance mode #39052

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions