Reconfigure API timeouts #617
Conversation
packages/cluster/network/main.tf
Outdated
url_map = google_compute_url_map.orch_map.self_link
name = "${var.prefix}https-proxy"
url_map = google_compute_url_map.orch_map.self_link
http_keep_alive_timeout_sec = 540
do we need to change this? can this be changed without any disruption?
Based on this, the client-facing side of the LB has a default timeout of 610s (which can be configured), but the backend side has a fixed, non-configurable timeout of 600s. I don't know why these are the defaults, because of the issue this can cause:
The reasoning for setting the timeouts this way (rising going upstream) is here.
This applies to both API and sandbox traffic.
This could be even lower (300 might be ok too), but 540 seems reasonable.
As for the reload without disruption—it should be just part of Envoy config, so most likely yes, but I have not tested this.
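For context, here is a minimal Go sketch (not code from this PR) of what the "rising going upstream" rule means for a service sitting behind the LB: its idle timeout stays above the GFE's fixed 600s backend keep-alive, so the LB is always the side that closes an idle backend connection first. The 620s value, listen address, and handler are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	srv := &http.Server{
		Addr:    ":8080", // placeholder listen address
		Handler: mux,

		// Keep-alive (idle) timeout of the service behind the LB.
		// Keeping it above the GFE's fixed 600s backend keep-alive means
		// the LB closes an idle backend connection first and never
		// forwards a request onto a connection this server has already
		// torn down.
		IdleTimeout: 620 * time.Second, // placeholder value
	}

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		panic(err)
	}
}
```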
I think there's a misunderstanding. I think the default Load Balancer timeout is higher than the one towards the service on purpose, so it can close the connection to the backend gracefully and maximize the connection length to the client; this should reduce latency.
Time →──────────────────────────────────────────────────────────────────────────────▶
• Client–GFE Idle Keep-Alive Timeout = 610 s
• GFE–Backend Idle Keep-Alive Timeout = 600 s (fixed)
• Your Server Idle Keep-Alive Timeout > 620 s
─── At 0 s ────────────────────────────────────────────────────────────────────────
Client opens a TCP connection to GFE
GFE opens a TCP connection to your server
(Both sides handshake, HTTP keep-alive enabled)
─── Idle, no requests for 600 s ─────────────────────────────────────────────────
At **600 s**, GFE notices its **backend socket** has been idle for 600 s → GFE sends a **FIN** to your server and closes that backend TCP connection.
Your server (with 620 s idle config) has not yet closed its side, so it does a **graceful close** when it sees GFE’s FIN.
─── Idle from 600 s to 605 s ──────────────────────────────────────────────────
Client ↔ GFE connection still open (it only times out at 610 s)
Backend connection is now closed on both ends (GFE closed it at 600 s; server followed suit)
─── At **605 s**, client issues a new HTTP request over the **existing** TCP connection to GFE
1. **GFE receives** the request on the client socket (still alive).
2. GFE looks for an **idle backend socket** to forward on—but it closed that at 600 s.
3. GFE immediately **opens a brand-new TCP connection** to your server (SYN → SYN-ACK → ACK).
4. GFE **forwards** the client’s request over that new connection.
5. Your server processes and replies; GFE relays the response back to the client.
─── Connection now active again; both sides may keep it alive for their configured idle timeouts ────▶
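The same ordering applies one hop down, on the client side of this diagram: any client that pools connections to the LB should keep its own idle timeout below the LB's client-facing keep-alive (610s by default, 540s with this change), so the client always closes first and never tries to reuse a connection the GFE has already shut down. A minimal Go sketch, with the timeout value and URL as illustrative assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		// How long an idle pooled connection to the LB may sit around
		// before this client closes it. Keeping this below the LB's
		// client-facing keep-alive (610s default, 540s after this change)
		// means the client closes first and never reuses a connection the
		// GFE has already shut down.
		IdleConnTimeout: 500 * time.Second, // placeholder value

		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
	}

	client := &http.Client{Transport: transport}

	resp, err := client.Get("https://api.example.com/health") // placeholder URL
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body))
}
```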
You might be right; the fact that they set it up this way by default is at least suspicious.
Let's scrap this for now.
I'm thinking about whether we should configure our proxies to also have (620s-610s, 630s-620s) timeouts.
I'm really wondering whether the LB's 610s-600s (fixed) split has any advantage, because the client's (-> LB) timeout should already be <610s, so the connection would still be closed from the client side at that point.
I'm also wondering whether there is any point in making the "client" part of each proxy's timeout lower than the "server" part: wouldn't it work equally well if the "client" part timeout was slightly higher (or the same, it doesn't matter in the end), as long as the next server's timeout was still bigger?
There are two idle timeouts, downstream and upstream, and the rule (downstream > upstream) is mainly for the same service.
If you close the downstream connection first, a new downstream request can try to reuse the TCP connection even though it's already in the CLOSE_WAIT state, because the transport doesn't check the connection state. Doing this results in an error.
- For different services, you are trying to prevent both services from sending a FIN at the same time; it's good practice to add a cushion towards the upstream, so you don't have to recreate these connections. Theoretically you could have it the other way around.

So the important idle timeout is the one between server and client in the same service; I committed it with a comment.
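A minimal sketch of how that pairing could look inside one of the Go proxies, assuming a plain net/http/httputil reverse proxy; the 10s cushion mirrors the +10s mentioned below, and the timeout values, listen address, and upstream URL are illustrative assumptions rather than the actual configuration:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

// Cushion so the client side of a hop always closes idle connections before
// the server side of that hop does (placeholder value, mirrors the +10s
// mentioned in the comments below).
const cushion = 10 * time.Second

func main() {
	// Placeholder: idle timeout of the next server upstream (e.g. envd).
	nextHopIdle := 630 * time.Second

	upstream, err := url.Parse("http://127.0.0.1:9000") // placeholder upstream address
	if err != nil {
		panic(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(upstream)
	proxy.Transport = &http.Transport{
		// Upstream (client) side of this proxy: close pooled connections a
		// cushion earlier than the next hop's server would, so the proxy
		// never reuses a connection that is already in CLOSE_WAIT.
		IdleConnTimeout: nextHopIdle - cushion,
	}

	srv := &http.Server{
		Addr:    ":8080", // placeholder listen address
		Handler: proxy,

		// Downstream (server) side of this proxy: stay above the idle
		// timeout of whatever sits in front of it (here the LB's fixed
		// 600s backend keep-alive), so the component in front closes first.
		IdleTimeout: 620 * time.Second, // placeholder value
	}

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		panic(err)
	}
}
```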
I also wanted to mention a few things relevant to this setup.
Co-authored-by: Jakub Novák <[email protected]>
I lowered the timeouts in the proxies to take the +10s cushion into account, so the timeout of the orchestrator proxy server is not the same as the changed timeout of the envd server (https://github.com/e2b-dev/infra/pull/616/files).
Having the read header timeout lower than the backend timeout, and the idle timeout lower than the LB idle timeout, can cause connection problems.
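A minimal Go sketch of the constraint described in that last comment, assuming the service behind the LB is a plain net/http server; the concrete numbers are illustrative assumptions, not the values used in this repo:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})

	srv := &http.Server{
		Addr:    ":8080", // placeholder listen address
		Handler: mux,

		// Per the comment above: keep the time allowed for reading request
		// headers at or above the LB's backend timeout (placeholder value).
		ReadHeaderTimeout: 60 * time.Second,

		// And keep the idle (keep-alive) timeout above the LB's idle
		// timeout towards this server, so the LB closes idle connections
		// first (placeholder value).
		IdleTimeout: 620 * time.Second,
	}

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		panic(err)
	}
}
```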