
Reconfigure API timeouts #617


Merged
ValentaTomas merged 8 commits into main from api-keepalive on May 8, 2025
Conversation

ValentaTomas
Member

Having the read header timeout lower than the backend timeout, and the idle timeout lower than the LB idle timeout, can be causing connection problems.
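For reference, a minimal Go net/http sketch of the server-side relationship being described. The port, the 60s read header timeout, and the wiring are illustrative assumptions, not taken from this PR:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":8080",
		// Give slow clients behind the LB enough time to send request headers
		// (this value is an assumption).
		ReadHeaderTimeout: 60 * time.Second,
		// Keep idle keep-alive connections open longer than the LB's backend
		// keep-alive (600s on GCP), so the LB always closes the connection first
		// and never reuses one the server has already torn down.
		IdleTimeout: 620 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}
```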

ValentaTomas self-assigned this May 5, 2025
ValentaTomas added the bug (Something isn't working) and improvement (Improvement for current functionality) labels May 5, 2025
```hcl
name                        = "${var.prefix}https-proxy"
url_map                     = google_compute_url_map.orch_map.self_link
http_keep_alive_timeout_sec = 540
```
Member

Do we need to change this? Can it be changed without any disruption?

Member Author
ValentaTomas May 7, 2025

Based on this, the LB's client-facing side has a default timeout of 610s (which can be configured), but the backend side has a fixed timeout of 600s. I don't know why these are the defaults, given the issue this can cause.
The reasons for setting the timeouts this way (rising as you go upstream) are here.
This applies to both API and sandbox traffic.

This could be even lower (300 might be ok too), but 540 seems reasonable.

As for the reload without disruption—it should be just part of Envoy config, so most likely yes, but I have not tested this.
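To make the ordering proposed in this comment concrete, here is a small illustrative Go check using the numbers from this thread (540 / 600 / 620). The variable names and the check itself are not part of this PR, just a sketch of the stated intent:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only.
var (
	lbClientKeepAlive  = 540 * time.Second // http_keep_alive_timeout_sec from the Terraform change
	lbBackendKeepAlive = 600 * time.Second // fixed on the LB's backend side
	serverIdleTimeout  = 620 * time.Second // the API server's own idle timeout
)

func main() {
	// "Rising going upstream": each hop keeps connections alive longer than the
	// hop in front of it, so the downstream side always closes first.
	if !(lbClientKeepAlive < lbBackendKeepAlive && lbBackendKeepAlive < serverIdleTimeout) {
		panic(fmt.Sprintf("keep-alive timeouts not rising upstream: %v, %v, %v",
			lbClientKeepAlive, lbBackendKeepAlive, serverIdleTimeout))
	}
	fmt.Println("timeout ordering OK")
}
```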

Member

I think there's a misunderstanding. I think the default Load Balancer timeout is intentionally higher than the one towards the service, so the LB can close the connection to the backend gracefully while maximizing the connection length to the client; this should reduce latency.

Time →──────────────────────────────────────────────────────────────────────────────▶
  • Client–GFE Idle Keep-Alive Timeout = 610 s
  • GFE–Backend Idle Keep-Alive Timeout = 600 s (fixed)
  • Your Server Idle Keep-Alive Timeout > 620 s

─── At 0 s ────────────────────────────────────────────────────────────────────────  
   Client opens a TCP connection to GFE  
   GFE opens a TCP connection to your server  
   (Both sides handshake, HTTP keep-alive enabled)

─── Idle, no requests for 600 s ─────────────────────────────────────────────────  
   At **600 s**, GFE notices its **backend socket** has been idle for 600 s → GFE sends a **FIN** to your server and closes that backend TCP connection.  
   Your server (with 620 s idle config) has not yet closed its side, so it does a **graceful close** when it sees GFE’s FIN.

─── Idle from 600 s to 605 s ──────────────────────────────────────────────────  
   Client ↔ GFE connection still open (it only times out at 610 s)  
   Backend connection is now closed on both ends (GFE closed it at 600 s; server followed suit)

─── At **605 s**, client issues a new HTTP request over the **existing** TCP connection to GFE  
   1. **GFE receives** the request on the client socket (still alive).  
   2. GFE looks for an **idle backend socket** to forward on—but it closed that at 600 s.  
   3. GFE immediately **opens a brand-new TCP connection** to your server (SYN → SYN-ACK → ACK).  
   4. GFE **forwards** the client’s request over that new connection.  
   5. Your server processes and replies; GFE relays the response back to the client.  

─── Connection now active again; both sides may keep it alive for their configured idle timeouts ────▶
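Seen from the client's side of the same diagram, the usual way to avoid racing the 610s client-facing timeout is to keep the client's own idle-connection timeout below it. A hedged Go sketch; the values and URL are placeholders, not this repo's configuration:

```go
package main

import (
	"net/http"
	"time"
)

// newClient returns an HTTP client whose pooled connections are discarded
// before the LB's 610s client-facing keep-alive expires, so the pool never
// hands out a connection the LB may already have closed.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 600 * time.Second, // below the LB's 610s (illustrative)
			MaxIdleConns:    100,
		},
	}
}

func main() {
	resp, err := newClient().Get("https://example.invalid/health") // placeholder URL
	if err == nil {
		resp.Body.Close()
	}
}
```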


Member Author
ValentaTomas May 8, 2025

You might be right; the fact that they set it up this way by default at least suggests it's intentional.

Member Author
ValentaTomas May 8, 2025

Let's scrap this for now.

Member Author
ValentaTomas May 8, 2025

I'm thinking about whether we should also configure our proxies with staggered (620s-610s, 630s-620s) timeouts.

Member Author
ValentaTomas May 8, 2025

I'm really wondering whether the LB's 610s-600s (fixed) split has any advantage, because the client's (-> LB) timeout should already be below 610s, so the connection would still be closed from the client side at that point.
I'm also wondering whether there is any point in making the "client" part of each proxy's timeout lower than the "server" part: wouldn't it work equally well if the "client" part's timeout were slightly higher (or the same, it doesn't matter in the end), as long as the next server's timeout was still bigger?

Member

There are two idle timeouts, downstream and upstream, and the rule (downstream > upstream) applies mainly within the same service.

If you close the downstream connection first, the new downstream connection can try to reuse the TCP connection even though it's already in the CLOSE_WAIT state, because the transport doesn't check the connection state. Doing this results in an error.

For different services, you are trying to prevent both services from sending a FIN at the same time; it's good practice to add a cushion towards upstream so you don't have to recreate these connections. Theoretically, you could have it the other way around.

So the important idle timeout is between the server and client within the same service. I committed it with a comment.
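A hedged Go sketch of that rule for a single proxy; the backend address, port, and exact values are assumptions. The downstream (client-facing) server keeps idle connections longer than the upstream (backend-facing) pool, and the upstream pool's timeout leaves a cushion below the backend's own idle timeout:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Hypothetical backend; its own idle timeout is assumed to be higher still,
	// so the proxy's upstream pool always closes idle connections first.
	backend, _ := url.Parse("http://127.0.0.1:9000")
	proxy := httputil.NewSingleHostReverseProxy(backend)
	proxy.Transport = &http.Transport{
		IdleConnTimeout: 600 * time.Second, // upstream pool: shorter, cushion below the backend
	}

	srv := &http.Server{
		Addr:        ":8081",
		Handler:     proxy,
		IdleTimeout: 610 * time.Second, // downstream: longer than upstream (downstream > upstream)
	}
	log.Fatal(srv.ListenAndServe())
}
```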

@ValentaTomas
Member Author

I also wanted to mention a few things relevant to this setup:

  • Previously, in the code interpreter, the idle timeout was set to a small default value: Configure code interpreter server idle timeout code-interpreter#102
  • We are also missing in the code interpreter the manual "ping" that keeps the connection open in envd (see the sketch after this list)
  • I think the default behavior of Node.js fetch changed, and the latest version might be using keep-alive; we should ensure it is configured the same way in all of our SDKs (API + envd client, Python + JS)
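A hypothetical sketch of such a keep-alive ping for a long-running streaming handler. The handler name, route, interval, and the doLongWork placeholder are invented for illustration; envd's actual mechanism may differ:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// keepAliveHandler emits a small heartbeat while a long-running operation is in
// progress, so no proxy or LB on the path sees the connection as idle.
func keepAliveHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	ticker := time.NewTicker(30 * time.Second) // well under any idle timeout on the path
	defer ticker.Stop()

	done := doLongWork(r) // hypothetical long-running work

	for {
		select {
		case <-r.Context().Done():
			return
		case <-ticker.C:
			fmt.Fprint(w, ": ping\n") // comment line, ignored by SSE-style clients
			flusher.Flush()
		case result := <-done:
			fmt.Fprintln(w, result)
			flusher.Flush()
			return
		}
	}
}

// doLongWork stands in for the real operation; it is not part of this PR.
func doLongWork(r *http.Request) <-chan string {
	ch := make(chan string, 1)
	go func() {
		time.Sleep(2 * time.Minute)
		ch <- "done"
	}()
	return ch
}

func main() {
	http.HandleFunc("/run", keepAliveHandler)
	_ = http.ListenAndServe(":8082", nil)
}
```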

@ValentaTomas
Member Author

ValentaTomas commented May 8, 2025

I lowered the timeouts in the proxies to take the +10s cushion into account, so the orchestrator proxy server's timeout is not the same as the changed timeout of the envd server (https://github.com/e2b-dev/infra/pull/616/files).

ValentaTomas merged commit c82910b into main May 8, 2025
25 checks passed
ValentaTomas deleted the api-keepalive branch May 8, 2025 15:04