
Reconfigure API timeouts #617


Merged
ValentaTomas merged 8 commits into main from api-keepalive on May 8, 2025
Conversation

ValentaTomas
Member

Having the read header timeout lower than the backend timeout, and the idle timeout lower than the LB idle timeout, can be causing connection problems.
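For reference, a minimal Go net/http sketch of the server-side relationship being described. The port, the 60s read header timeout, and the wiring are illustrative assumptions, not taken from this PR:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":8080",
		// Give slow clients behind the LB enough time to send request headers
		// (this value is an assumption).
		ReadHeaderTimeout: 60 * time.Second,
		// Keep idle keep-alive connections open longer than the LB's backend
		// keep-alive (600s on GCP), so the LB always closes the connection first
		// and never reuses one the server has already torn down.
		IdleTimeout: 620 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}
```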

ValentaTomas self-assigned this May 5, 2025
ValentaTomas added the bug (Something isn't working) and improvement (Improvement for current functionality) labels May 5, 2025
```hcl
name                        = "${var.prefix}https-proxy"
url_map                     = google_compute_url_map.orch_map.self_link
http_keep_alive_timeout_sec = 540
```
Member

Do we need to change this? Can it be changed without any disruption?

Member Author
ValentaTomas May 7, 2025

Based on this, the LB's client-facing side has a default timeout of 610s (which can be configured), but the backend side has a fixed timeout of 600s. I don't know why these are the defaults, given the issue this can cause.
The reasons for setting the timeouts this way (rising as you go upstream) are here.
This applies to both API and sandbox traffic.

This could be even lower (300 might be ok too), but 540 seems reasonable.

As for the reload without disruption—it should be just part of Envoy config, so most likely yes, but I have not tested this.
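To make the ordering proposed in this comment concrete, here is a small illustrative Go check using the numbers from this thread (540 / 600 / 620). The variable names and the check itself are not part of this PR, just a sketch of the stated intent:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only.
var (
	lbClientKeepAlive  = 540 * time.Second // http_keep_alive_timeout_sec from the Terraform change
	lbBackendKeepAlive = 600 * time.Second // fixed on the LB's backend side
	serverIdleTimeout  = 620 * time.Second // the API server's own idle timeout
)

func main() {
	// "Rising going upstream": each hop keeps connections alive longer than the
	// hop in front of it, so the downstream side always closes first.
	if !(lbClientKeepAlive < lbBackendKeepAlive && lbBackendKeepAlive < serverIdleTimeout) {
		panic(fmt.Sprintf("keep-alive timeouts not rising upstream: %v, %v, %v",
			lbClientKeepAlive, lbBackendKeepAlive, serverIdleTimeout))
	}
	fmt.Println("timeout ordering OK")
}
```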

Member

I think there's a misunderstanding. I think the default Load Balancer timeout is intentionally higher than the one towards the service, so the LB can close the connection to the backend gracefully while maximizing the connection length to the client; this should reduce latency.

Time →──────────────────────────────────────────────────────────────────────────────▶
  • Client–GFE Idle Keep-Alive Timeout = 610 s
  • GFE–Backend Idle Keep-Alive Timeout = 600 s (fixed)
  • Your Server Idle Keep-Alive Timeout > 620 s

─── At 0 s ────────────────────────────────────────────────────────────────────────  
   Client opens a TCP connection to GFE  
   GFE opens a TCP connection to your server  
   (Both sides handshake, HTTP keep-alive enabled)

─── Idle, no requests for 600 s ─────────────────────────────────────────────────  
   At **600 s**, GFE notices its **backend socket** has been idle for 600 s → GFE sends a **FIN** to your server and closes that backend TCP connection.  
   Your server (with 620 s idle config) has not yet closed its side, so it does a **graceful close** when it sees GFE’s FIN.

─── Idle from 600 s to 605 s ──────────────────────────────────────────────────  
   Client ↔ GFE connection still open (it only times out at 610 s)  
   Backend connection is now closed on both ends (GFE closed it at 600 s; server followed suit)

─── At **605 s**, client issues a new HTTP request over the **existing** TCP connection to GFE  
   1. **GFE receives** the request on the client socket (still alive).  
   2. GFE looks for an **idle backend socket** to forward on—but it closed that at 600 s.  
   3. GFE immediately **opens a brand-new TCP connection** to your server (SYN → SYN-ACK → ACK).  
   4. GFE **forwards** the client’s request over that new connection.  
   5. Your server processes and replies; GFE relays the response back to the client.  

─── Connection now active again; both sides may keep it alive for their configured idle timeouts ────▶
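Seen from the client's side of the same diagram, the usual way to avoid racing the 610s client-facing timeout is to keep the client's own idle-connection timeout below it. A hedged Go sketch; the values and URL are placeholders, not this repo's configuration:

```go
package main

import (
	"net/http"
	"time"
)

// newClient returns an HTTP client whose pooled connections are discarded
// before the LB's 610s client-facing keep-alive expires, so the pool never
// hands out a connection the LB may already have closed.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 600 * time.Second, // below the LB's 610s (illustrative)
			MaxIdleConns:    100,
		},
	}
}

func main() {
	resp, err := newClient().Get("https://example.invalid/health") // placeholder URL
	if err == nil {
		resp.Body.Close()
	}
}
```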


Member Author
ValentaTomas May 8, 2025

You might be right; the fact that they set it up this way by default at least suggests it's intentional.

Member Author
ValentaTomas May 8, 2025

Let's scrap this for now.

Member Author
ValentaTomas May 8, 2025

I'm thinking about whether we should also configure our proxies with staggered (620s-610s, 630s-620s) timeouts.

Member Author
ValentaTomas May 8, 2025

I'm really wondering whether the LB's 610s-600s (fixed) split has any advantage, because the client's (-> LB) timeout should already be below 610s, so the connection would still be closed from the client side at that point.
I'm also wondering whether there is any point in making the "client" part of each proxy's timeout lower than the "server" part: wouldn't it work equally well if the "client" part's timeout were slightly higher (or the same, it doesn't matter in the end), as long as the next server's timeout was still bigger?

Member

There are two idle timeouts, downstream and upstream, and the rule (downstream > upstream) applies mainly within the same service.

If you close the downstream connection first, the new downstream connection can try to reuse the TCP connection even though it's already in the CLOSE_WAIT state, because the transport doesn't check the connection state. Doing this results in an error.

For different services, you are trying to prevent both services from sending a FIN at the same time; it's good practice to add a cushion towards upstream so you don't have to recreate these connections. Theoretically, you could have it the other way around.

So the important idle timeout is between the server and client within the same service. I committed it with a comment.
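A hedged Go sketch of that rule for a single proxy; the backend address, port, and exact values are assumptions. The downstream (client-facing) server keeps idle connections longer than the upstream (backend-facing) pool, and the upstream pool's timeout leaves a cushion below the backend's own idle timeout:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Hypothetical backend; its own idle timeout is assumed to be higher still,
	// so the proxy's upstream pool always closes idle connections first.
	backend, _ := url.Parse("http://127.0.0.1:9000")
	proxy := httputil.NewSingleHostReverseProxy(backend)
	proxy.Transport = &http.Transport{
		IdleConnTimeout: 600 * time.Second, // upstream pool: shorter, cushion below the backend
	}

	srv := &http.Server{
		Addr:        ":8081",
		Handler:     proxy,
		IdleTimeout: 610 * time.Second, // downstream: longer than upstream (downstream > upstream)
	}
	log.Fatal(srv.ListenAndServe())
}
```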

@ValentaTomas
Member Author

I also wanted to mention a few things relevant to this setup:

  • Previously, in the code interpreter, the idle timeout was set to a small default value: Configure code interpreter server idle timeout code-interpreter#102
  • We are also missing in the code interpreter the manual "ping" that keeps the connection open in envd (see the sketch after this list)
  • I think the default behavior of Node.js fetch changed, and the latest version might be using keep-alive; we should ensure it is configured the same way in all of our SDKs (API + envd client, Python + JS)
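A hypothetical sketch of such a keep-alive ping for a long-running streaming handler. The handler name, route, interval, and the doLongWork placeholder are invented for illustration; envd's actual mechanism may differ:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// keepAliveHandler emits a small heartbeat while a long-running operation is in
// progress, so no proxy or LB on the path sees the connection as idle.
func keepAliveHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	ticker := time.NewTicker(30 * time.Second) // well under any idle timeout on the path
	defer ticker.Stop()

	done := doLongWork(r) // hypothetical long-running work

	for {
		select {
		case <-r.Context().Done():
			return
		case <-ticker.C:
			fmt.Fprint(w, ": ping\n") // comment line, ignored by SSE-style clients
			flusher.Flush()
		case result := <-done:
			fmt.Fprintln(w, result)
			flusher.Flush()
			return
		}
	}
}

// doLongWork stands in for the real operation; it is not part of this PR.
func doLongWork(r *http.Request) <-chan string {
	ch := make(chan string, 1)
	go func() {
		time.Sleep(2 * time.Minute)
		ch <- "done"
	}()
	return ch
}

func main() {
	http.HandleFunc("/run", keepAliveHandler)
	_ = http.ListenAndServe(":8082", nil)
}
```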

@ValentaTomas
Member Author

ValentaTomas commented May 8, 2025

I lowered the timeouts in the proxies to take the +10s cushion into account, so the orchestrator proxy server's timeout is not the same as the changed timeout of the envd server (https://github.com/e2b-dev/infra/pull/616/files).

ValentaTomas merged commit c82910b into main May 8, 2025
25 checks passed
ValentaTomas deleted the api-keepalive branch May 8, 2025 15:04