Commit 19add74

link gcp-fss-communication guide to HTTP client connection management documentation
1 parent cf0b949 commit 19add74

2 files changed

+298
-0
lines changed

Lines changed: 297 additions & 0 deletions
@@ -0,0 +1,297 @@
1+
---
2+
tags: [explanation, http, connections, timeouts, networking]
3+
---
4+
5+
# HTTP client connection management
6+
7+
This page explains how HTTP client connection management works, including connection pooling, timeouts, and how network infrastructure affects your application's reliability.
8+
9+
## Why connection management matters
10+
11+
Modern applications make hundreds or thousands of HTTP requests. Opening a new TCP connection for each request is expensive:
12+
13+
- **TCP handshake overhead**: Three-way handshake (SYN, SYN-ACK, ACK) adds latency
14+
- **TLS handshake**: Additional round trips for certificate exchange and key agreement
15+
- **Slow start**: TCP congestion control starts with small windows
16+
- **Resource consumption**: Each new connection consumes system resources
17+
18+
HTTP connection pooling solves this by reusing established connections, dramatically improving performance and reducing load.
19+
20+
## How connection pooling works
21+
22+
### Connection lifecycle
23+
24+
```mermaid
stateDiagram-v2
[*] --> Establishing: Request needs connection
Establishing --> Active: TCP + TLS handshake complete
Active --> Idle: Request complete, kept in pool
Idle --> Active: Reused for new request
Idle --> Closed: TTL expired or evicted
Active --> Closed: Connection error
Closed --> [*]
```
34+
35+
When no pooled connection is available, the client performs a DNS lookup, establishes a TCP connection, negotiates TLS (if HTTPS), and then sends the request. After the response completes, the connection returns to the pool instead of closing. The next request to the same host reuses this connection, skipping the DNS lookup and both handshakes.
36+
37+
Connections are removed from the pool when the TTL expires, the idle timeout is reached, background eviction runs, a connection error is detected, or the pool size limit is exceeded.
38+
39+
### HTTP Keep-Alive
40+
41+
Connection pooling relies on HTTP Keep-Alive:
42+
43+
```http
Connection: keep-alive
Keep-Alive: timeout=60, max=100
```
47+
48+
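This signals that the connection should remain open after the response completes. In HTTP/1.1, persistent connections are the default, so the headers mainly matter for HTTP/1.0 peers and for advertising the `timeout` and `max` hints. Both client and server must support connection reuse.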
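As a rough illustration of how a client could read the `timeout` and `max` hints from a `Keep-Alive` header value, here is a minimal sketch. The class name `KeepAliveHeader` is hypothetical; real HTTP clients handle this internally and may ignore the hints entirely.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any HTTP library.
final class KeepAliveHeader {
    /** Parses a value such as "timeout=60, max=100" into a name -> number map. */
    static Map<String, Integer> parse(String value) {
        Map<String, Integer> params = new HashMap<>();
        for (String part : value.split(",")) {
            String[] kv = part.trim().split("=", 2);
            if (kv.length != 2) {
                continue; // skip parameters without a value
            }
            try {
                params.put(kv[0].trim().toLowerCase(Locale.ROOT),
                           Integer.parseInt(kv[1].trim()));
            } catch (NumberFormatException ignored) {
                // skip parameters whose value is not a whole number
            }
        }
        return params;
    }
}
```

A client that honours these hints would keep its own idle timeout below the advertised `timeout`, so that it never tries to reuse a connection the server has already closed.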
49+
50+
## DNS and connection failures
51+
52+
DNS plays a critical role in connection establishment and can be a source of intermittent failures.
53+
54+
### DNS caching issues
55+
56+
**Stale DNS cache:** When service IPs change (pod rotation, deployment), cached DNS entries point to old IPs, causing connection failures until TTL expires.
57+
58+
**Key considerations:**
59+
60+
- Services on Nais have short DNS TTL (30 seconds)
61+
- Client-side DNS cache may not respect TTL
62+
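- The JVM caches successful lookups forever when a security manager is installed; otherwise the default is implementation-specific (30 seconds in OpenJDK). Set `networkaddress.cache.ttl` explicitly to control it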
63+
- Connection pools may hold connections to old IPs
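The JVM cache settings can also be applied programmatically. The sketch below assumes it runs early at startup, before the first DNS lookup, because the JVM reads these security properties only once. The class name `DnsCacheConfig` and the chosen values are illustrative.

```java
import java.security.Security;

// Hypothetical startup helper; call it before any hostname is resolved.
final class DnsCacheConfig {
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        // How long successful lookups are cached.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // How long failed lookups are cached.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

For example, `DnsCacheConfig.apply(30, 5)` would align the positive cache with a 30-second DNS TTL. The same properties can be set in the JVM's `java.security` file instead.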
64+
65+
### DNS failures
66+
67+
DNS resolution can fail due to server overload, network partitions, or rate limiting, causing "Unknown host" errors even when services are healthy.
68+
69+
**Mitigation:**
70+
71+
- Set connection TTL to 5-10 minutes for periodic DNS re-resolution
72+
- Implement retry logic for DNS failures
73+
- Connection pooling reduces DNS lookup frequency
74+
75+
## Understanding timeout types
76+
77+
Different timeout settings control different aspects of connection behavior. Configuring them correctly is critical for reliability.
78+
79+
### Connection timeout
80+
81+
**What it controls:** Maximum time to wait for the initial TCP connection to establish.
82+
83+
**Common names:**
84+
85+
- `connectTimeout` (most libraries)
86+
- `CONNECT_TIMEOUT_MILLIS` (Netty)
87+
- Connection timeout (Apache HttpClient)
88+
89+
**Typical values:** 5-10 seconds
90+
91+
**What happens when exceeded:** The connection attempt is abandoned and fails with a timeout exception.
92+
93+
**When to adjust:**
94+
95+
- Cross-cluster or cross-datacenter calls with high latency
96+
- Calls through multiple proxies
97+
- Networks with packet loss
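As one concrete example, the JDK's built-in `java.net.http.HttpClient` (Java 11+) takes the connection timeout on the client builder. This is a minimal sketch; the class name `ConnectTimeoutExample` is hypothetical and the value is a placeholder, not a recommendation for your service.

```java
import java.net.http.HttpClient;
import java.time.Duration;

final class ConnectTimeoutExample {
    /** Builds a client that gives up on establishing a connection after the given time. */
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .build();
    }
}
```

With this client, a connection that cannot be established in time fails with `java.net.http.HttpConnectTimeoutException`. Other libraries expose the same idea under the names listed above.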
98+
99+
### Socket/idle timeout
100+
101+
**What it controls:** Maximum time a connection can remain idle in the pool before being removed.
102+
103+
**Common names:**
104+
105+
- `timeout` (Node.js Agent)
106+
- `connectionTimeToLive` (Apache HttpClient; caps the total connection lifetime, not only idle time)
107+
- `maxIdleTime` (Reactor Netty)
108+
- `keepAliveTime` (Ktor)
109+
110+
**Typical values:** Based on infrastructure timeout constraints (e.g., 55 minutes for on-prem firewall timeouts)
111+
112+
**What happens when exceeded:** Connection is closed and removed from pool.
113+
114+
**Why it matters:** Prevents attempting to reuse connections that network infrastructure has already dropped.
115+
116+
### Read/response timeout
117+
118+
**What it controls:** Maximum time to wait for the response after sending a request. Depending on the library, this bounds either the whole exchange or the gap between consecutive reads.
119+
120+
**Common names:**
121+
122+
- `responseTimeout` (Reactor Netty)
123+
- `requestTimeout` (Ktor)
124+
- `timeout` (Axios - request-level)
125+
- Read timeout (Apache HttpClient)
126+
127+
**Typical values:** 10-60 seconds, depending on endpoint characteristics
128+
129+
**What happens when exceeded:** Request is cancelled with a timeout exception.
130+
131+
**When to adjust:**
132+
133+
- Long-running operations (batch processing, report generation)
134+
- Large file downloads
135+
- Streaming responses
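With the JDK's `java.net.http` API, the response timeout is set per request rather than on the client, which makes it easy to give slow endpoints a larger budget. This is a minimal sketch; `RequestTimeoutExample` is a hypothetical name and the URL passed in is up to the caller.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class RequestTimeoutExample {
    /** Builds a GET request that fails if no response arrives within the timeout. */
    static HttpRequest get(String url, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(timeout)
                .GET()
                .build();
    }
}
```

When the timeout elapses, sending the request fails with `java.net.http.HttpTimeoutException`. A report-generation endpoint might get 60 seconds while a simple lookup gets 10.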
136+
137+
### Background eviction
138+
139+
**What it controls:** Periodic cleanup of idle or stale connections from the pool.
140+
141+
**Common names:**
142+
143+
- `evictIdleConnections` (Apache HttpClient)
144+
- `evictInBackground` (Reactor Netty)
145+
146+
**Typical values:** Every 5 minutes
147+
148+
**Why it matters:** Removes connections that may have been silently dropped by network infrastructure between requests, preventing errors on the next request.
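The idea behind background eviction can be sketched without any HTTP library: track when each connection became idle and periodically drop the ones that have been idle too long. `IdleEvictor` and its string connection IDs are hypothetical stand-ins for a real pool's internals.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of a pool's idle list; real pools also close the sockets.
final class IdleEvictor {
    private final Map<String, Instant> idleSince = new ConcurrentHashMap<>();
    private final Duration maxIdle;

    IdleEvictor(Duration maxIdle) {
        this.maxIdle = maxIdle;
    }

    /** Records that a connection was returned to the pool at the given time. */
    void markIdle(String connectionId, Instant now) {
        idleSince.put(connectionId, now);
    }

    /** Removes every connection idle for longer than maxIdle; returns the count. */
    int evict(Instant now) {
        int evicted = 0;
        Iterator<Map.Entry<String, Instant>> it = idleSince.entrySet().iterator();
        while (it.hasNext()) {
            if (Duration.between(it.next().getValue(), now).compareTo(maxIdle) > 0) {
                it.remove();
                evicted++;
            }
        }
        return evicted;
    }

    int size() {
        return idleSince.size();
    }
}
```

In a running application, `evict(Instant.now())` would be called on a schedule, for example from a `ScheduledExecutorService` every 5 minutes. The libraries listed above provide this behaviour through their own settings.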
149+
150+
## How network infrastructure affects connections
151+
152+
### Stateful firewalls
153+
154+
Firewalls maintain connection state tables and drop idle connections to prevent exhaustion. Most firewalls drop connections **silently** without TCP FIN or RST packets - connections appear healthy in the pool until you try to reuse them.
155+
156+
**Solution:** Configure connection TTL below firewall timeout threshold.
157+
158+
### Load balancers and NAT gateways
159+
160+
Load balancers and NAT gateways enforce their own idle timeouts (typically 60-600 seconds).
161+
162+
**Key points:**
163+
164+
- Client connection TTL should be less than load balancer timeout
165+
- Keep-alive probes may not prevent timeouts
166+
- Backend service connections have separate timeouts
167+
- NAT timeout shorter than client TTL means silent connection drops
168+
169+
### Proxies
170+
171+
Forward and reverse proxies add another layer of timeout configuration:
172+
173+
- **Proxy → Backend timeout**: How long proxy waits for backend response
174+
- **Client → Proxy timeout**: How long client waits for proxy response
175+
- **Proxy connection pooling**: Proxy may maintain separate connection pool to backends
176+
177+
## Connection pool sizing
178+
179+
### Maximum connections
180+
181+
**Per-route/per-host limits:**
182+
183+
Prevents overwhelming a single backend service:
184+
185+
```java
// cm is assumed to be a PoolingHttpClientConnectionManager (Apache HttpClient)
cm.setDefaultMaxPerRoute(20); // Max 20 concurrent connections per host
```
188+
189+
**Total pool size:**
190+
191+
Limits total connections across all hosts:
192+
193+
```java
cm.setMaxTotal(200); // Max 200 connections total
```
196+
197+
### Pool exhaustion
198+
199+
When all connections are in use, new requests must:
200+
201+
- Wait for a connection to become available
202+
- Timeout if wait exceeds configured limit
203+
- Potentially fail with "Connection pool exhausted"
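The waiting behaviour described above can be modelled with a counting semaphore: each permit stands for one pooled connection, and callers wait a bounded time for one to free up. `BoundedPool` is a hypothetical sketch of the mechanism, not how any particular client implements it.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Each permit represents one connection slot in the pool.
final class BoundedPool {
    private final Semaphore permits;

    BoundedPool(int maxConnections) {
        // Fair mode serves waiting callers in arrival order.
        this.permits = new Semaphore(maxConnections, true);
    }

    /** Waits up to waitMillis for a free slot; returns false if the pool stays exhausted. */
    boolean tryAcquire(long waitMillis) {
        try {
            return permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Returns a slot to the pool after the request completes. */
    void release() {
        permits.release();
    }
}
```

A `false` result here corresponds to the "wait exceeds configured limit" case: the backend may be perfectly healthy, but the caller never got a connection to use.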
204+
205+
**Symptoms:**
206+
207+
- Requests fail even though backend is healthy
208+
- High request latencies during traffic spikes
209+
- "ConnectionPoolTimeoutException" or similar errors ("NoHttpResponseException" usually points to a stale connection instead)
210+
211+
**Solutions:**
212+
213+
- Increase pool size if resources allow
214+
- Reduce response timeout to fail faster
215+
- Add circuit breaker to prevent cascade failures
216+
- Scale application horizontally
217+
218+
## Common configuration mistakes
219+
220+
### Infinite or too-long connection TTL
221+
222+
**Problem:** Connections never expire, or expire only after the infrastructure has already dropped them.
223+
224+
**Symptoms:** Intermittent "Connection reset" or "Unexpected end of stream" errors, especially after idle periods.
225+
226+
**Solution:** Set connection TTL below infrastructure timeout thresholds (e.g., 55 minutes for 60-minute firewall timeout).
227+
228+
### No background eviction
229+
230+
**Problem:** Dead connections remain in pool until used.
231+
232+
**Symptoms:** First request after idle period fails, subsequent retry succeeds.
233+
234+
**Solution:** Enable background eviction (e.g., every 5 minutes).
235+
236+
### Confusing request timeout with connection TTL
237+
238+
**Problem:** Setting very short request timeout thinking it will refresh connections.
239+
240+
**Symptoms:**
241+
242+
- Legitimate long-running requests fail
243+
- Unnecessary request failures and retries
244+
245+
**Solution:** Use connection TTL for pool management, request timeout for detecting hung requests.
246+
247+
## Nais platform considerations
248+
249+
### Pod lifecycle and connection pools
250+
251+
On Nais, when your application pods are terminated (during deployments, scaling, or node maintenance):
252+
253+
1. Pod receives SIGTERM signal
254+
2. Pod enters "Terminating" state
255+
3. Endpoints removed from Service (eventual consistency)
256+
4. Grace period allows in-flight requests to complete (default 30s)
257+
258+
**Implications for connection pools:**
259+
260+
- Your application may have pooled connections to terminating pods of other services
261+
- Requests to terminating pods may fail if grace period expires
262+
- Need proper retry logic for pod rotation scenarios
263+
264+
**Best practices on Nais:**
265+
266+
- Implement graceful shutdown in your application
267+
- Configure preStop hooks to delay shutdown
268+
- Use readiness probes to stop traffic before shutdown
269+
- Implement client-side retry with exponential backoff
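The retry delay in the last point can be computed as below. This is a minimal sketch of capped exponential backoff; the `Backoff` class and its parameters are hypothetical, and it deliberately leaves out random jitter to keep the arithmetic visible.

```java
import java.time.Duration;

final class Backoff {
    /**
     * Delay before retry number `attempt` (0-based): base, 2*base, 4*base, ...
     * never exceeding `cap`.
     */
    static Duration delay(int attempt, Duration base, Duration cap) {
        // Limit the shift so the multiplication cannot overflow for sane base values.
        long factor = 1L << Math.min(Math.max(attempt, 0), 20);
        long millis = Math.min(cap.toMillis(), base.toMillis() * factor);
        return Duration.ofMillis(millis);
    }
}
```

With a 100 ms base and a 5 s cap, the delays run 100 ms, 200 ms, 400 ms, 800 ms and so on until they reach 5 s. In production you would normally add random jitter to each delay so that many clients retrying after the same pod rotation do not hit the service in lockstep.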
270+
271+
### Cross-cluster and cross-datacenter calls
272+
273+
Higher network latency affects timeout tuning:
274+
275+
**Same cluster:**
276+
277+
- Connection timeout: 5-10 seconds
278+
- Read timeout: 10-30 seconds
279+
280+
**Cross-cluster or cross-datacenter:**
281+
282+
- Connection timeout: 10-15 seconds
283+
- Read timeout: 30-60 seconds
284+
285+
Also consider:
286+
287+
- Retry budget (avoid retry storms)
288+
- Circuit breaker thresholds
289+
- Hedged requests for latency-sensitive calls
290+
291+
## Related resources
292+
293+
{% if tenant() == "nav" %}
294+
- [Communicate reliably between GCP and on-prem](../how-to/gcp-fss-communication.md) - Practical configuration for on-premises firewall timeouts
295+
{% endif %}
296+
- [Access policies](../how-to/access-policies.md) - Configure network access between services
297+
- [Good practices](good-practices.md) - Application development best practices

docs/workloads/how-to/gcp-fss-communication.md

Lines changed: 1 addition & 0 deletions
@@ -189,6 +189,7 @@ Monitor application logs for these errors (should decrease after configuration):
189189

190190
## Related resources
191191

192+
- [HTTP client connection management](../explanations/http-client-connection-management.md) - Understanding connection pooling and timeouts
192193
- [Access policies](access-policies.md) - Configure outbound access to FSS services
193194
- [Migrating to GCP FAQ](../explanations/migrating-to-gcp.md#how-do-i-reach-an-application-found-on-premises-from-my-application-in-gcp) - Overview of GCP-FSS communication
194195
- [OpenTelemetry metrics](../../observability/metrics/reference/otel.md#http-client-metrics) - Available HTTP client metrics
