perf: enable TLS session resumption and 0-RTT in ingress nginx#17

Closed
Evrard-Nil wants to merge 1 commit into main from perf/tls-session-resumption

@Evrard-Nil

Summary

Flip ssl_session_tickets and ssl_early_data from off to on in
the two nginx config generators (scripts/entrypoint.sh and
scripts/generate-nginx-upstream.sh). The session cache
(shared:SSL:50m) and timeout (1d) were already in place — only
tickets and 0-RTT needed enabling.

Also forward the Early-Data header to all proxied locations
(proxy_set_header Early-Data $ssl_early_data;) so downstream backends
can reject 0-RTT requests on non-idempotent paths if they choose.
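
As a rough illustration of what the two generators would emit after this change, here is a minimal sketch. The function names emit_tls_session_conf and emit_proxy_headers are hypothetical stand-ins, not the actual code in scripts/entrypoint.sh or scripts/generate-nginx-upstream.sh; only the directive values come from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not the real generator code.

# Server-level TLS session settings described in the summary.
emit_tls_session_conf() {
    cat <<'EOF'
    ssl_session_cache shared:SSL:50m;   # already in place
    ssl_session_timeout 1d;             # already in place
    ssl_session_tickets on;             # was: off
    ssl_early_data on;                  # was: off (TLS 1.3 0-RTT)
EOF
}

# Header added to every proxied location. nginx omits the header when
# $ssl_early_data is empty, i.e. when the request did not arrive as 0-RTT.
emit_proxy_headers() {
    cat <<'EOF'
        proxy_set_header Early-Data $ssl_early_data;
EOF
}
```

The heredocs are quoted ('EOF') so the shell leaves $ssl_early_data untouched and nginx sees the variable name literally.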

Why

For repeat clients that do not keep their TLS connection alive (curl,
mobile apps, some SDKs that close sockets between requests), each new
request currently pays a fresh TLS handshake on top of TCP.

  • TLS 1.3 full handshake: 1-RTT.
  • TLS 1.3 resumption with tickets: 1-RTT.
  • TLS 1.3 resumption with tickets + 0-RTT: 0-RTT — client sends
    the request along with the ClientHello.

From a developer machine at ~99ms RTT to cpu01, that's ~100ms saved
per cold reconnect. The ingress sits on every CVM (cloud-api,
chat-api, inference CVMs), so the win applies to every public endpoint
behind it: cloud-api.near.ai, cloud.near.ai, agent.near.ai,
*.completions.near.ai.
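
The arithmetic behind the ~100ms figure can be sketched as follows, counting round trips until the first response byte on a cold connection. It assumes one RTT for the TCP handshake and one for the HTTP request/response itself, and ignores server processing time; the function name is illustrative.

```shell
# time_to_first_byte_ms RTT_MS TLS_HANDSHAKE_RTTS
time_to_first_byte_ms() {
    local rtt_ms=$1 tls_rtts=$2
    # 1 RTT for TCP + TLS handshake round trips + 1 RTT for request/response
    echo $(( rtt_ms * (1 + tls_rtts + 1) ))
}

full=$(time_to_first_byte_ms 99 1)      # full handshake or ticket resumption
early=$(time_to_first_byte_ms 99 0)     # resumption with 0-RTT early data
echo "1-RTT: ${full}ms, 0-RTT: ${early}ms, saved: $(( full - early ))ms"
# prints: 1-RTT: 297ms, 0-RTT: 198ms, saved: 99ms
```

Ticket resumption alone stays at one handshake round trip in TLS 1.3, so the latency win comes from early data; tickets are what make early data possible.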

0-RTT replay risk

RFC 8446 §8 documents that 0-RTT data can be replayed by an attacker
who captures the early-data payload. The mitigation surface is:

  1. nginx forwards Early-Data: 1 to the upstream when the request
    was carried in 0-RTT. Each location block in the generated configs
    now sets:

    proxy_set_header Early-Data $ssl_early_data;
  2. Backends can reject Early-Data on side-effectful methods. None
    of the backends currently behind this ingress check that header.
    Follow-ups to track (separate PRs):

    • cloud-api (cloud-api.near.ai) — POST /v1/chat/completions,
      POST /v1/responses, POST /v1/embeddings, etc. Strictly
      speaking these are idempotent in spec, but billing and audit
      logging mean we'd rather not replay them. Audit needed.
    • inference-proxy — similar; POST inference + GPU attestation.
    • chat-api (agent.near.ai) — POST chat / agent state mutations.
      Replay risk is highest here.

    Until those land, the worst case for a single replay is: an
    attacker who captured a ciphertext within the
    ssl_session_timeout window (1d) replays it; the backend processes
    it again. For LLM completions and stateless inference reads, the
    user-observable damage is minimal — duplicate billing event at
    worst. For chat-api the audit is more nuanced and is the most
    pressing follow-up.
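
The backend-side check described in item 2 could follow logic along these lines. This is a sketch of the decision only, written as a shell function for illustration; the real backends would implement it in their own request handling. The function name and the choice of which methods count as safe to replay (GET, HEAD, OPTIONS here) are assumptions, not something this PR defines. 425 Too Early is the status code RFC 8470 defines for this purpose.

```shell
# early_data_status METHOD EARLY_DATA_HEADER_VALUE
# Prints the HTTP status a backend might answer with before doing any work:
# 425 (Too Early) if the request arrived as 0-RTT and could have side effects,
# 200 as a stand-in for "process normally" otherwise.
early_data_status() {
    local method=$1 early_data=${2:-}
    if [ "$early_data" != "1" ]; then
        echo 200            # header absent: not early data
        return
    fi
    case "$method" in
        GET|HEAD|OPTIONS) echo 200 ;;   # assumed safe to replay
        *)                echo 425 ;;   # ask the client to retry after the handshake
    esac
}

early_data_status POST 1    # 425
early_data_status GET 1     # 200
early_data_status POST ""   # 200
```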

The conservative alternative is ssl_early_data off; and accept the
~100ms loss per cold reconnect. I'm choosing on here because (a)
TLS 1.3 0-RTT is widely deployed (Cloudflare, Google, AWS ALB default
it on), (b) Early-Data is propagated downstream so the mitigation
hook is in place, and (c) we can flip back to off in a one-line
follow-up if any of the follow-up audits surface a real risk.

Scope

Intentionally narrow:

  • Only ssl_session_tickets, ssl_early_data, and the
    Early-Data proxy header.
  • No base-image bump, no listen 443 quic, no Alt-Svc, no layout
    changes.

There is a parallel HTTP/3 PR in flight that bumps the base nginx
image and adds QUIC. To minimize merge conflict, this PR steers clear
of the listen blocks and the base image. Either PR can land first; the
other rebases trivially.

Deployment

Requires rebuilding the dstack-ingress-vpc image and rolling it out
to every CVM that runs this ingress (cpu01/cpu02 cloud-api prod+stg,
agent0/agent1 CVMs, all inference CVMs). This PR does not initiate
rollout; the image build job runs on merge to main, and the actual
CVM updates go through compose-manager / cvm-compose-files as usual.

Verification

  • bash -n clean on both modified scripts.
  • docker build . against the pinned base image (nginx@sha256:b6653fca…,
    which is nginx 1.27.4) succeeds.
  • nginx -t clean for all four config-generation paths, using the
    project's own built image:
    • single-target mode without rate limiting
    • single-target mode with RATE_LIMIT_PATHS
    • upstream LB mode without rate limiting
    • upstream LB mode with RATE_LIMIT_PATHS
  • After merge, plan to verify on one CVM (cpu01:9450 staging) before
    rolling fleet-wide: confirm Session-ID reused and 0-RTT indicated
    via openssl s_client -reconnect and curl --tls-max 1.3 -v over two
    separate connections.

Test plan

  • CI build job (Build & Deploy) succeeds
  • Image pulls into one staging CVM (cpu01 cloud-api-stg slot)
  • openssl s_client -connect cloud-stg-api.near.ai:443 -reconnect
    shows Reused, TLSv1.3 on second-and-later handshakes
  • curl -v https://cloud-stg-api.near.ai/health over a fresh
    connection shows Early data was accepted by the server (or
    equivalent client-side indicator) on a resumed handshake
  • No regression in normal request flow (smoke test
    POST /v1/chat/completions via the existing infra-tests run)
  • Roll out to remaining CVMs via compose-manager once staging is
    validated
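
The resumption check in the test plan could be scripted roughly as below. The helper name is hypothetical, and it assumes openssl s_client prints one "New, TLSv1.3, ..." or "Reused, TLSv1.3, ..." summary line per handshake, which is worth confirming against the OpenSSL version in use.

```shell
# Counts handshakes that openssl s_client reported as resumed over TLS 1.3.
# Reads s_client output on stdin.
count_reused_tls13() {
    grep -c '^Reused, TLSv1\.3' || true   # grep exits 1 on zero matches
}

# Usage against staging (needs network access, so left commented out):
#   openssl s_client -connect cloud-stg-api.near.ai:443 -reconnect \
#       </dev/null 2>/dev/null | count_reused_tls13
```

A count of zero after the rollout would suggest tickets are not being issued or accepted; it says nothing about 0-RTT, which still needs the separate early-data check.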

Flip the existing ssl_session_tickets and ssl_early_data directives from
off to on in both nginx config generators (setup_nginx_conf in
entrypoint.sh and generate-nginx-upstream.sh). The session cache size
(shared:SSL:50m) and timeout (1d) were already configured.

For repeat clients that don't keep TLS connections alive (curl, mobile,
some SDKs), this eliminates ~1 RTT per reconnect: tickets give 1-RTT
TLS 1.3 resumption, ssl_early_data gives 0-RTT. From a typical client at
~100ms RTT to a CPU CVM, that is ~100ms saved per cold handshake.

0-RTT replay risk is mitigated by forwarding the Early-Data header to
backends in every proxied location, so cloud-api / chat-api /
inference-proxy can reject Early-Data on non-idempotent methods. None
of them act on this header today; follow-up audits are tracked in the
PR description.

Scope is intentionally narrow: no base-image bump, no UDP/QUIC listen
directives, no layout changes. There is a parallel PR adding HTTP/3 +
Alt-Svc; this one only touches SSL session / ticket / early-data
directives to minimize merge conflict.

Verified: docker build succeeds; nginx -t passes for all four
config-generation paths (single-target +/- rate-limit, upstream LB +/-
rate-limit) using the project's own pinned base image.
@claude

claude Bot commented May 13, 2026

Review: TLS session resumption + 0-RTT

Thanks for the thorough write-up — the risk framing in the PR description is what made this easy to review. Most of the comments below are nits or follow-up suggestions; nothing here blocks merge in my opinion.

Higher-priority concerns

1. Ticket-key rotation / forward secrecy is overstated by the inline comment. The comment says:

ssl_session_tickets: nginx rotates ticket keys on reload — sufficient for a single-replica CVM.

Two things to tighten here:

  • nginx generates the ticket key when a configuration is loaded: once at startup and again on each SIGHUP reload. Reloads aren't automatic, though; without an external trigger, the same key encrypts every ticket for as long as that configuration stays loaded.
  • Combined with ssl_session_timeout 1d, that means: if the ticket key is ever compromised (memory disclosure, side-channel, container snapshot), an attacker can decrypt up to one day of past resumed sessions. That defeats one of the main reasons we use ECDHE.

Given these are CVMs, the threat model is interesting — the worker memory is inside the TEE, so the surface is narrower than a generic VM. But it's worth either (a) softening the comment to be accurate, or (b) following up with a cron-driven `kill -HUP $(pidof nginx)` or an external `ssl_session_ticket_key` file rotated daily. Probably a separate PR.

2. Half-deployed mitigation window. The PR description acknowledges that none of the current backends inspect `Early-Data` yet, and lists the audits as separate follow-ups. That's fine as a sequencing choice, but the practical implication is: between merge and the backend-audit PRs landing, the deployed system has `ssl_early_data on` with no replay rejection at all. The mitigation hook (`Early-Data` header) is wired but no one is reading it.

For `chat-api` in particular — which the PR itself flags as the highest replay risk — it might be worth landing the backend `425 Too Early` check first, then this PR, rather than the other way around. Or at minimum, link the follow-up tickets here so the sequencing is visible.

3. WebSocket upgrades + 0-RTT. The `location ~ ^/(ws|socket.io)/` block also gets `Early-Data` forwarded, but a WebSocket upgrade handshake replayed in 0-RTT data would open a duplicate connection (and potentially bypass any one-shot auth-token-in-upgrade-request flow). I do not see WS auth here, but if any backend behind this ingress uses upgrade-time tokens, it is a real consideration. RFC 8470 explicitly calls out that protocol switches should not be allowed in 0-RTT. Easiest local fix: `if ($ssl_early_data) { return 425; }` inside the WS location, since the `ssl_early_data` directive itself is only valid in `http` and `server` context and cannot be turned off per location.
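
For option (b) under concern 1, an external ticket key rotated on a schedule might look like the sketch below. The directory layout, file names, and function name are assumptions; the 80-byte key size follows the nginx documentation for `ssl_session_ticket_key`. Keeping the previous key lets tickets issued before the rotation still resume until they expire.

```shell
# rotate_ticket_keys DIR
# Moves the current key to previous.key and writes a fresh 80-byte key.
rotate_ticket_keys() {
    local dir=$1
    mkdir -p "$dir"
    if [ -f "$dir/current.key" ]; then
        mv -f "$dir/current.key" "$dir/previous.key"
    fi
    # umask in a subshell so the key is created owner-readable only
    ( umask 077; head -c 80 /dev/urandom > "$dir/current.key" )
}

# The generated server block would then reference both files; nginx uses
# the first key listed to encrypt new tickets:
#   ssl_session_ticket_key /etc/nginx/tickets/current.key;
#   ssl_session_ticket_key /etc/nginx/tickets/previous.key;
# followed by `nginx -s reload` so the workers pick up the new keys.
# previous.key only exists after the second rotation, so the config
# should reference it only once it is present.
```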

Smaller things

4. gRPC mode. `PROXY_CMD=grpc` is set when `TARGET_ENDPOINT` starts with `grpc://`. All gRPC calls are POSTs over HTTP/2, and per-call idempotency is service-defined — so the same Early-Data forwarding rationale applies but backend-side mitigation is harder (most gRPC frameworks do not expose request headers to service code in a uniform way). Worth adding to the follow-up audit list if any gRPC backends sit behind this ingress.

5. The "strictly speaking idempotent" claim in the PR description. `POST /v1/chat/completions` is not idempotent per HTTP semantics: POST is not among the methods RFC 9110 §9.2.2 lists as idempotent. I think you mean "logically idempotent in the application sense": the underlying inference is a pure function, but billing, audit logging, and rate-limit counters are side effects. Worth phrasing precisely in the PR description so a reviewer does not conclude it is safe to replay.

6. Verification step naming. In the test plan, "Session-ID reused" is a TLS 1.2 concept (Session ID from RFC 5246). TLS 1.3 does not have a session ID in the same sense — it uses session tickets / PSK identity. `openssl s_client -reconnect` against a TLS 1.3 server will show `Reused, TLSv1.3` (which the next bullet already calls out correctly). Minor wording fix only.

7. Empty-value header behavior — confirm this is intentional. `proxy_set_header Early-Data $ssl_early_data;` will set `Early-Data: 1` when 0-RTT is used and omit the header entirely when it is not (nginx drops empty proxy_set_header values). That matches RFC 8470 §5.1, which is the right behavior — but worth a one-line code comment because the alternative ("always set, value empty when not 0-RTT") would look identical in the diff and be wrong. The existing comment says what the header is for but not that it intentionally relies on nginx's empty-value-elision behavior.

Test coverage

The repo does not have an automated test path for the generated nginx configs (the only checks are `bash -n` and a manual `nginx -t` matrix per the PR description). Pre-existing limitation, not introduced here. For a TLS-config PR specifically, a regression test that:

  1. runs `entrypoint.sh` / `generate-nginx-upstream.sh` with each combination of env vars,
  2. greps the output for the expected directives,

would be cheap and prevent quiet regressions. Not in scope for this PR but worth a follow-up if these scripts keep accumulating SSL config.
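
A minimal version of that regression check might look like this. It assumes the generated config has already been written to a file; the function name and the exact directive list are illustrative.

```shell
# check_tls_directives CONF_FILE
# Returns non-zero and names each missing directive on stderr.
check_tls_directives() {
    local conf=$1 missing=0 directive
    for directive in \
        'ssl_session_tickets on;' \
        'ssl_early_data on;' \
        'proxy_set_header Early-Data $ssl_early_data;'
    do
        # -F: match the directive as a fixed string, not a regex
        if ! grep -qF -- "$directive" "$conf"; then
            echo "missing: $directive" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

In CI this would run once per env-var combination (single-target and upstream LB, each with and without `RATE_LIMIT_PATHS`), against whatever file the generator writes in that mode.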

Things this PR does well

  • Narrow scope, clear rollback path (one-line flip back to `off`).
  • All four config-generation modes manually exercised against `nginx -t`.
  • Risk analysis in the PR description names the specific backends and the specific replay surface, instead of generic hand-waving.
  • Doesn't conflict with the parallel HTTP/3 PR — the scoping choice is good.
  • The deployment note ("this PR does not initiate rollout") is exactly the kind of guard that prevents accidental fleet-wide rollouts.

Overall: I'd be comfortable merging this, with a preference for landing the `chat-api` Early-Data check first if that's possible without much extra effort. Otherwise the follow-up audit tickets should be filed before this lands so the sequencing is tracked.

🤖 Generated with Claude Code

@Evrard-Nil

Scope note (post-investigation): this PR affects chat-api (private.near.ai, agent.near.ai) and inference CVMs (vllm-ingress), which terminate TLS via nearaidev/dstack-ingress-vpc. It does not affect cloud-api.near.ai — that uses a different ingress (nearaidev/cvm-ingress, sidecar pattern). For TLS-handshake wins on cloud-api, a parallel PR against nearai/cvm-ingress is needed.
