Avoid long stalls during TLS handshakes #2706

Open
@nyh

In one ScyllaDB workload using an HTTPS server, we noticed that each connection establishment causes a roughly-30ms stall.

While it is sad that each TLS handshake takes 30ms (it means that each shard can only do about 30 of them per second...), what is much more troubling for a Seastar application is that these handshakes run without preemption points, so each one causes a 30ms stall and potentially huge latencies for other requests running on the same shard.
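To make "preemption point" concrete, here is a minimal sketch in plain standard C++ (deliberately not Seastar or OpenSSL code). The function `run_with_preemption_points`, its parameters, and the `yield_to_others` callback are all hypothetical names for illustration; in a real Seastar application the callback would correspond to something like `seastar::thread::maybe_yield()` called from inside a `seastar::thread`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Result of the chunked computation: the computed value and how many
// times the loop gave other work a chance to run.
struct chunked_result {
    std::uint64_t value;
    std::uint64_t yields;
};

// Hypothetical stand-in for a long CPU-bound computation (such as the
// crypto in a handshake). Every `check_every` rounds it calls
// `yield_to_others`, which plays the role of a preemption point.
chunked_result run_with_preemption_points(
        std::uint64_t rounds,
        std::uint64_t check_every,
        const std::function<void()>& yield_to_others) {
    assert(check_every > 0);
    std::uint64_t acc = 1;
    std::uint64_t yields = 0;
    for (std::uint64_t i = 0; i < rounds; ++i) {
        // Arbitrary arithmetic that just burns CPU deterministically.
        acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
        if ((i + 1) % check_every == 0) {
            yield_to_others();  // preemption point
            ++yields;
        }
    }
    return {acc, yields};
}
```

Without the periodic callback, the loop runs to completion before anything else on the same thread can make progress, which is the stall described above.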

This issue isn't about making handshakes faster (which we should do) or reducing their number (which we've been doing - #2154 is one such attempt). This issue is about avoiding the stall during the handshake when we can't avoid the handshake itself.

I can think of different ways to avoid these stalls, in decreasing order of desirability but increasing ease of implementation:

  1. Modify the TLS implementation to use Seastar futures and incorporate preemption checks. This is probably not a realistic solution without massive modifications to OpenSSL - unless OpenSSL comes with hooks to do that, which I'm guessing it doesn't.
  2. A simpler version of 1, probably still requiring modifications to OpenSSL but far fewer, is to run these TLS handshakes in a seastar::thread and add preemption points in the right places.
  3. An approach that could work without modifications to OpenSSL is to run the handshake in a different Linux thread. This will be ugly, but in some setups we already reserve separate cores for networking, so maybe it makes sense to do the same for TLS requests. Or, even if we run these TLS threads on the same cores as ordinary Seastar (the horror!), we'll still get stalls (when the Seastar thread isn't running) but probably not 30ms stalls.
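As a rough illustration of option 3, the sketch below runs a blocking computation on a separate OS thread while the calling thread keeps polling, so it stays free to do other work. It uses only the C++ standard library and is not Seastar's or OpenSSL's API: `fake_handshake`, `offload_result`, and `run_offloaded` are hypothetical names, and a real implementation would need some Seastar-friendly way to hand the result back to the reactor instead of a busy poll.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <thread>
#include <utility>

// Hypothetical stand-in for a CPU-heavy, non-preemptible TLS handshake.
// It is not an OpenSSL call; it just burns CPU deterministically.
std::uint64_t fake_handshake(std::uint64_t rounds) {
    std::uint64_t acc = 1;
    for (std::uint64_t i = 0; i < rounds; ++i) {
        acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
    }
    return acc;
}

struct offload_result {
    std::uint64_t value;       // what the handshake computed
    std::uint64_t polls;       // times the caller's loop ran meanwhile
    bool ran_on_other_thread;  // true if the work left the caller's thread
};

// Runs fake_handshake on a separate OS thread. The caller polls for
// completion instead of blocking, which stands in for the reactor
// continuing to serve other requests.
offload_result run_offloaded(std::uint64_t rounds) {
    const std::thread::id caller = std::this_thread::get_id();
    auto fut = std::async(std::launch::async, [rounds] {
        return std::make_pair(fake_handshake(rounds),
                              std::this_thread::get_id());
    });
    assert(fut.valid());
    std::uint64_t polls = 0;
    while (fut.wait_for(std::chrono::seconds(0)) !=
           std::future_status::ready) {
        ++polls;                    // other work would be done here
        std::this_thread::yield();
    }
    const auto out = fut.get();
    return {out.first, polls, out.second != caller};
}
```

If the worker shares a core with the polling thread, the kernel scheduler time-slices between them, which matches the expectation above that stalls would remain but be shorter than a whole handshake.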

Another thing we should do, which I'll tack onto this issue but which perhaps should be split into a separate issue, is to add metrics that will be useful for analyzing these slow TLS handshake problems. Perhaps count the number of handshakes, or the number of various cryptographic calculations, and perhaps also measure the time that each handshake takes (if there is no preemption, this time is easy to calculate).
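A minimal sketch of what such counters could look like, in plain standard C++. The names `handshake_stats` and `timed_handshake` are hypothetical, and wiring the values into Seastar's metrics layer is left out. It relies on the observation above: with no preemption inside the handshake, the wall-clock time around the call is the handshake's own time.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical per-shard counters for TLS handshakes.
struct handshake_stats {
    std::uint64_t count = 0;               // handshakes performed
    std::chrono::nanoseconds total{0};     // summed handshake time
    std::chrono::nanoseconds longest{0};   // slowest single handshake
};

// Calls `do_handshake` (which must return a value), times it with a
// monotonic clock, and updates the counters.
template <typename Func>
auto timed_handshake(handshake_stats& stats, Func&& do_handshake) {
    const auto start = std::chrono::steady_clock::now();
    auto result = std::forward<Func>(do_handshake)();
    const auto elapsed =
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start);
    assert(elapsed.count() >= 0);  // steady_clock never goes backwards
    ++stats.count;
    stats.total += elapsed;
    stats.longest = std::max(stats.longest, elapsed);
    return result;
}
```

Dividing `total` by `count` gives the average handshake time, and `longest` gives a lower bound on the worst stall a handshake caused on that shard.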

CC @elcallio @avikivity
