Skip to content

websockets: fix ping_timeout #3376

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 22, 2025
Merged

Conversation

oliver-sanders
Copy link
Contributor

@oliver-sanders oliver-sanders commented Apr 30, 2024

Closes #3258
Closes #2905
Closes #2655

Fixes an issue with the calculation of ping timeout interval which caused connections to be erroneously closed from the server end. This could happen shortly after the connection was opened, before the ping was even sent (as reported in #3258) as the result of startup logic. It could also happen when pong responses were sent back from the client within the configured timeout as the result of a race condition.

To test this fix, try repeating the steps in this example: #3258 (comment)

I found a TODO to implement testing for ping timeout. I tried to fill this in but couldn't get it to work, any pointers appreciated (note, it has to be tested from the server side).

Copy link
Member

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this is probably going to involve some ugly hacks one way or another. My first thought is to just overwrite the write_ping method: self.ws_connection.write_ping = lambda x: None (and factor out a write_pong method for the other side where it currently calls self._write_frame(True, 0xA, data)).

I think there are multiple distinct problems with this ping timeout code:

  • If ping_timeout == ping_interval, the connection breaks before the first ping is sent, and if that is avoided, it is likely to break other times due to race conditions. This was your finding in websocket disconnect and reconnect every 10 minutes #3258 (comment) and is fixed by this PR
  • If ping_timeout > ping_interval, we are slow to detect failure - we don't check for timeout until the next ping is sent (websocket: Fix ping_timeout < ping_interval #2905). This is also fixed by your PR (and it's a good reason to restructure the code instead of a small change to the since_last_(ping,pong) logic in periodic_ping.
  • Other folks, such as the original message in websocket disconnect and reconnect every 10 minutes #3258, have reported problems when ping_interval and ping_timeout are not the same. It's not clear to me what exactly the problem is here, or whether this PR fixes it.
  • Does it actually make sense to set ping_interval < ping_timeout? That's the default, but since websockets go over TCP I'm not sure it makes sense to allow multiple pings in flight before detecting failure. I'm not sure what I was thinking when this was written/merged, but now I feel like a sensible configuration would be something like ping_interval=30; ping_timeout=5.

@oliver-sanders
Copy link
Contributor Author

Does it actually make sense to set ping_interval < ping_timeout?

I did find it a bit strange that this is possible.


Since websockets are TCP, I don't think it really makes sense to use a short ping interval with a long timeout,

I don't think pings & pongs can carry any [meta]data, so it is not possible to associate a pong with the ping that it was responding to. As a result, ping_timeout > ping_interval does not make sense at the Websocket layer either as we don't have the information required to implement it that way.


I feel like a sensible configuration would be something like ping_interval=30; ping_timeout=5.

👍


Unconditionally sleeping for ping_timeout will cause problems if ping_timeout is greater than ping_interval (which is the default) - we won't start the next ping until the old timeout sleep has passed.

I did consider this, but thought of it more as the result of a quirky configuration rather than a bug in the implementation.

For configurations where ping_timeout > ping_interval:

  • Under normal circumstances, it will ping at the ping_interval.
  • If the client gets laggy, the ping frequency will drop to ~2x ping time.
  • If the client goes quiet, the server will stop pinging until the timeout is hit (or a pong is received).

Unless we want to effectively require that ping_interval >= ping_timeout

One option would be to reject such configurations:

assert ping_timeout is None or ping_timeout <= ping_interval

However, that might be disruptive. Another option would be to raise a warning and fall back to the ping_interval:

if ping_timeout > ping_interval:
    app_log.warn(
        'The configured websocket_ping_timeout is longer than the websocket_ping_interval.'
        '\nSetting websocket_ping_timeout = websocket_ping_interval'
    )
    ping_timeout = ping_interval

Other folks, such as the original message in #3258, have reported problems when ping_interval and ping_timeout are not the same.

Setting ping_timeout = ping_interval is an easy way to reveal the flaws with the current logic, however, it can fail when these values are not the same too.



Here are my suggestions:

  1. Send pings and respond to timeouts in the same coroutine i.e. remove the PeriodicCallback (as suggested above).
  2. Change the default ping_timeout from max(3 * self.ping_interval, 30) to max(self.ping_interval, 30).
  3. If ping_timeout is configured longer than ping_interval, set ping_timeout = ping_interval and log a warning.

What do you think?

@bdarnell
Copy link
Member

I don't think pings & pongs can carry any [meta]data, so it is not possible to associate a pong with the ping that it was responding to. As a result, ping_timeout > ping_interval does not make sense at the Websocket layer either as we don't have the information required to implement it that way.

They can, actually. RFC 6455 section 5.5.2 and 5.5.3 say that a ping frame may include "application data", and a pong frame must copy the application data from the ping frame it is responding to. So the information is there to match up pings and pongs. But even then, would it be useful to send a flurry of pings faster than the RTT? I'm not seeing any use case for that (at least in HTTP/1 and HTTP/2 - maybe in HTTP/3 depending on how they model the websocket stream in a reorderable connection)

For configurations where ping_timeout > ping_interval:

Under normal circumstances, it will ping at the ping_interval.

I don't think that's true - there's no early exit from the sleep, so it will ping at ping_timeout instead of ping_interval. That's what I was getting at when I said "unconditional sleep". Am I missing something? I'd be OK with the behavior you're describing in this message but not what I think the code is doing.

One option would be to reject such configurations:

This is probably the right answer; I only hesitate because the default ping_timeout property gets this backwards. We can change this default, but the question is whether people have made explicit configurations based on this default too. Overall I think it's probably best to set ping_timeout = ping_interval if they're inverted just to avoid transition pains.

  1. Send pings and respond to timeouts in the same coroutine i.e. remove the PeriodicCallback (as suggested above).
  2. Change the default ping_timeout from max(3 * self.ping_interval, 30) to max(self.ping_interval, 30).
  3. If ping_timeout is configured longer than ping_interval, set ping_timeout = ping_interval and log a warning.

Yes, except that I'd rather not have a warning at all than have a warning every time a websocket is opened. I'm fine without a warning; if you'd like to keep it put it behind some kind of flag so it only logs once (I don't think we have an existing idiom for this).

@oliver-sanders
Copy link
Contributor Author

@bdarnell - Apologies for the delay.

I have implemented the above suggestions and found a way to test.

One change to the above discussion, I changed the default ping timeout to min(ping_interval, 30) (i.e. use the maximum permitted timeout up to 30 seconds, then continue with 30 seconds thereafter).

@@ -97,6 +99,9 @@ def log_exception(

_default_max_message_size = 10 * 1024 * 1024

# log to "gen_log" but suppress duplicate log messages
de_dupe_gen_log = functools.lru_cache(gen_log.log)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat trick; I haven't seen this one before.

@@ -274,17 +279,40 @@ async def get(self, *args: Any, **kwargs: Any) -> None:

@property
def ping_interval(self) -> Optional[float]:
"""The interval for websocket keep-alive pings.
"""Send periodic pings down the websocket.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc style nit: This is a property, not a method, so it should be described as a noun and not a verb.

What is "This" in the next sentence? Should say something like "If this is non-zero, the websocket will send a ping every ping_interval seconds".

We should probably explain somewhere the relationship between these ping_interval properties and the corresponding websocket_ping_interval app setting if it's not already clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorted.

return timeout
assert self.ping_interval is not None
return max(3 * self.ping_interval, 30)
return max(self.ping_interval, 30)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This disagrees with the docs, which say min instead of max. It needs to be min to satisfy the new rule.

Should that 30 even be there at all? I think it would be simplest and most consistent to just say that if ping_interval is set, ping_timeout defaults to ping_interval (and ping_timeout can be explicitly set to any value <= ping_interval if desired)

Copy link
Contributor Author

@oliver-sanders oliver-sanders Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected the max (sorry).

Should that 30 even be there at all?

A reasonable ping interval might be 1 hour, but I don't think it would ever be reasonable for a websocket pong response to take longer than 30 seconds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reasonable ping interval might be 1 hour, but I don't think it would ever be reasonable for a websocket pong response to take longer than 30 seconds.

Is an hour a reasonable ping interval? That's a long time to wait to detect a failed connection.

On the other side, it's not only about the time a response takes, it's about how long the device can go to sleep before the connection is broken. If my router will leave a connection open while my laptop/phone is asleep, I don't think we need to have a separate hard-coded timeout here.

Stepping back, applications don't really care about the details here as much as they care about the time from the connection breaking to the connection-close callback. So even though interval=3600,timeout=30 has a much lower timeout than interval=timeout=3600, it only cuts the time-to-detect in half. I think it feels reasonable to say "by default failed connections will be detected in 2x the interval in the worst case; adjust the timeout value if you want to" instead of picking a hard-coded limit here (especially when my expectation is that most people who use timeouts with websockets will want an interval closer to a minute than an hour).

and since_last_pong > self.ping_timeout
):
self.close()
if interval <= 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check should probably go in start_pinging so we don't create a task unnecessarily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# make sure we received a pong within the timeout
if (
timeout > 0
# and ping_time - self.last_pong >= timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look like it was supposed to be left commented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

# and ping_time - self.last_pong >= timeout
and (
# pong took longer than the timeout
self.last_pong - ping_time > timeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it is only possible in edge cases: the timeout has passed, but before this coroutine could be rescheduled, the pong was received. I don't think that matters and we could simplify a little bit. I think last_pong could be replaced with a received_pong boolean that is set to false when we send a ping and to true when we receive a pong.

Copy link
Contributor Author

@oliver-sanders oliver-sanders Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Removed this check (if the client responded slightly outside of the timeout but the server didn't check within this window, we might as well keep the connection open).
  • Replaced last_pong: int with _received_pong: bool.

return

# wait until the next scheduled ping
await asyncio.sleep(interval - timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will drift a little bit, especially if your IOLoop gets blocked from time to time. I'm not sure that really matters but if we care we could sleep for IOLoop.current().time() - ping_time + interval.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't matter, but went with your suggestion anyway (might as well be as accurate with our pings as the clock allows).

@@ -831,14 +835,79 @@ def on_ping(self, data):

@gen_test
def test_client_ping(self):
ws = yield self.ws_connect("/", ping_interval=0.01)
ws = yield self.ws_connect("/", ping_interval=0.01, ping_timeout=0)
for i in range(3):
response = yield ws.read_message()
self.assertEqual(response, "got ping")
# TODO: test that the connection gets closed if ping responses stop.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this TODO with your new test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

class PingHandler(TestWebSocketHandler):
def initialize(self, close_future=None, compression_options=None):
# capture the handler instance so we can interrogate it later
nonlocal handlers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this nonlocal statement is unnecessary (I use the pattern of a mutable list because it worked before nonlocal was introduced; I haven't bothered to learn how to use nonlocal myself)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

(I just have a habit of using nonlocal in these situations for clarity)

# print(f'$ {oppcode} {data}') # uncomment to debug
if oppcode == 0xA: # NOTE: 0x9=ping, 0xA=pong
from time import sleep
sleep(delay)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking time.sleep seems problematic here - it's not a realistic simulation of what happens in production (but now I see why you did the time checking below).

How about this as a testing technique: instead of delay_pong, do drop_ping which replaces protocol.write_ping with a no-op (you can use unittest.mock.patch for this). Then pings don't get sent (even though the ping coroutine thinks they are), pongs don't get sent, and the connection should get closed on its own.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to test the actual timings as closely as I could, but agree, it's a bit bunk.

I've changed this from delaying the client pong, to just suppressing the client pong (no need to suppress the server ping).

I haven't used patch for this as it's an instance method (so doesn't need to be reset post testing) and it's easier to target just the pongs with a wrapper.

* Closes tornadoweb#3258
* Closes tornadoweb#2905
* Closes tornadoweb#2655
* Fixes an issue with the calculation of ping timeout interval that
  could cause connections to be erroneously timed out and closed
  from the server end.
Copy link
Member

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. After CI I can merge and take care of the max(30).

@bdarnell bdarnell merged commit 5b349e5 into tornadoweb:master Apr 22, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants