websockets: fix ping_timeout #3376

oliver-sanders · 2024-04-30T15:51:58Z

Closes #3258
Closes #2905
Closes #2655

Fixes an issue with the calculation of ping timeout interval which caused connections to be erroneously closed from the server end. This could happen shortly after the connection was opened, before the ping was even sent (as reported in #3258) as the result of startup logic. It could also happen when pong responses were sent back from the client within the configured timeout as the result of a race condition.

To test this fix, try repeating the steps in this example: #3258 (comment)

~~I found a TODO to implement testing for ping timeout. I tried to fill this in but couldn't get it to work, any pointers appreciated (note, it has to be tested from the server side).~~

bdarnell

Testing this is probably going to involve some ugly hacks one way or another. My first thought is to just overwrite the write_ping method: self.ws_connection.write_ping = lambda x: None (and factor out a write_pong method for the other side where it currently calls self._write_frame(True, 0xA, data)).

I think there are multiple distinct problems with this ping timeout code:

If ping_timeout == ping_interval, the connection breaks before the first ping is sent, and if that is avoided, it is likely to break other times due to race conditions. This was your finding in #3258 (comment) and is fixed by this PR.
If ping_timeout > ping_interval, we are slow to detect failure - we don't check for timeout until the next ping is sent (#2905). This is also fixed by your PR (and it's a good reason to restructure the code instead of making a small change to the since_last_(ping,pong) logic in periodic_ping).
Other folks, such as the original message in #3258, have reported problems when ping_interval and ping_timeout are not the same. It's not clear to me what exactly the problem is here, or whether this PR fixes it.
Does it actually make sense to set ping_interval < ping_timeout? That's the default, but since websockets go over TCP I'm not sure it makes sense to allow multiple pings in flight before detecting failure. I'm not sure what I was thinking when this was written/merged, but now I feel like a sensible configuration would be something like ping_interval=30; ping_timeout=5.
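For illustration, the configuration floated above could be written as application settings. This is a minimal sketch: `websocket_ping_interval` and `websocket_ping_timeout` are the Tornado application settings named later in this thread, and the values are the ones suggested in the comment rather than a tested recommendation.

```python
# Settings intended to be passed as tornado.web.Application(handlers, **settings).
# The values mirror the "ping_interval=30; ping_timeout=5" suggestion above.
settings = {
    "websocket_ping_interval": 30,  # seconds between keep-alive pings
    "websocket_ping_timeout": 5,    # seconds to wait for the matching pong
}
```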

tornado/websocket.py

oliver-sanders · 2024-07-11T13:38:33Z

Does it actually make sense to set ping_interval < ping_timeout?

I did find it a bit strange that this is possible.

Since websockets are TCP, I don't think it really makes sense to use a short ping interval with a long timeout,

I don't think pings & pongs can carry any [meta]data, so it is not possible to associate a pong with the ping that it was responding to. As a result, ping_timeout > ping_interval does not make sense at the Websocket layer either as we don't have the information required to implement it that way.

I feel like a sensible configuration would be something like ping_interval=30; ping_timeout=5.

👍

Unconditionally sleeping for ping_timeout will cause problems if ping_timeout is greater than ping_interval (which is the default) - we won't start the next ping until the old timeout sleep has passed.

I did consider this, but thought of it more as the result of a quirky configuration rather than a bug in the implementation.

For configurations where ping_timeout > ping_interval:

Under normal circumstances, it will ping at the ping_interval.
If the client gets laggy, the ping frequency will drop to ~2x ping time.
If the client goes quiet, the server will stop pinging until the timeout is hit (or a pong is received).

Unless we want to effectively require that ping_interval >= ping_timeout

One option would be to reject such configurations:

assert ping_timeout is None or ping_timeout <= ping_interval

However, that might be disruptive. Another option would be to raise a warning and fall back to the ping_interval:

if ping_timeout > ping_interval:
    app_log.warning(
        'The configured websocket_ping_timeout is longer than the websocket_ping_interval.'
        '\nSetting websocket_ping_timeout = websocket_ping_interval'
    )
    ping_timeout = ping_interval

Other folks, such as the original message in #3258, have reported problems when ping_interval and ping_timeout are not the same.

Setting ping_timeout = ping_interval is an easy way to reveal the flaws in the current logic; however, it can also fail when these values differ.

Here are my suggestions:

Send pings and respond to timeouts in the same coroutine i.e. remove the PeriodicCallback (as suggested above).
Change the default ping_timeout from max(3 * self.ping_interval, 30) to max(self.ping_interval, 30).
If ping_timeout is configured longer than ping_interval, set ping_timeout = ping_interval and log a warning.

What do you think?

bdarnell · 2024-09-27T18:39:48Z

I don't think pings & pongs can carry any [meta]data, so it is not possible to associate a pong with the ping that it was responding to. As a result, ping_timeout > ping_interval does not make sense at the Websocket layer either as we don't have the information required to implement it that way.

They can, actually. RFC 6455 sections 5.5.2 and 5.5.3 say that a ping frame may include "application data", and a pong frame must copy the application data from the ping frame it is responding to. So the information is there to match up pings and pongs. But even then, would it be useful to send a flurry of pings faster than the RTT? I'm not seeing any use case for that (at least in HTTP/1 and HTTP/2 - maybe in HTTP/3 depending on how they model the websocket stream in a reorderable connection).

For configurations where ping_timeout > ping_interval:

Under normal circumstances, it will ping at the ping_interval.

I don't think that's true - there's no early exit from the sleep, so it will ping at ping_timeout instead of ping_interval. That's what I was getting at when I said "unconditional sleep". Am I missing something? I'd be OK with the behavior you're describing in this message but not what I think the code is doing.

One option would be to reject such configurations:

This is probably the right answer; I only hesitate because the default ping_timeout property gets this backwards. We can change this default, but the question is whether people have made explicit configurations based on this default too. Overall I think it's probably best to set ping_timeout = ping_interval if they're inverted just to avoid transition pains.

Send pings and respond to timeouts in the same coroutine i.e. remove the PeriodicCallback (as suggested above).

Change the default ping_timeout from max(3 * self.ping_interval, 30) to max(self.ping_interval, 30).

If ping_timeout is configured longer than ping_interval, set ping_timeout = ping_interval and log a warning.

Yes, except that I'd rather not have a warning at all than have a warning every time a websocket is opened. I'm fine without a warning; if you'd like to keep it, put it behind some kind of flag so it only logs once (I don't think we have an existing idiom for this).

oliver-sanders · 2025-02-28T13:08:59Z

@bdarnell - Apologies for the delay.

I have implemented the above suggestions and found a way to test.

One change from the above discussion: I changed the default ping timeout to min(ping_interval, 30) (i.e. use the maximum permitted timeout up to 30 seconds, then continue with 30 seconds thereafter).

bdarnell · 2025-03-20T20:12:45Z

tornado/websocket.py

@@ -97,6 +99,9 @@ def log_exception(

 _default_max_message_size = 10 * 1024 * 1024

+# log to "gen_log" but suppress duplicate log messages
+de_dupe_gen_log = functools.lru_cache(gen_log.log)


Neat trick; I haven't seen this one before.
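A standalone sketch of the de-duplication trick in the hunk above, using a throwaway logger instead of Tornado's `gen_log`. The logger name `dedupe_demo` and the `ListHandler` class are hypothetical and exist only for this demo.

```python
import functools
import logging

# Hypothetical logger for this demo; the PR wraps Tornado's gen_log instead.
demo_log = logging.getLogger("dedupe_demo")
demo_log.setLevel(logging.WARNING)
demo_log.propagate = False


class ListHandler(logging.Handler):
    """Collect emitted messages in a list so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
demo_log.addHandler(handler)

# Calls with identical (level, message) arguments hit the cache,
# so the underlying logger method runs only once per distinct call.
log_once = functools.lru_cache(maxsize=None)(demo_log.log)

log_once(logging.WARNING, "ping_timeout clamped to ping_interval")
log_once(logging.WARNING, "ping_timeout clamped to ping_interval")  # cache hit, not logged
log_once(logging.WARNING, "a different message")  # new arguments, logged
```

Two caveats with this approach: every argument must be hashable, and an unbounded cache keeps one entry per distinct message for the life of the process.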

bdarnell · 2025-03-20T20:20:35Z

tornado/websocket.py

@@ -274,17 +279,40 @@ async def get(self, *args: Any, **kwargs: Any) -> None:

    @property
    def ping_interval(self) -> Optional[float]:
-        """The interval for websocket keep-alive pings.
+        """Send periodic pings down the websocket.


Doc style nit: This is a property, not a method, so it should be described as a noun and not a verb.

What is "This" in the next sentence? Should say something like "If this is non-zero, the websocket will send a ping every ping_interval seconds".

We should probably explain somewhere the relationship between these ping_interval properties and the corresponding websocket_ping_interval app setting if it's not already clear.

bdarnell · 2025-03-20T20:32:15Z

tornado/websocket.py

            return timeout
-        assert self.ping_interval is not None
-        return max(3 * self.ping_interval, 30)
+        return max(self.ping_interval, 30)


This disagrees with the docs, which say min instead of max. It needs to be min to satisfy the new rule.

Should that 30 even be there at all? I think it would be simplest and most consistent to just say that if ping_interval is set, ping_timeout defaults to ping_interval (and ping_timeout can be explicitly set to any value <= ping_interval if desired)

Corrected the max (sorry).

Should that 30 even be there at all?

A reasonable ping interval might be 1 hour, but I don't think it would ever be reasonable for a websocket pong response to take longer than 30 seconds.

A reasonable ping interval might be 1 hour, but I don't think it would ever be reasonable for a websocket pong response to take longer than 30 seconds.

Is an hour a reasonable ping interval? That's a long time to wait to detect a failed connection.

On the other side, it's not only about the time a response takes, it's about how long the device can go to sleep before the connection is broken. If my router will leave a connection open while my laptop/phone is asleep, I don't think we need to have a separate hard-coded timeout here.

Stepping back, applications don't really care about the details here as much as they care about the time from the connection breaking to the connection-close callback. So even though interval=3600,timeout=30 has a much lower timeout than interval=timeout=3600, it only cuts the time-to-detect in half. I think it feels reasonable to say "by default failed connections will be detected in 2x the interval in the worst case; adjust the timeout value if you want to" instead of picking a hard-coded limit here (especially when my expectation is that most people who use timeouts with websockets will want an interval closer to a minute than an hour).
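The arithmetic behind "only cuts the time-to-detect in half" can be sketched as follows. The model is an assumption rather than something taken from the code: the connection is taken to break just after a pong arrives, so the server waits out the rest of the interval, sends the next ping, and then waits the full timeout. `worst_case_detection_time` is a hypothetical helper.

```python
def worst_case_detection_time(ping_interval: float, ping_timeout: float) -> float:
    """Upper bound (seconds) from the connection breaking to the server closing it.

    Assumed model: the break happens right after a pong, so nothing is noticed
    until the next ping goes out one interval later and its timeout expires.
    """
    if ping_timeout > ping_interval:
        raise ValueError("ping_timeout must not exceed ping_interval")
    return ping_interval + ping_timeout


short_timeout = worst_case_detection_time(3600, 30)    # 3630 seconds
equal_timeout = worst_case_detection_time(3600, 3600)  # 7200 seconds
ratio = equal_timeout / short_timeout                  # a little under 2
```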

bdarnell · 2025-03-20T20:37:59Z

tornado/websocket.py

-            and since_last_pong > self.ping_timeout
-        ):
-            self.close()
+        if interval <= 0:


This check should probably go in start_pinging so we don't create a task unnecessarily.

bdarnell · 2025-03-20T20:39:11Z

tornado/websocket.py

+            # make sure we received a pong within the timeout
+            if (
+                timeout > 0
+                # and ping_time - self.last_pong >= timeout


This doesn't look like it was supposed to be left commented.

bdarnell · 2025-03-20T20:45:29Z

tornado/websocket.py

+                # and ping_time - self.last_pong >= timeout
+                and (
+                    # pong took longer than the timeout
+                    self.last_pong - ping_time > timeout


This looks like it is only possible in edge cases: the timeout has passed, but before this coroutine could be rescheduled, the pong was received. I don't think that matters and we could simplify a little bit. I think last_pong could be replaced with a received_pong boolean that is set to false when we send a ping and to true when we receive a pong.

Removed this check (if the client responded slightly outside of the timeout but the server didn't check within this window, we might as well keep the connection open).

Replaced last_pong: int with _received_pong: bool.

bdarnell · 2025-03-20T20:47:13Z

tornado/websocket.py

+                return
+
+            # wait until the next scheduled ping
+            await asyncio.sleep(interval - timeout)


This will drift a little bit, especially if your IOLoop gets blocked from time to time. I'm not sure that really matters, but if we care we could sleep until the next scheduled ping, i.e. for ping_time + interval - IOLoop.current().time().

Shouldn't matter, but went with your suggestion anyway (might as well be as accurate with our pings as the clock allows).
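Putting the last few review points together, the loop under discussion can be sketched with plain asyncio. This is a simplified stand-in, not Tornado's implementation: the class `PingLoop` and its `send_ping` and `close` callbacks are hypothetical, and `time.monotonic()` is used in place of the IOLoop clock. It shows the pong flag and a sleep that targets the next scheduled ping time rather than a fixed duration.

```python
import asyncio
import time
from typing import Callable


class PingLoop:
    """Hypothetical server-side keep-alive loop (illustration only)."""

    def __init__(self, interval: float, timeout: float,
                 send_ping: Callable[[], None], close: Callable[[], None]) -> None:
        self.interval = interval
        self.timeout = min(timeout, interval)  # timeout may not exceed interval
        self.send_ping = send_ping
        self.close = close
        self.received_pong = False

    def on_pong(self) -> None:
        # Called by the transport when a pong frame arrives.
        self.received_pong = True

    async def run(self) -> None:
        await asyncio.sleep(self.interval)
        while True:
            # Send a ping and reset the flag.
            self.received_pong = False
            ping_time = time.monotonic()
            self.send_ping()

            # Wait for the timeout, then check whether a pong arrived.
            await asyncio.sleep(self.timeout)
            if self.timeout > 0 and not self.received_pong:
                self.close()
                return

            # Sleep only for what is left of the interval, so delays in the
            # event loop do not accumulate as drift.
            remaining = ping_time + self.interval - time.monotonic()
            await asyncio.sleep(max(0.0, remaining))
```

With a peer that never answers, `run()` sends one ping, waits out the timeout, calls `close()` and returns; with a peer that answers promptly it keeps pinging at roughly `interval` seconds until cancelled.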

bdarnell · 2025-03-20T20:47:57Z

tornado/test/websocket_test.py

@@ -831,14 +835,79 @@ def on_ping(self, data):

    @gen_test
    def test_client_ping(self):
-        ws = yield self.ws_connect("/", ping_interval=0.01)
+        ws = yield self.ws_connect("/", ping_interval=0.01, ping_timeout=0)
        for i in range(3):
            response = yield ws.read_message()
            self.assertEqual(response, "got ping")
        # TODO: test that the connection gets closed if ping responses stop.


I think we can remove this TODO with your new test.

bdarnell · 2025-03-20T20:53:12Z

tornado/test/websocket_test.py

+        class PingHandler(TestWebSocketHandler):
+            def initialize(self, close_future=None, compression_options=None):
+                # capture the handler instance so we can interrogate it later
+                nonlocal handlers


FYI this nonlocal statement is unnecessary (I use the pattern of a mutable list because it worked before nonlocal was introduced; I haven't bothered to learn how to use nonlocal myself)

Removed.

(I just have a habit of using nonlocal in these situations for clarity)

bdarnell · 2025-03-20T21:20:49Z

tornado/test/websocket_test.py

+                # print(f'$ {oppcode} {data}')  # uncomment to debug
+                if oppcode == 0xA:  # NOTE: 0x9=ping, 0xA=pong
+                    from time import sleep
+                    sleep(delay)


Blocking time.sleep seems problematic here - it's not a realistic simulation of what happens in production (but now I see why you did the time checking below).

How about this as a testing technique: instead of delay_pong, do drop_ping which replaces protocol.write_ping with a no-op (you can use unittest.mock.patch for this). Then pings don't get sent (even though the ping coroutine thinks they are), pongs don't get sent, and the connection should get closed on its own.

I wanted to test the actual timings as closely as I could, but agree, it's a bit bunk.

I've changed this from delaying the client pong, to just suppressing the client pong (no need to suppress the server ping).

I haven't used patch for this as it's an instance method (so doesn't need to be reset post testing) and it's easier to target just the pongs with a wrapper.
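The wrapper approach described above might look roughly like this. `FakeProtocol` is a hypothetical stand-in for the client's protocol object; only the `_write_frame(fin, opcode, data)` call shape and the opcodes come from this thread.

```python
class FakeProtocol:
    """Hypothetical stand-in for a websocket protocol; records frames it 'sends'."""

    def __init__(self):
        self.sent = []

    def _write_frame(self, fin, opcode, data):
        self.sent.append((opcode, data))


def suppress_pongs(protocol):
    """Replace _write_frame on this one instance so that pong frames are dropped."""
    original = protocol._write_frame

    def wrapper(fin, opcode, data):
        if opcode == 0xA:  # 0x9 = ping, 0xA = pong
            return None
        return original(fin, opcode, data)

    # Assigning on the instance shadows the class method for this object only,
    # so there is nothing to restore after the test.
    protocol._write_frame = wrapper


client = FakeProtocol()
suppress_pongs(client)
client._write_frame(True, 0x1, b"hello")  # text frame: passes through
client._write_frame(True, 0xA, b"")       # pong: silently dropped
```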

* Closes tornadoweb#3258
* Closes tornadoweb#2905
* Closes tornadoweb#2655
* Fixes an issue with the calculation of ping timeout interval that could cause connections to be erroneously timed out and closed from the server end.

bdarnell

Looks good. After CI I can merge and take care of the max(30).

Appears to be necessary on windows.

lukasmasuch · 2025-06-18T16:15:30Z

Streamlit users are reporting an issue with the websocket ping timeout: streamlit/streamlit#11670 and it seems that this aspect was changed in this PR. The quick solution for us is probably to set the websocket_ping_interval to 30s and hope that it doesn't bring back dropped connections.

I believe the logic with our current setting was to use a small websocket_ping_interval to keep the connection alive even if there are some proxy timeouts, and a relatively high websocket_ping_timeout to have some flexibility for latency or quick disconnects.

I'm also a bit confused by the docs. This page says the following about websocket_ping_timeout:

The default is three times the ping interval, with a minimum of 30 seconds.

which contradicts:

Note, the ping timeout cannot be longer than the ping interval.

Is this info outdated? Do you have a recommendation for what might be a good config value for Streamlit?

cc @bdarnell @oliver-sanders

oliver-sanders · 2025-06-19T07:51:05Z

Do you have a recommendation for what might be a good config value for Streamlit?

Perhaps interval=30s, timeout=30s?

30s is a regular ping and also a pretty generous timeout, 1s sounds a bit extreme!? In an app I work on, if the client is disconnected, we use an exponential backoff to reconnect, e.g. attempt to reconnect after 0.5s, 1s, 2s, ... up to a maximum reconnect period. This works fairly well.

With the present implementation, it is not possible to implement ping timeouts longer than the ping interval because that would require a mechanism to associate the pong response from the client with the ping it was in response to. @bdarnell noted above that it might be possible to include some data with the ping which should be returned with the pong. Providing this is well supported by clients, this could enable ping timeouts longer than the ping interval (e.g. include the time the ping was sent as a payload and compare it when the pong is received).
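A rough sketch of the payload idea described above, independent of any websocket library. The helper names are hypothetical and the 8-byte encoding is just one choice that fits within the 125-byte limit RFC 6455 places on control-frame payloads; whether clients echo the payload reliably is the open question raised in the comment.

```python
import struct
import time
from typing import Optional


def make_ping_payload(now: Optional[float] = None) -> bytes:
    """Encode the send time as an 8-byte big-endian double."""
    return struct.pack("!d", time.monotonic() if now is None else now)


def pong_round_trip(payload: bytes, now: Optional[float] = None) -> Optional[float]:
    """Return seconds since the matching ping, or None if the payload is not ours."""
    if len(payload) != 8:
        return None
    (sent_at,) = struct.unpack("!d", payload)
    return (time.monotonic() if now is None else now) - sent_at


def pong_is_on_time(payload: bytes, ping_timeout: float,
                    now: Optional[float] = None) -> bool:
    """True if the echoed payload shows the pong arrived within ping_timeout."""
    rtt = pong_round_trip(payload, now)
    return rtt is not None and 0 <= rtt <= ping_timeout
```

Because each pong identifies its own ping, several pings could be in flight at once, which is what a timeout longer than the interval would require.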

which contradicts:

I didn't spot that in this PR. Yes, this is outdated; I have found the source in docs/web.rst and will take a look.

…out` (#11693)

## Describe your changes

Tornado changed the logic on how `websocket_ping_interval` and `websocket_ping_timeout` are handled. With the latest version, the `websocket_ping_interval` can never be less than the `websocket_ping_timeout`. This leads to an unexpected ping timeout of 1s when the current Streamlit is using the latest tornado version. To fix this, we are updating the `websocket_ping_interval` to use 30s (suggestion from Tornado maintainer tornadoweb/tornado#3376 (comment)).

This could potentially bring back an old issue with dropping connections in specific proxy setups - #3196. Something we need to watch out for after this gets released.

## GitHub Issue Link (if applicable)

- Closes #11670

## Testing Plan

- Added unit test.

---

**Contribution License Agreement**

By submitting this pull request you agree that all contributions to this project are made under the Apache 2.0 license.
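The effect reported downstream can be reproduced with a small model of the rule discussed in this thread. `effective_ping_timeout` is a hypothetical helper, not a Tornado function. The default branch assumes the outcome of the review (timeout defaults to the interval), and the 1 s interval with a 30 s timeout is inferred from the linked issue title rather than from Streamlit's source.

```python
from typing import Optional


def effective_ping_timeout(ping_interval: float,
                           ping_timeout: Optional[float]) -> float:
    """Resolve the timeout under the rule ping_timeout <= ping_interval."""
    if ping_timeout is None:
        return ping_interval                 # assumed default: same as the interval
    return min(ping_timeout, ping_interval)  # inverted settings are clamped


before_fix = effective_ping_timeout(1, 30)   # 1: the surprising 1 s timeout
after_fix = effective_ping_timeout(30, 30)   # 30: interval raised to match
```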

oliver-sanders mentioned this pull request Apr 30, 2024

config: configure default websocket ping interval cylc/cylc-uiserver#586

Open

8 tasks

toredash mentioned this pull request May 14, 2024

streamlit does not produce WebSocket ping frames streamlit/streamlit#8660

Open

4 tasks

bdarnell reviewed Jul 9, 2024

bdarnell added the websocket label Sep 27, 2024

oliver-sanders force-pushed the 3258 branch from 866acfc to bfc5716 Compare February 28, 2025 12:57

bdarnell reviewed Mar 20, 2025

websockets: fix ping_timeout

ec57518

oliver-sanders force-pushed the 3258 branch from bfc5716 to ec57518 Compare March 21, 2025 15:12

bdarnell approved these changes Apr 2, 2025

bdarnell added 3 commits April 2, 2025 15:27

websocket: Fix lint, remove hard-coded 30s default timeout

f4a83e8

websocket_test: Improve assertion error messages

fa0b5bc

websocket_test: Allow a little slack in ping timing

d14d2bf

Appears to be necessary on windows.

bdarnell merged commit 5b349e5 into tornadoweb:master Apr 22, 2025
15 checks passed

lukasmasuch mentioned this pull request Jun 18, 2025

connection breaks up because WebSocketProtocol13.ping_timeout=1, but expected to be 30 streamlit/streamlit#11670

Closed

4 tasks

oliver-sanders deleted the 3258 branch June 19, 2025 07:43

lukasmasuch mentioned this pull request Jun 25, 2025

Fix the websocket_ping_interval to be at least websocket_ping_timeout streamlit/streamlit#11693

Merged
