-
Notifications
You must be signed in to change notification settings - Fork 5.5k
websockets: fix ping_timeout #3376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing this is probably going to involve some ugly hacks one way or another. My first thought is to just overwrite the write_ping method: self.ws_connection.write_ping = lambda x: None
(and factor out a write_pong
method for the other side where it currently calls self._write_frame(True, 0xA, data)
).
I think there are multiple distinct problems with this ping timeout code:
- If
ping_timeout == ping_interval
, the connection breaks before the first ping is sent, and if that is avoided, it is likely to break other times due to race conditions. This was your finding in websocket disconnect and reconnect every 10 minutes #3258 (comment) and is fixed by this PR - If
ping_timeout > ping_interval
, we are slow to detect failure - we don't check for timeout until the next ping is sent (websocket: Fix ping_timeout < ping_interval #2905). This is also fixed by your PR (and it's a good reason to restructure the code instead of a small change to thesince_last_(ping,pong)
logic inperiodic_ping
. - Other folks, such as the original message in websocket disconnect and reconnect every 10 minutes #3258, have reported problems when
ping_interval
andping_timeout
are not the same. It's not clear to me what exactly the problem is here, or whether this PR fixes it. - Does it actually make sense to set
ping_interval < ping_timeout
? That's the default, but since websockets go over TCP I'm not sure it makes sense to allow multiple pings in flight before detecting failure. I'm not sure what I was thinking when this was written/merged, but now I feel like a sensible configuration would be something likeping_interval=30; ping_timeout=5
.
I did find it a bit strange that this is possible.
I don't think pings & pongs can carry any [meta]data, so it is not possible to associate a pong with the ping that it was responding to. As a result,
👍
I did consider this, but thought of it more as the result of a quirky configuration rather than a bug in the implementation. For configurations where
One option would be to reject such configurations: assert ping_timeout is None or ping_timeout <= ping_interval However, that might be disruptive. Another option would be to raise a warning and fall back to the if ping_timeout > ping_interval:
app_log.warn(
'The configured websocket_ping_timeout is longer than the websocket_ping_interval.'
'\nSetting websocket_ping_timeout = websocket_ping_interval'
)
ping_timeout = ping_interval
Setting Here are my suggestions:
What do you think? |
They can, actually. RFC 6455 section 5.5.2 and 5.5.3 say that a ping frame may include "application data", and a pong frame must copy the application data from the ping frame it is responding to. So the information is there to match up pings and pongs. But even then, would it be useful to send a flurry of pings faster than the RTT? I'm not seeing any use case for that (at least in HTTP/1 and HTTP/2 - maybe in HTTP/3 depending on how they model the websocket stream in a reorderable connection)
I don't think that's true - there's no early exit from the sleep, so it will ping at ping_timeout instead of ping_interval. That's what I was getting at when I said "unconditional sleep". Am I missing something? I'd be OK with the behavior you're describing in this message but not what I think the code is doing.
This is probably the right answer; I only hesitate because the default
Yes, except that I'd rather not have a warning at all than have a warning every time a websocket is opened. I'm fine without a warning; if you'd like to keep it put it behind some kind of flag so it only logs once (I don't think we have an existing idiom for this). |
@bdarnell - Apologies for the delay. I have implemented the above suggestions and found a way to test. One change to the above discussion, I changed the default ping timeout to |
@@ -97,6 +99,9 @@ def log_exception( | |||
|
|||
_default_max_message_size = 10 * 1024 * 1024 | |||
|
|||
# log to "gen_log" but suppress duplicate log messages | |||
de_dupe_gen_log = functools.lru_cache(gen_log.log) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat trick; I haven't seen this one before.
tornado/websocket.py
Outdated
@@ -274,17 +279,40 @@ async def get(self, *args: Any, **kwargs: Any) -> None: | |||
|
|||
@property | |||
def ping_interval(self) -> Optional[float]: | |||
"""The interval for websocket keep-alive pings. | |||
"""Send periodic pings down the websocket. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc style nit: This is a property, not a method, so it should be described as a noun and not a verb.
What is "This" in the next sentence? Should say something like "If this is non-zero, the websocket will send a ping every ping_interval
seconds".
We should probably explain somewhere the relationship between these ping_interval
properties and the corresponding websocket_ping_interval
app setting if it's not already clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorted.
tornado/websocket.py
Outdated
return timeout | ||
assert self.ping_interval is not None | ||
return max(3 * self.ping_interval, 30) | ||
return max(self.ping_interval, 30) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This disagrees with the docs, which say min
instead of max
. It needs to be min
to satisfy the new rule.
Should that 30 even be there at all? I think it would be simplest and most consistent to just say that if ping_interval
is set, ping_timeout
defaults to ping_interval
(and ping_timeout
can be explicitly set to any value <= ping_interval
if desired)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected the max
(sorry).
Should that 30 even be there at all?
A reasonable ping interval might be 1 hour, but I don't think it would ever be reasonable for a websocket pong response to take longer than 30 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A reasonable ping interval might be 1 hour, but I don't think it would ever be reasonable for a websocket pong response to take longer than 30 seconds.
Is an hour a reasonable ping interval? That's a long time to wait to detect a failed connection.
On the other side, it's not only about the time a response takes, it's about how long the device can go to sleep before the connection is broken. If my router will leave a connection open while my laptop/phone is asleep, I don't think we need to have a separate hard-coded timeout here.
Stepping back, applications don't really care about the details here as much as they care about the time from the connection breaking to the connection-close callback. So even though interval=3600,timeout=30 has a much lower timeout than interval=timeout=3600, it only cuts the time-to-detect in half. I think it feels reasonable to say "by default failed connections will be detected in 2x the interval in the worst case; adjust the timeout value if you want to" instead of picking a hard-coded limit here (especially when my expectation is that most people who use timeouts with websockets will want an interval closer to a minute than an hour).
tornado/websocket.py
Outdated
and since_last_pong > self.ping_timeout | ||
): | ||
self.close() | ||
if interval <= 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check should probably go in start_pinging so we don't create a task unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tornado/websocket.py
Outdated
# make sure we received a pong within the timeout | ||
if ( | ||
timeout > 0 | ||
# and ping_time - self.last_pong >= timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look like it was supposed to be left commented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
tornado/websocket.py
Outdated
# and ping_time - self.last_pong >= timeout | ||
and ( | ||
# pong took longer than the timeout | ||
self.last_pong - ping_time > timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it is only possible in edge cases: the timeout has passed, but before this coroutine could be rescheduled, the pong was received. I don't think that matters and we could simplify a little bit. I think last_pong
could be replaced with a received_pong
boolean that is set to false when we send a ping and to true when we receive a pong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Removed this check (if the client responded slightly outside of the timeout but the server didn't check within this window, we might as well keep the connection open).
- Replaced
last_pong: int
with_received_pong: bool
.
tornado/websocket.py
Outdated
return | ||
|
||
# wait until the next scheduled ping | ||
await asyncio.sleep(interval - timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will drift a little bit, especially if your IOLoop gets blocked from time to time. I'm not sure that really matters but if we care we could sleep for IOLoop.current().time() - ping_time + interval
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't matter, but went with your suggestion anyway (might as well be as accurate with our pings as the clock allows).
tornado/test/websocket_test.py
Outdated
@@ -831,14 +835,79 @@ def on_ping(self, data): | |||
|
|||
@gen_test | |||
def test_client_ping(self): | |||
ws = yield self.ws_connect("/", ping_interval=0.01) | |||
ws = yield self.ws_connect("/", ping_interval=0.01, ping_timeout=0) | |||
for i in range(3): | |||
response = yield ws.read_message() | |||
self.assertEqual(response, "got ping") | |||
# TODO: test that the connection gets closed if ping responses stop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this TODO with your new test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tornado/test/websocket_test.py
Outdated
class PingHandler(TestWebSocketHandler): | ||
def initialize(self, close_future=None, compression_options=None): | ||
# capture the handler instance so we can interrogate it later | ||
nonlocal handlers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this nonlocal
statement is unnecessary (I use the pattern of a mutable list because it worked before nonlocal
was introduced; I haven't bothered to learn how to use nonlocal
myself)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
(I just have a habit of using nonlocal
in these situations for clarity)
tornado/test/websocket_test.py
Outdated
# print(f'$ {oppcode} {data}') # uncomment to debug | ||
if oppcode == 0xA: # NOTE: 0x9=ping, 0xA=pong | ||
from time import sleep | ||
sleep(delay) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocking time.sleep
seems problematic here - it's not a realistic simulation of what happens in production (but now I see why you did the time checking below).
How about this as a testing technique: instead of delay_pong
, do drop_ping
which replaces protocol.write_ping
with a no-op (you can use unittest.mock.patch
for this). Then pings don't get sent (even though the ping coroutine thinks they are), pongs don't get sent, and the connection should get closed on its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to test the actual timings as closely as I could, but agree, it's a bit bunk.
I've changed this from delaying the client pong, to just suppressing the client pong (no need to suppress the server ping).
I haven't used patch
for this as it's an instance method (so doesn't need to be reset post testing) and it's easier to target just the pongs with a wrapper.
* Closes tornadoweb#3258 * Closes tornadoweb#2905 * Closes tornadoweb#2655 * Fixes an issue with the calculation of ping timeout interval that could cause connections to be erroneously timed out and closed from the server end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. After CI I can merge and take care of the max(30).
Closes #3258
Closes #2905
Closes #2655
Fixes an issue with the calculation of ping timeout interval which caused connections to be erroneously closed from the server end. This could happen shortly after the connection was opened, before the ping was even sent (as reported in #3258) as the result of startup logic. It could also happen when pong responses were sent back from the client within the configured timeout as the result of a race condition.
To test this fix, try repeating the steps in this example: #3258 (comment)
I found a TODO to implement testing for ping timeout. I tried to fill this in but couldn't get it to work, any pointers appreciated (note, it has to be tested from the server side).