Disconnect instantly while reconnecting#197
Conversation
|
I just started oncall and will try to take a look at this as part of that. We have a longer-term refactor/evaluation of the reconnection logic planned as part of the overall iot quality pass we're working through. I think this would be a good thing to get in beforehand. I would like to move the connection's event loop to a stable one-time choice so that even when there's no channel we execute all tasks (like reconnect/shutdown) on a single thread of execution, but that will need to wait for the refactor. |
That sounds like a good idea, as channels come and go together with typical network connectivity issues, so having a static event loop throughout the lifetime of a connection (shared with its channels?) sounds more stable. In this PR I focused just on minimizing the changes, and therefore there's the two paths to shutdown, the old one where channel(+connection) shutdown gets scheduled into the channel event loop when possible, and the new where only connection shutdown is run in client bootstrap event loop when there's no channel. I'm also fully OK with you modifying these changes as you see fit, as I'm not on super firm ground on understanding the whole channel/connection/event loop -system, and also because our opposing time zones might make back-and-forth changes slow. |
We noticed that trying to shut down clients in reconnecting-state may take several minutes before the
on_disconnectcallback gets called. The reason appears to be that when theconnectionhas nochannelwhen it has no network, the disconnection request will not be handled until the next scheduled reconnect-task is run. This may take up to 128 seconds, which is a bit long for e.g. demanding people to wait until they can reboot their computer. In addition to this, if the client is running as a Windows service and the computer is in S0 low-power idle state, the service process gets only a fraction of normal CPU time, and it may take over 30 minutes before the reconnect task gets run.It looks like the disconnection task is heavily tied with a
channelof theconnection, which is not available when there's no network, so currently we wait until a reconnect-task tries to create a newchannel. However once the disconnect-request has been made, the newchannelwill be only used for callings_mqtt_client_shutdown, which is to be called after thechannelhas completed its shutdown, so having achannelshould not be needed?This patch skips the wait for the reconnect-task, and instead lets the disconnect-task to call
s_mqtt_client_shutdowndirectly in case there's no channel. This shutdown function was moved from client_channel_handler.c to client.c to get access tostatic void s_mqtt_client_shutdown()defined within client.c. It also feels like it fits there almost better than the channel-file, as the only link to channels appears to be thats_mqtt_disconnect_taskshould ensure that the channel has been shut down befores_mqtt_client_shutdowngets called.Testing this did not show any obviuous side-effects even for the case where reconnect task gets run after disconnect, as the call
aws_atomic_store_ptr(&connection->reconnect_task->connection_ptr, NULL)froms_mqtt_disconnect_taskappears to cause the reconnect tasks_attempt_reconnectto bail out due to connection pointer being NULL.One worry pointed out by @kankri about this patch is the small but possible chance that the
s_mqtt_disconnect_taskands_attempt_reconnecttasks would get run simultaneously by two separate aws event loop threads, and thens_attempt_reconnectacquires a pointer toconnectionwhose content might get freed and/or invalidated bys_mqtt_disconnect_taskbefores_attempt_reconnectis done. But I guess this is already a potential problem, because the comment "If there is an outstanding reconnect task, cancel it" insides_mqtt_disconnect_taskalready assumes that there might be a reconnect task scheduled for the short term future? We could make a separate issue out of that.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.