[WIP][shelly] Fix "duplicate id" on top of Shelly2RpcSocket fixes #20248
markus7017 wants to merge 24 commits into openhab:main
Conversation
Why the close?

If something went wrong with rebasing and such, you don't have to create a new PR, just reopen this and force-push the "fixed" branch.

I created a sub-branch from yours.

I reopened it, because as far as I remember, GitHub doesn't react to pushes if it's closed.

This PR will reflect exactly what is in your branch.

Seems like you got it now 👍

OK, the stash/push was missing.

Have you checked what the Shelly devices expect? Is there any API documentation available? There are different ways to do keepalive, and what's best depends on how it's implemented in the devices. Have you verified the timestamps for the disconnects in the log - do they look like they happen a fixed time after either connect or the last transmission? It might not be a "keepalive" issue at all.

The API spec can be found here. I can't find any spec/recommendation on how to handle keepalive. The integration is super simple: the binding polls on the HTTP channel once a minute anyway, and I use the same timer to fire an async Shelly.GetStatus on the WS channel - so no overhead, minimal change (in fact getStatus(true) does this, which is called from ShellyBaseHandler.refreshStatus()).
It seems that the code that is running on the Shellys can be found here: https://github.com/home-assistant-libs/aioshelly/blob/main/aioshelly/rpc_device/wsrpc.py. It doesn't contain too many details, because the technical details are handled by the underlying library. There's no PING/PONG handling in the linked code, but it could reside inside that library. I think it would be useful to try to figure out what actually leads to the disconnects - but it's hard to do with the information I have.

I really hate Python, but as far as I understand, this is where the websocket is created. The library has automatic WS ping functionality that can be configured to send a ping every X seconds.
This means that, if we send a PING, we should get a PONG in response, since they don't explicitly disable the automatic PONG handling.
I'm super confused by the polling logic. It seems to me like the binding runs "refresh" every 3 seconds... which is extremely often. And a lot of stuff takes place, none of which is thread-safe, although threads are used to execute the refreshes ("scheduled" jobs are executed by a thread pool). So I'm not even able to follow exactly what is called when, but why does gen2 have this "refresh" in the first place? Since they are event based, things should happen as events arrive, not on a schedule.

I found that the default Jetty WS idle timeout is 300000 ms (5 min), and I can't see that the binding overrides this. That means that devices will be disconnected by Jetty if nothing is sent in 5 min. It's a bit strange if that results in an "ungraceful" disconnect, but who knows.
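
For reference, a minimal sketch of how that default could be overridden, assuming the Jetty 9.4 client API (which I believe is what openHAB ships); where and whether this gets wired into the binding is just an assumption:

```java
import org.eclipse.jetty.websocket.client.WebSocketClient;

public final class WsTimeoutSketch {
    // returns a started client whose sessions won't idle out after Jetty's 300000 ms default
    public static WebSocketClient createClient(long idleTimeoutMs) throws Exception {
        WebSocketClient client = new WebSocketClient();
        client.getPolicy().setIdleTimeout(idleTimeoutMs); // overrides the 5 min default
        client.start();
        return client;
    }
}
```

The same can also be done per connection via Session.setIdleTimeout(long) once a session is established.
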
@markus7017 I made a suggested implementation for WS ping here: Nadahar@fb28951. I didn't add any handling if the ping fails, except to log it. I don't know if there's any point; I assume that if it fails, the connection is already broken anyway.
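
For illustration, a method-level sketch of what the ping itself could look like with Jetty 9.4 (not necessarily what the linked commit does; the method name and the logger field are placeholders):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.eclipse.jetty.websocket.api.Session;

// hypothetical helper: fire a WS PING on an open session and just log failures
void sendKeepAlivePing(Session session) {
    if (session == null || !session.isOpen()) {
        return;
    }
    try {
        // the optional payload is echoed back in the PONG and must fit in a single frame
        session.getRemote().sendPing(ByteBuffer.wrap("keepalive".getBytes(StandardCharsets.UTF_8)));
    } catch (IOException e) {
        logger.debug("Failed to send WS PING: {}", e.getMessage());
    }
}
```
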
That's not the code running on the Shelly side, but the Home Assistant integration.
I'll work on that, checking the device debug log etc.
It does. I know that some users reported connection drops, with relays doing almost nothing, after a long time. I already had the WS polling in, which fixed it, but that got lost in merging/rebasing across PRs. I prefer the async Shelly.GetStatus, because it keeps the WS connection open and makes sure the Notify subscription doesn't get lost.
I don't like the WS ping/pong, because it's not mentioned in the API spec.
The basic idea was to have a central point for background tasks and avoid concurrency. Maybe a bad idea, but it works :-) and I don't plan to change it. Now I've also added the WS ping there, which means we don't need another background task and it stays in sync with the thing handler poll.
Yes, that's what I thought too, but I was led to believe differently by the way the official documentation linked to aioshelly. I found it odd, but thought maybe they had done some kind of cooperation with HA for the functionality. But, reading it again, you're probably right. From what I can see from the documentation, the Shellys shouldn't require any keepalive. Jetty, on the other hand, probably does, but not until the idle timeout is about to expire. I would be very hesitant to add polling every minute, leading to a lot of unnecessary processing on both sides, and network utilization. The devices must have limited processing power, and I can't understand how battery devices wouldn't suffer a lot from this. And, what will it achieve? If the event system works and no events have arrived, it literally means that nothing has changed, so the processing is "for nothing".
I was pretty sure that I had seen it somewhere, but I couldn't find it... have you verified that it actually works to specify it like this, and that the default 5 minute timeout doesn't apply? You can query the session for how long it has left until idle timeout; adding some temporary debug logging for that could help remove all doubt.
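
As far as I can tell, Jetty 9.4 only exposes the configured value via Session.getIdleTimeout(), not the remaining time, but even logging that - plus temporarily forcing a very short timeout - would show whether the override applies. A sketch, with session and logger assumed to exist in the surrounding handler:

```java
// temporary diagnostics: verify which idle timeout actually applies
logger.debug("WS idle timeout: {} ms (open={})", session.getIdleTimeout(), session.isOpen());
// session.setIdleTimeout(10_000L); // uncomment to provoke an idle timeout quickly
```
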
I interpret it differently. As far as I can see, the API says nothing about that. You don't know that it will keep WS open; that is only true if the reason for the disconnections is idle timeout. I don't know how the Notify subscription should "get lost". That would mean a bug in the firmware. As long as the connection stays open, it's pretty obvious that the subscription is still "active". WS ping/pong doesn't need to be explicitly mentioned, as it's a part of the WS spec. It is the way you're supposed to handle keepalive, not only for the endpoints, but for any "proxies" on the way (not that this seems very relevant in this case). WS ping/pong is designed to be as "light" as possible; it's mostly handled on the WS library level, and that's intentional. It results in minimal processing, minimal transmission size (it's mandated that it must fit within a single packet), and it is the very thing designed to be used for this purpose.
As I don't understand half of what it does, I can't really judge that part, but I can't understand that it has anything to "process" as long as nothing has changed. But, it doesn't avoid concurrency. It's the opposite: any kind of scheduling in Java means different threads (unlike e.g. JavaScript, where all schedules are executed by the one and only thread). It won't be the same thread every time, and it won't always happen on time (if the OH thread pool is starved for other reasons, another binding that does something "bad", or a lot of stuff happening at once). At the same time, other parts of the system, from other threads, will make requests to the binding to send commands or fetch data. This will happen concurrently.

I won't say that 60 seconds is "wrong" for a gen1 device, although usually I'd think that the refresh interval should be configurable (some devices might need frequent refreshes while others can live with long delays). But, for an event based system, which is what WS is, it's "just wrong". It goes against the very idea of events - the whole benefit they are supposed to provide is that you don't have to poll. It saves everything from load and is just a better design. The whole polling thing is mostly a result of the limitations of HTTP; if it wasn't for the fact that HTTP is "one way", it wouldn't even have been "a broadly accepted idea" in the first place. Compare it to me waiting for the shower to be available. I can either tell my wife "Inform me when you're done in the shower" (event based), or I could ask her every 30 seconds until she's done (polling). I don't think my marriage would last long if I relied on polling 😉

I get what you are saying. However, if your wife listens to the radio she might forget to inform you the shower is free... The binding has several fallback strategies for good reasons - I saw a lot of firmware issues over the years. I'll definitely keep the "poll once per minute on the HTTP channel". There is no need to add more config options; there was never a request to change the interval. This routine also handles connection recovery etc. Nevertheless, let's try WS PING first, but I'll not add another background task adding more concurrency. I'll change the current approach to send PING, but also add a callback to handle inbound PING. If we use it, we should be prepared that the device might use it too. Also, the PONG can be used to extend the watchdog timer. As a result, network utilization is a tick lower, but there is still one message per minute, which needs to be processed on both sides. Battery devices are not affected, because they don't use an always-on connection: the device connects to the binding, sends an event, disconnects and goes back to sleep mode.
I need to look it up, but I had one user with idle disconnects when the device doesn't send updates for a long time. He could help verify the new mechanism. I'll remove the logic from this PR and create a separate one, because all this is out of scope for "fix duplicate id" ;-)
Note: The class isn't fully thread-safe, because that would mean making all (final) field implementations thread-safe as well. But the class itself is. Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Force-pushed from 7f013d5 to 78e6091
Yes, but luckily, programmed devices/computers don't "forget". So, unless you have "proof" that the event system isn't working, you'd better trust it IMO. The point is that polling is a very bad thing in so many ways, and "the world" is trying to get past it. It only became "a thing" because of the shortcomings of HTTP in the first place. I know that there are situations where events don't work (for other devices or software), but I only resort to polling if there's no other way. In this situation, there is a "live" connection; as long as that connection is up and running, I would assume that events are working.
I don't understand what you mean here. If you look at my example implementation, there's no "more concurrency". It's using the same thread pool as your "refresh task"; it will be executed by one of the same threads. No matter how you twist it, there is concurrency here - but as always, ignoring concurrency doesn't immediately break something. It just makes it fail in strange and unpredictable ways from time to time, isn't generally reproducible, and is very hard to track down/debug. It just adds "a layer of bugginess" over everything. The only way I can think of to avoid concurrency when using scheduling in Java is to create a scheduled thread pool with only one thread. But it won't really help, because OH itself is multithreaded, so when other parts of the system invoke some of the binding's methods, it won't be from the same thread. There is simply no practical way of making it single threaded - and if it were, OH would perform terribly. So, as I see it, there are only two options:
I have my doubts that the Shellys send PINGs. If so, there would be nothing for the binding to handle; those alone would be enough to keep the connection alive. But the Shellys, probably to save energy and network traffic, don't seem to use it, and probably have no WS connection timeout themselves. Jetty, however, doesn't have that option, so we need some kind of "keepalive", even though it might potentially be a very rare event (if the timeout is long, the interval could be equally long).
I don't think there's a PING to catch; however, if you send PINGs, I expect that you will receive PONGs. You must implement a special Jetty interface to be notified of PONGs - and it's possible to send a payload with a PING (like a tracking number), and the same payload will be sent back with the PONG. I don't know how the watchdog works, but you might be able to use this for something.
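
For reference, a sketch of what that could look like with Jetty 9.4's WebSocketPingPongListener (the class name and the watchdog hook are made up for illustration):

```java
import java.nio.ByteBuffer;
import org.eclipse.jetty.websocket.api.Session;
import org.eclipse.jetty.websocket.api.WebSocketPingPongListener;

// sketch: get notified of PING/PONG frames; Jetty answers inbound PINGs automatically
public class ShellyPingPongSocket implements WebSocketPingPongListener {
    private volatile long lastPongMillis;

    @Override
    public void onWebSocketPing(ByteBuffer payload) {
        // notification only - the PONG reply itself is handled by Jetty
    }

    @Override
    public void onWebSocketPong(ByteBuffer payload) {
        // the payload echoes whatever was sent with the PING (e.g. a tracking number);
        // this would be the place to feed a watchdog timer
        lastPongMillis = System.currentTimeMillis();
    }

    // inherited from WebSocketConnectionListener, which the interface extends
    @Override
    public void onWebSocketConnect(Session session) {
    }

    @Override
    public void onWebSocketClose(int statusCode, String reason) {
    }

    @Override
    public void onWebSocketError(Throwable cause) {
    }
}
```
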
Good idea 👍 That said, I'm still skeptical of whether the disconnects are caused by idle timeout at all. I would recommend (temporarily) trying to find a way to log "the remaining idle time" for a session/connection to get some insight, but I'm not sure how easy it is to get this information. You could also try setting the idle timeout really short, so that you can observe exactly what happens during an idle timeout. Maybe it happens in a way (with e.g. a status) that leaves no doubt about the cause, which means that we wouldn't have to speculate whether or not it is the reason for the disconnects here.

I want to wait with the intense analysis until we have the pending PRs merged. My gut feeling says: the root cause is not the concurrency, but at the moment we have various potential sources. Did you already review the code here?
I don't think concurrency is the cause for the disconnects either. Of course, it could be somehow involved, but it's hard to imagine how the lack of synchronization alone could cause that. It also seems to happen way too often (to be caused by concurrency issues). If it only happened rarely, or when the system was under a lot of stress, I would be inclined to suspect concurrency issues as the primary cause, but the current behavior "doesn't really fit". My comments on concurrency issues were general in nature. What I meant to say is that it's often hard to make people realize that they must care about it, because it's often hard to "prove" exactly how the failure happens and often impossible to reproduce consistently, which means that many choose to simply ignore it because it's easier. But the bugs this causes are still very much real, and just the kind of bugs that can be very hard to live with: unpredictable, with no way to "work around" them.
More or less yes. I didn't follow all the details of the profile handling, but I'm not sure that it's important that I do. I can't do a "formal review" until #20069 has been merged and this PR has been rebased anyway.
Nadahar left a comment:
I went through and made a few comments, but there's really nothing of importance here.
```java
private static final String RPC_SRC_PREFIX = "ohshelly-";
private static final int MAX_ID = 10000;
private final AtomicInteger requestId = new AtomicInteger(ThreadLocalRandom.current().nextInt(1, MAX_ID + 1));
```
I'm not saying that it's a problem, I'm just wondering if ThreadLocalRandom was used because it's easier, or because there is some idea that it protects against concurrency issues. If it's just easy/convenient, that's fine, but it won't "protect" anything. The AtomicInteger will protect; all I'm saying is that any source of random would work just as well.
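
To illustrate the point with a hypothetical helper (mirroring the fields above): the AtomicInteger is the only thing doing the protecting; the random source merely picks the starting value.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

class RequestIds {
    private static final int MAX_ID = 10000;
    // any random source would do here - only the initial seed value comes from it
    private final AtomicInteger requestId = new AtomicInteger(ThreadLocalRandom.current().nextInt(1, MAX_ID + 1));

    // thread-safe increment with wraparound, courtesy of the AtomicInteger alone
    int next() {
        return requestId.updateAndGet(id -> id >= MAX_ID ? 1 : id + 1);
    }
}
```
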
```java
if (!profile.initialized && profile.alwaysOn) {
    getStatus(); // make sure profile.status is initialized (e.g. relay/meter status)
    asyncApiRequest(SHELLYRPC_METHOD_GETSTATUS); // request periodic status updates from device
}
```
Here I don't follow completely, but I'm not even sure that it matters - or that it will be a part of the final code. I can't find any indication that asyncApiRequest will actually execute anything periodically.
```java
config.deviceIp = address;
config.userId = bindingConfig.defaultUserId;
config.password = bindingConfig.defaultPassword;
config.serviceName = serviceName;
```
I'm not sure what the serviceName is, but again, probably not important.
@Nadahar While waiting for PR #20069 to be merged, I prepared the next one on top of it.