[WIP][shelly] Fix "duplicate id" on top of Shelly2RpcSocket fixes #20248

Open
markus7017 wants to merge 24 commits into openhab:main from markus7017:shelly_shelly2rpcsocket_duplid

Conversation

@markus7017
Contributor

@Nadahar While waiting on the merge of PR #20069 I prepared the next one on top of that

@markus7017 markus7017 added work in progress A PR that is not yet ready to be merged awaiting other PR Depends on another PR labels Feb 18, 2026
@markus7017 markus7017 closed this Feb 18, 2026
@Nadahar
Contributor

Nadahar commented Feb 18, 2026

Why the close?

@Nadahar
Contributor

Nadahar commented Feb 18, 2026

If something went wrong with rebasing and such, you don't have to create a new PR, just reopen this and force-push the "fixed" branch.

@markus7017
Contributor Author

markus7017 commented Feb 18, 2026

I created a sub-branch from yours,
made the changes,
pushed this branch to my repo,
but the PR doesn't pick up my additions

@Nadahar Nadahar reopened this Feb 18, 2026
@Nadahar
Contributor

Nadahar commented Feb 18, 2026

I reopened it, because as far as I remember, GitHub doesn't react to pushes if it's closed.

@Nadahar
Contributor

Nadahar commented Feb 18, 2026

This PR will reflect exactly what is in your shelly_shelly2rpcsocket_duplid branch. When I look at that branch, 88bcd7b is the last commit (which isn't one of your commits). Just cherry-pick them from whatever branch they're in and then push.

@Nadahar
Contributor

Nadahar commented Feb 18, 2026

Seems like you got it now 👍

@markus7017
Contributor Author

ok, the stash/push was missing
the last commit has my changes

  • fix duplicate id
  • integrate a poll to WS socket once a minute
  • getProfile() fires getStatus() only when the profile was not yet initialized

@Nadahar
Contributor

Nadahar commented Feb 18, 2026

  • integrate a poll to WS socket once a minute

Have you checked what the Shelly devices expect? Is there any API documentation available? There are different ways to do keepalive, and what's the best way depends on how it's implemented in the devices.

Have you verified the timestamps for the disconnects in the log - do they look like they happen a certain time after either connect or last transmission? It might not be a "keepalive" issue at all.

@markus7017
Contributor Author

The API spec can be found here:
https://shelly-api-docs.shelly.cloud/gen2/General/RPCChannels

I can't find any spec/recommendation on how to handle keepalive.
Using GetStatus also renews the subscription if it was lost for any reason; a WS ping wouldn't.

The integration is super simple. The binding polls on the http channel anyway once a minute
a) to make sure that the status is in sync
b) especially for Gen1 devices, not all values are reported by events

I use the same timer to fire an async Shelly.GetStatus on the WS channel - so no overhead, minimal change (in fact getStatus(true) does this, which is called from ShellyBaseHandler.refreshStatus()).
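For illustration only - a minimal sketch (not the binding's actual code) of firing such an async Shelly.GetStatus over an already open Jetty session; the frame layout follows the Gen2 RPC documentation, while the method name sendAsyncGetStatus and the "ohshelly-1" source id are just assumptions:

import java.io.IOException;
import org.eclipse.jetty.websocket.api.Session;

class RpcKeepAliveSketch {
    // Sends a Gen2 RPC request frame; the device's response arrives later on the same socket
    void sendAsyncGetStatus(Session session, int requestId) throws IOException {
        String frame = "{\"id\":" + requestId + ",\"src\":\"ohshelly-1\",\"method\":\"Shelly.GetStatus\"}";
        session.getRemote().sendString(frame); // blocking write; the result is handled in the socket's onMessage
    }
}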

@Nadahar
Contributor

Nadahar commented Feb 18, 2026

The API spec can be found here: https://shelly-api-docs.shelly.cloud/gen2/General/RPCChannels

I can't find any spec/recommendation on how to handle keepalive. Using GetStatus also renews the subscription if it was lost for any reason; a WS ping wouldn't.

It seems that the code that is running in the Shellys can be found here:

https://github.com/home-assistant-libs/aioshelly/blob/main/aioshelly/rpc_device/wsrpc.py

It doesn't contain too many details, because the technical details are handled by aiohttp. But, it doesn't look like it's expecting a keepalive. The question is if Jetty does, Jetty might find that it's "stale" and close it after too long. Could it be that the device where this happens most frequently sends information very infrequently, so that Jetty actually closes it?

There's no PING/PONG handling in the linked code, but it could reside inside aiohttp. I'm always hesitant of polling, it's such a wasteful practice, and it's one of those things that e.g. WS is supposed to get rid of. For battery devices, it could also drain their battery pretty quickly.

I think it would be useful to try to figure out what actually leads to the disconnects - but it's hard to do with the information I have.

@Nadahar
Contributor

Nadahar commented Feb 18, 2026

I really hate Python, but as far as I understand, this is where the websocket is created:

https://github.com/home-assistant-libs/aioshelly/blob/33528a6500d01815dd035d16677ceb52ff7608c0/aioshelly/rpc_device/wsrpc.py#L570-L576

The library has an automatic WS ping functionality that can be sent every X seconds by using: WebSocketResponse(heartbeat=10.0). This isn't done here, so the Shelly won't send pings. However, it also says that:

The autoping parameter (default True for both client and server) ensures that aiohttp automatically responds with a PONG upon receiving a PING.

This means that, if we send a PING, we should get a PONG in response, since they don't explicitly disable autoping. I think that's worth trying rather than keep sending HTTP requests. A ping/pong is very light, both in bandwidth and "processing". It takes hardly any effort to handle, while generating the "status" and transmitting it is much, much more "intrusive".
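For reference, a minimal sketch of what sending such a PING looks like with the Jetty 9 WebSocket API (standalone example, not the binding's code; the payload text is arbitrary):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.eclipse.jetty.websocket.api.Session;

class PingSketch {
    // Sends a WS PING control frame; a compliant peer answers with a PONG carrying the same payload
    void sendKeepAlivePing(Session session) throws IOException {
        ByteBuffer payload = ByteBuffer.wrap("keepalive".getBytes(StandardCharsets.UTF_8));
        session.getRemote().sendPing(payload);
    }
}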

@Nadahar
Contributor

Nadahar commented Feb 19, 2026

The integration is super simple. The binding polls on the http channel anyways once a minute
a) to make sure that the status is in sync
b) esp. for Gen1 devices not all values are reported by events

I'm super confused by the polling logic. It seems to me like the binding runs "refresh" every 3 seconds... which is extremely often. And, a lot of stuff takes place, none of which is thread-safe, although threads are used to execute the refreshes ("scheduled" jobs are executed by a thread pool). So, I'm not even able to follow exactly when what is called, but why does gen2 have this "refresh" in the first place - since they are event based, things should happen as events arrive, not on a schedule..?

@Nadahar
Contributor

Nadahar commented Feb 19, 2026

I found that the default Jetty WS idle timeout is 300000 ms/5 min, and I can't see that the binding overrides this. That means that devices will be disconnected by Jetty if nothing is sent in 5 min. It's a bit strange if that results in an "ungraceful" disconnect, but who knows.
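For illustration, assuming Jetty 9's WebSocket API, the timeout could be raised either on the client policy before connecting or on an established session (a sketch, not a claim about what the binding currently does):

import java.util.concurrent.TimeUnit;
import org.eclipse.jetty.websocket.api.Session;
import org.eclipse.jetty.websocket.client.WebSocketClient;

class IdleTimeoutSketch {
    void relaxIdleTimeout(WebSocketClient client, Session session) {
        // Affects sessions created after this call (the default is 300000 ms / 5 min)
        client.getPolicy().setIdleTimeout(TimeUnit.MINUTES.toMillis(30));
        // Or adjust an already established session directly
        session.setIdleTimeout(TimeUnit.MINUTES.toMillis(30));
    }
}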

@Nadahar
Contributor

Nadahar commented Feb 19, 2026

@markus7017 I made a suggested implementation for WS ping here: Nadahar@fb28951

I didn't add any handling for when the ping fails, except to log it. I don't know if there's any point; I assume that if it fails, onError() will be called - or either it or onClose() has already been called, but not in time to cancel the ping task.

@markus7017
Contributor Author

markus7017 commented Feb 19, 2026

It seems that the code that is running in the Shellys can be found here:

https://github.com/home-assistant-libs/aioshelly/blob/main/aioshelly/rpc_device/wsrpc.py

That's not the code running on Shelly side, but the Home Assistant integration.

The API spec can be found here: https://shelly-api-docs.shelly.cloud/gen2/General/RPCChannels
I can't find any spec/recommendation on how to handle keepalive. Using GetStatus also renews the subscription if it was lost for any reason; a WS ping wouldn't.

It seems that the code that is running in the Shellys can be found here:

https://github.com/home-assistant-libs/aioshelly/blob/main/aioshelly/rpc_device/wsrpc.py

It doesn't contain too many details, because the technical details are handled by aiohttp. But, it doesn't look like it's expecting a keepalive. The question is if Jetty does, Jetty might find that it's "stale" and close it after too long. Could it be that the device where this happens most frequently sends information very infrequently, so that Jetty actually closes it?

There's no PING/PONG handling in the linked code, but it could reside inside aiohttp. I'm always hesitant of polling, it's such a wasteful practice, and it's one of those things that e.g. WS is supposed to get rid of. For battery devices, it could also drain their battery pretty quickly.

I think it would be useful to try to figure out what actually leads to the disconnects - but it's hard to do with the information I have.

I'll work on that, checking device debug logs etc.

I found that the default Jetty WS idle timeout is 300000 ms/5 min, and I can't see that the binding overrides this.

It does

@WebSocket(maxIdleTime = Integer.MAX_VALUE)

I know that some users reported connection drops after a long time with relays that were almost doing nothing. I already had the WS polling in, which fixed it, but that got lost during merging/rebasing across PRs.

I prefer async Shelly.GetStatus, because

  • follows the API spec
  • keeps the WS open
  • and refreshes the Notify subscription if it was lost

I don't like the WS ping/pong, because it's not mentioned in the API spec.

I'm super confused by the polling logic. It seems to me like the binding runs "refresh" every 3 seconds... which is extremely often. And, a lot of stuff takes place, none of which is thread-safe, although threads are used to execute the refreshes ("scheduled" jobs are executed by a thread pool). So, I'm not even able to follow exactly when what is called, but why does gen2 have this "refresh" in the first place - since they are event based, things should happen as events arrive, not on a schedule..?

The basic idea was to have a central point for background tasks and avoid concurrency. Maybe a bad idea, but it works :-) and I don't plan to change it.
The scheduler runs every 3 sec and performs different status checks; every 20th cycle (= 60 sec) it does a GetStatus via HTTP and thing recovery.

Now I also added the WS ping there, which means we don't need another background task and it stays in sync with the thing handler poll.
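A hypothetical sketch of the counter-based tick described above (class and callback names are made up; the real scheduling lives in the binding's thing handler):

class RefreshTicker {
    private int tickCounter;
    private final Runnable minuteTasks;   // e.g. HTTP GetStatus plus the WS keepalive
    private final Runnable perTickChecks; // e.g. watchdog / thing recovery checks

    RefreshTicker(Runnable minuteTasks, Runnable perTickChecks) {
        this.minuteTasks = minuteTasks;
        this.perTickChecks = perTickChecks;
    }

    void tick() { // scheduled every 3 seconds
        perTickChecks.run();
        if (++tickCounter >= 20) { // 20 cycles * 3 s = 60 s
            tickCounter = 0;
            minuteTasks.run();
        }
    }
}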

@Nadahar
Contributor

Nadahar commented Feb 19, 2026

That's not the code running on Shelly side, but the Home Assistant integration.

Yes, that's what I thought too, but I was led to believe differently by the way the official documentation linked to aioshelly. I found it odd, but thought maybe they had done some kind of cooperation with HA for the functionality. But, reading it again, you're probably right.

From what I can see from the documentation, the Shellys shouldn't require any keepalive. Jetty on the other hand, probably does, but not until @WebSocket(maxIdleTime = Integer.MAX_VALUE) expires (it's hard to know what the effective limit is here, I doubt it's taken literally, or it would be a very long time). Which means that the reason for the disconnects probably isn't idle timeout, and adding the polling won't make a difference.

I would be very hesitant to add polling every minute, leading to a lot of unnecessary processing on both sides, and network utilization. The devices must have limited processing power, and I can't understand how battery devices wouldn't suffer a lot from this. And, what will it achieve? If the event system works and no events have arrived, it literally means that nothing has changed, so the processing is "for nothing".

It does

@WebSocket(maxIdleTime = Integer.MAX_VALUE)

I was pretty sure that I had seen it somewhere, but I couldn't find it... have you verified that it actually works to specify it like this, and that the 5-minute default timeout doesn't apply?

You can query the session for how long it has left until idle timeout, adding some temporary debug logging for that could help remove all doubt.
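One possible starting point for such temporary logging, assuming Jetty 9's Session API (the class name is made up; this logs the configured timeout and the open state, which then has to be correlated with the disconnect timestamps manually):

import org.eclipse.jetty.websocket.api.Session;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class IdleDebugSketch {
    private static final Logger logger = LoggerFactory.getLogger(IdleDebugSketch.class);

    void logIdleState(Session session) {
        // Temporary debug output: is the session still open, and which idle timeout is in effect?
        logger.debug("WS session open={}, effective idleTimeout={} ms", session.isOpen(), session.getIdleTimeout());
    }
}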

I prefer async Shelly.GetStatus, because

* follows the API spec

* keeps the WS open

* and refreshes the Notify subscription if it was lost

I don't like the WS ping/pong, because it's not mentioned in the API spec.

I interpret it differently. As far as I can see, the API says nothing about GetStatus being used this way. In fact, it says very little, except that "The connection is kept alive through the whole duration of the communication".

You don't know that it will keep the WS open; that is only true if the reason for the disconnections is idle timeout. I don't know how the Notify subscription should "get lost". That would mean a bug in the firmware. As long as the connection stays open, it's pretty obvious that the subscription is still "active".

WS ping/pong doesn't need to be explicitly mentioned, as it's a part of the WS spec. It is the way you're supposed to handle keepalive, not only for the endpoints, but for any "proxies" on the way (not that this seems very relevant in this case). WS ping/pong is designed to be as "light" as possible, it's mostly handled at the WS library level, and that's intentional. It results in minimal processing, minimal transmission size (it's mandated that it must fit within a single packet), and it is the very thing designed to be used for this purpose.

The basic idea was to have a central point for background tasks and avoid concurrency. Maybe a bad idea, but it works :-) and I don't plan to change it.
The scheduler runs every 3 sec and performs different status checks; every 20th cycle (= 60 sec) it does a GetStatus via HTTP and thing recovery.

As I don't understand half of what it does, I can't really judge that part, but I can't see that it has anything to "process" as long as nothing has changed. But, it doesn't avoid concurrency. It's the opposite: any kind of scheduling in Java means different threads (unlike e.g. JavaScript, where all schedules are executed by the one and only thread). It won't be the same thread every time, and it won't always happen on time (if the OH thread pool is starved for other reasons, another binding that does something "bad", or a lot of stuff happening at once). At the same time, other parts of the system, from other threads, will make requests to the binding to send commands or fetch data. This will happen concurrently.
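As a small illustration of the point about scheduling and threads (a standalone sketch, unrelated to the binding's code): consecutive runs of a scheduled task are typically executed by different pool threads, so shared state touched by the task and by other callers needs synchronization.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SchedulerThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);
        // Each run may land on a different pool thread - printing the thread name makes that visible
        pool.scheduleWithFixedDelay(
                () -> System.out.println("refresh tick on " + Thread.currentThread().getName()),
                0, 1, TimeUnit.SECONDS);
        Thread.sleep(5000);
        pool.shutdown();
    }
}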

I won't say that 60 seconds is "wrong" for a gen1 device, although usually I'd think that the refresh interval should be configurable (some devices might need frequent refreshes while others can live with long delays). But, for an event based system, which is what WS is, it's "just wrong". It goes against the very idea of events - the whole benefit they are supposed to provide is that you don't have to poll. It saves load everywhere and is just a better design. The whole polling thing is mostly a result of the limitations of HTTP; if it weren't for the fact that HTTP is "one way", it wouldn't even have been "a broadly accepted idea" in the first place.

Compare it to me waiting for the shower to be available. I can either tell my wife "Inform me when you're done in the shower" (event based), or I could ask her every 30 seconds until she's done (polling). I don't think my marriage would last long if I relied on polling 😉

@markus7017 markus7017 self-assigned this Feb 20, 2026
@markus7017 markus7017 linked an issue Feb 20, 2026 that may be closed by this pull request
@markus7017
Contributor Author

I get what you are saying.

However, if your wife listens to the radio she might forget to inform you the shower is empty... The binding has several fallback strategies for good reasons - I've seen a lot of firmware issues over the years.

I'll definitely keep the "poll once per minute on the http channel". There is no need to add more config options; there was never a request to change the interval. This routine also handles connection recovery etc.

Nevertheless, let's try WS PING first, but I won't add another background task that adds more concurrency. I'll change the current approach to send PING, but also add a callback to handle inbound PING. If we use it, we should be prepared that the device might use it too. Also, the PONG can be used to extend the watchdog timer.

As a result, network utilization is a tick lower, but it's still one message per minute, which needs to be processed on both sides. Battery devices are not affected, because they don't use an always-on connection. The device connects to the binding, sends an event, disconnects, and goes back to sleep mode.

So

  • initial async GetStatus to trigger the event mechanism
  • use ShellyBaseHandler.refreshStatus to send a WS ping once per minute
  • add inbound PING handling
  • check if we could catch PING to restart the watchdog timer

I need to look it up, but I had one user whose connection went idle when the device doesn't send updates for a long time. He could help to verify the new mechanism.

I'll remove the logic from this PR and create a separate one, because all this is out-of-scope from "fix duplicate id" ;-)

Ravi Nadahar added 6 commits February 20, 2026 19:10
Note: The class isn't fully thread-safe because that would mean to make all (final) field implementations thread-safe as well. But, the class itself is.

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Note: The class isn't fully thread-safe because that would mean to make all (final) field implementations thread-safe as well. But, the class itself is.

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
…profile immutable.

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
…he ThingHandler

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Ravi Nadahar and others added 16 commits February 20, 2026 19:10
…s (since they can't be restarted)

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
…vlet

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
…ymore

Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Ravi Nadahar <nadahar@rediffmail.com>
Signed-off-by: Markus Michels <markus7017@gmail.com>
Signed-off-by: Markus Michels <markus7017@gmail.com>
@markus7017 markus7017 force-pushed the shelly_shelly2rpcsocket_duplid branch from 7f013d5 to 78e6091 Compare February 20, 2026 18:10
devices

Signed-off-by: Markus Michels <markus7017@gmail.com>
@Nadahar
Contributor

Nadahar commented Feb 20, 2026

might forget to inform you the shower is empty

Yes, but luckily, programmed devices/computers don't "forget". So, unless you have "proof" that the event system isn't working, you'd better trust it IMO. The point is that polling is a very bad thing in so many ways, and "the world" is trying to get past it. It only became "a thing" because of the shortcomings of HTTP in the first place.

I know that there are situations where events don't work (for other devices or software), but I only resort to polling if there's no other way. In this situation, there is a "live" connection, as long as that connection is up and running, I would assume that events are working.

but I'll not add another background task adding more concurrency.

I don't understand what you mean here. If you look at my example implementation, there's no "more concurrency". It's using the same thread pool as your "refresh task", and it will be executed by one of the same threads. No matter how you twist it, there is concurrency here - but as always, ignoring concurrency doesn't immediately break something. It just makes things fail in strange and unpredictable ways from time to time, isn't generally reproducible, and is very hard to track down/debug. It just adds "a layer of bugginess" over everything.

The only way I can think of to avoid concurrency when using scheduling in Java is to create a scheduled thread pool with only one thread. But it won't really help, because OH itself is multithreaded, so when other parts of the system invoke some of the binding's methods, it won't be from the same thread. There is simply no practical way of making it single threaded - and if there were, OH would perform terribly.

So, as I see it, there are only two options:

  • Handle concurrency and make things stable.
  • Ignore concurrency and make things unstable/unreliable, but mostly working in "clean" test environments where little else is going on.

  • add inbound PING handling

I have my doubts that the Shellys send PINGs. If so, there would be nothing for the binding to handle; those alone would be enough to keep the connection alive. But the Shellys, probably to save energy and network traffic, don't seem to use it, and probably have no WS connection timeout of their own. Jetty, however, doesn't have that option, so we need some kind of "keepalive", even though it might potentially be a very rare event (if the timeout is long, the interval could be equally long).

  • check if we could catch PING to restart watchdog timer

I don't think there's a PING to catch; however, if you send PINGs, I expect that you will receive PONGs. You must implement a special Jetty interface to be notified of PONGs - and it's possible to send a payload with a PING (like a tracking number), and the same payload will be sent back with the PONG. I don't know how the watchdog is working, but you might be able to use this for something.
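For reference, a hedged sketch of the Jetty 9 listener interface in question (the class is standalone and illustrative, not the binding's Shelly2RpcSocket):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.eclipse.jetty.websocket.api.Session;
import org.eclipse.jetty.websocket.api.WebSocketPingPongListener;

class PongAwareSocket implements WebSocketPingPongListener {
    @Override
    public void onWebSocketConnect(Session session) {
        // connection established
    }

    @Override
    public void onWebSocketClose(int statusCode, String reason) {
        // connection closed
    }

    @Override
    public void onWebSocketError(Throwable cause) {
        // transport-level error
    }

    @Override
    public void onWebSocketPing(ByteBuffer payload) {
        // Jetty normally answers PINGs at the protocol level, so nothing is required here
    }

    @Override
    public void onWebSocketPong(ByteBuffer payload) {
        // The peer echoes whatever payload was sent with our PING (e.g. a tracking number);
        // this would be a natural place to re-arm a watchdog timer
        String echoed = StandardCharsets.UTF_8.decode(payload).toString();
        System.out.println("PONG received, payload=" + echoed);
    }
}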

I'll remove the logic from this PR and create a separate one, because all this is out-of-scope from "fix duplicate id" ;-)

Good idea 👍

That said, I'm still skeptical of whether the disconnects are caused by idle timeout at all. I would recommend (temporarily) trying to find a way to log "the remaining idle time" for a session/connection to get some insight, but I'm not sure how easy it is to get this information. You could also try setting the idle timeout really short, so that you can observe exactly what happens during an idle timeout. Maybe it happens in a way (with e.g. a status) that leaves no doubt about the cause, so that we don't have to speculate whether or not idle timeout is the reason for the disconnects that are the problem here.

@markus7017
Contributor Author

I want to hold off on in-depth analysis until the pending PRs are merged. My gut feeling says: the root cause is not the concurrency, but at the moment we have various potential sources.

Did you already review the code here?

used)

Signed-off-by: Markus Michels <markus7017@gmail.com>
@Nadahar
Contributor

Nadahar commented Feb 21, 2026

The root cause is not the concurrency, but at the moment we have various potential sources.

I don't think concurrency is the cause for the disconnects either. Of course, it could be somehow involved, but it's hard to imagine how the lack of synchronization alone could cause that.

It also seems to happen way too often (to be caused by concurrency issues). If it only happened rarely, or when the system was under a lot of stress, I would be inclined to suspect concurrency issues as the primary cause, but the current behavior "doesn't really fit".

My comments on concurrency issues were general in nature. What I meant to say is that it's often hard to make people realize that they must care about it, because it's often hard to "prove" exactly how the failure happens, it's often impossible to reproduce consistently, which means that many choose to simply ignore it because it's easier. But the bugs this causes are still very much real, and just the kind of bugs that can be very hard to live with: unpredictable, no way to "work around".

Did you already reviewed the code here?

More or less yes. I didn't follow all the details of the profile handling, but I'm not sure that it's important that I do. I can't do a "formal review" until #20069 has been merged and this PR has been rebased anyway.

Contributor

@Nadahar Nadahar left a comment


I went through and made a few comments, but there's really nothing of importance here.


private static final String RPC_SRC_PREFIX = "ohshelly-";
private static final int MAX_ID = 10000;
private final AtomicInteger requestId = new AtomicInteger(ThreadLocalRandom.current().nextInt(1, MAX_ID + 1));
Contributor


I'm not saying that it's a problem, I'm just wondering if ThreadLocalRandom was used because it's easier, or because there is some idea that it protects against concurrency issues. If it's just easy/convenient, that's fine, but it won't "protect" anything. The AtomicInteger will protect; all I'm saying is that any source of random would work just as well.
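To illustrate that point with a minimal sketch (not the binding's exact wrap-around handling): the thread-safety comes from the atomic increment, so the initial value could come from any random source.

import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

class RequestIdSketch {
    private static final int MAX_ID = 10000;
    // Plain java.util.Random works just as well as ThreadLocalRandom for picking the start value
    private final AtomicInteger requestId = new AtomicInteger(new Random().nextInt(MAX_ID) + 1);

    int nextId() {
        // getAndIncrement() is the part that makes concurrent callers safe
        return requestId.getAndIncrement();
    }
}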


if (!profile.initialized && profile.alwaysOn) {
getStatus(); // make sure profile.status is initialized (e.g,. relay/meter status)
asyncApiRequest(SHELLYRPC_METHOD_GETSTATUS); // request periodic status updates from device
Contributor


Here I don't follow completely, but I'm not even sure that it matters - or that it will be a part of the final code. I can't find any indication that asyncApiRequest will actually execute anything periodically.

config.deviceIp = address;
config.userId = bindingConfig.defaultUserId;
config.password = bindingConfig.defaultPassword;
config.serviceName = serviceName;
Contributor


I'm not sure what the serviceName is, but again, probably not important.

Development

Successfully merging this pull request may close these issues.

[shelly] "duplicate id" warnings in device logs
