CSSTUDIO-3113 PVAClient: Accept TCP connections on a separate thread. #3338
Looks like the same problem we had with Channel Access, epics-base/jca#36.
Thoughts from that: When you have an IOC that replies to a search with "Talk to me via TCP on address:port" but then doesn't respond on that TCP address:port, you have a problem. Can't operate like that, need to fix the IOC (file handle resources, ...) and/or the network (firewall).
Still, on the client side we need something better than hanging forever. In your tests, does it really hang forever? Or do you see a long connection timeout (minutes...)?
For CA, we decided that failing right away is bad because then you enter a tight loop of search, get UDP response, fail to connect via TCP, search, ...
Moving the TCP connection to a new thread wouldn't be my first choice. It creates additional threads that we otherwise don't need, and we should still have a connection timeout (with a default of ~5 seconds?).
Can you try that? Add a connection timeout, ~5 sec. The screens would then connect somewhat sluggishly, which is not perfect, but checking the log will then show the connection errors and provide good pointers to the underlying problem, i.e. which IOC is failing to allow connections. |
For example, try this:
|
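To illustrate the connection timeout suggested above: the following is a minimal sketch, not the diff from the comment and not the core-pva code. It relies on `Socket.connect(SocketAddress, int)` from the JDK, which gives up after the given number of milliseconds instead of waiting for the OS-level timeout. The class name `ConnectWithTimeout` and the helper `tryConnect` are made up for this example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical helper, not part of core-pva: connect with an upper bound on the wait */
class ConnectWithTimeout
{
    /** @return Connected socket, or null if no connection could be made within timeoutMs */
    static Socket tryConnect(final InetSocketAddress address, final int timeoutMs)
    {
        final Socket socket = new Socket();
        try
        {
            // Gives up with SocketTimeoutException after timeoutMs
            socket.connect(address, timeoutMs);
            return socket;
        }
        catch (IOException ex)
        {
            // Timeout, connection refused, unreachable host, ...
            // Logging the address points at the IOC that fails to accept connections
            System.err.println("Cannot connect to " + address + ": " + ex);
            try { socket.close(); } catch (IOException ignored) { /* nothing left to do */ }
            return null;
        }
    }

    public static void main(final String[] args) throws Exception
    {
        // Local stand-in for a healthy server, so the example needs no real IOC
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress()))
        {
            final InetSocketAddress address =
                new InetSocketAddress(server.getInetAddress(), server.getLocalPort());
            try (Socket socket = tryConnect(address, 5000))
            {
                System.out.println("Connected: " + (socket != null));
            }
        }
    }
}
```

With a timeout like this, one unresponsive IOC delays the caller by at most the timeout per attempt, and the log names the address that failed.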
Thanks for the link; it seems to be basically the same issue!
Yes, I think Phoebus should (correctly) show the faulty IOC as being disconnected, but it should still be possible to connect to other IOCs without issue.
I tested this (however, with sample size
There are at least two drawbacks of creating threads:
The main benefit would be that it allows independent connections to be established without interfering with one another.
Thank you for the diff! I tried running with it, and the observed behavior is similar to the behavior I described above, with two exceptions:
|
Before dismissing threading... The issue that triggered this investigation was an OPI with a large number of PVs, some of which were running on the crippled IOC. At that point there was no clue in the OPI or in the log as to how to identify the IOC. Connecting on separate threads would in this case have a better chance of revealing which PVs were actually responsive when the OPI was reloaded to trigger new connection attempts. Agreed that threading comes with potential issues, but those can be managed. As for resources... we could consider a common thread pool to support the connection process. Granted, the thread pool would be exhausted if the number of non-responsive PVs is larger than the pool size. |
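The thread pool caveat can be illustrated with a small simulation. This is only a sketch: the class `SharedConnectPool` and its method are hypothetical, and `Thread.sleep` stands in for a TCP connect that blocks until it times out. Once every pool thread is occupied by a stuck attempt, an attempt for a healthy IOC has to wait.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical simulation of a shared, fixed-size pool for connection attempts */
class SharedConnectPool
{
    /** Submit 'stuck' blocking attempts, then one attempt for a healthy IOC.
     *  @return Milliseconds the healthy attempt waited before it started to run
     */
    static long healthyDelayMs(final int poolSize, final int stuck, final long stuckMs) throws Exception
    {
        final ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try
        {
            final long start = System.nanoTime();
            for (int i = 0; i < stuck; ++i)
                pool.submit(() ->
                {
                    // Stand-in for a TCP connect that blocks until its timeout
                    try { Thread.sleep(stuckMs); }
                    catch (InterruptedException ex) { Thread.currentThread().interrupt(); }
                });
            // The healthy attempt just records when it finally got a thread
            final Future<Long> healthy = pool.submit(() -> System.nanoTime() - start);
            return TimeUnit.NANOSECONDS.toMillis(healthy.get());
        }
        finally
        {
            pool.shutdownNow();
        }
    }

    public static void main(final String[] args) throws Exception
    {
        System.out.println("1 stuck, pool of 2: healthy waited " + healthyDelayMs(2, 1, 300) + " ms");
        System.out.println("2 stuck, pool of 2: healthy waited " + healthyDelayMs(2, 2, 300) + " ms");
    }
}
```

In the second case the healthy attempt cannot start until one of the stuck attempts gives up, which is the exhaustion scenario described above.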
I think we need to understand this before adding threads and really making it more complicated. With a connection timeout of about 2 seconds, I would accept that some connections to "good" IOCs are delayed but only by about those 2 seconds.
.. and then we enter the loop right away, so overall we're constantly in that 2 second wait-for-timeout that blocks everything else. Can you increase the log level to the point where you see all the search and related details to check if that's what's happening, and if not, what is happening? If the scenario is like that, we need to modify this part
to delay the search. Register the search such that the |
@kasemir I think it's not so easy to understand what is happening based on the log-messages, so I tried to run in a debugger to understand what is going on. Running in a debugger, I can easily see cases where there are already entries in What do you think about modifying this part in
to
If I understand correctly, Perhaps the argument |
I should also add: I think the code eventually does get to search for the other channels, however it apparently may take some time. In |
The search bucket business is complicated. I've added about 40 lines of comments to explain it, but when I re-read that now I'm not 100% sure.
This comment suggests that we are indeed not removing a channel from all buckets for some reason, which might explain what you see:
Maybe that's part of the problem? For failed searches, should we remove the channel from all buckets, then schedule it with a long delay? |
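The idea of removing a failed channel from all buckets and then scheduling it with a long delay could look roughly like this. The sketch uses a made-up, much simplified bucket ring (`SearchBucketsSketch`, integer channel IDs, no time slot advancing); it is not the actual search bucket code, which has more going on.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified ring of search buckets; not the core-pva implementation */
class SearchBucketsSketch
{
    private final List<Set<Integer>> buckets = new ArrayList<>();

    /** Index of the bucket for the current time slot (fixed at 0 in this sketch) */
    private final int current = 0;

    SearchBucketsSketch(final int size)
    {
        for (int i = 0; i < size; ++i)
            buckets.add(new LinkedHashSet<>());
    }

    /** Schedule channel to be searched 'delay' time slots from now */
    synchronized void schedule(final int channelId, final int delay)
    {
        buckets.get((current + delay) % buckets.size()).add(channelId);
    }

    /** After a failed connection: drop the channel from every bucket, then search again much later */
    synchronized void rescheduleAfterFailure(final int channelId, final int longDelay)
    {
        for (Set<Integer> bucket : buckets)
            bucket.remove(Integer.valueOf(channelId));
        schedule(channelId, longDelay);
    }

    /** @return Number of buckets that contain the channel */
    synchronized int occurrences(final int channelId)
    {
        int count = 0;
        for (Set<Integer> bucket : buckets)
            if (bucket.contains(channelId))
                ++count;
        return count;
    }

    public static void main(final String[] args)
    {
        final SearchBucketsSketch search = new SearchBucketsSketch(8);
        search.schedule(7, 1);
        search.schedule(7, 3);
        System.out.println("Before: channel 7 is in " + search.occurrences(7) + " buckets");
        search.rescheduleAfterFailure(7, 5);
        System.out.println("After:  channel 7 is in " + search.occurrences(7) + " buckets");
    }
}
```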
I have a very incomplete understanding of this code, however to me, this source code comment looks incorrect: when I think that it is likely that
Good point; the index must be thought about carefully if this approach is adopted. However, after thinking more about this issue, I think that the implemented logic in Therefore, I believe that a non-blocking approach to establishing the TCP connections is more "correct" than delaying the UDP searches and changing the timeouts on the establishment of TCP connections. One way to achieve this is with threads, as in this pull request. While "normal" OS-level threads are resource-intensive and limited in number, Java 21 introduced "Virtual Threads", which consume far fewer resources. [1] states:
What do you think about the following two ideas?
|
So if I understand correctly, right now we will only do step 1.
Should this be completed before we create the Phoebus 5 release, or is this a unique enough situation that we should not hold up the release? |
I'm for doing two things: Goal is to not hang for a long time. |
Conclusion from ESS point of view should be that it is rare. We've used pva extensively and this was identified only very recently when an IOC got into a deadlock state. |
I agree that this is an acceptable fix. However, I think using virtual threads with a longer timeout than 2 seconds is a more correct fix, and I am slightly more in favor of this solution. |
I feel like I have less than an incomplete understanding of the code (but that hasn't stopped me from having an opinion). Could we have a setup which works with 2 queues (buckets)? |
@shroffk: I think fundamentally the problem here is due to the call to establish a TCP connection being blocking: if the establishment of a TCP connection takes a long time, then all subsequent UDP broadcasts and TCP connection attempts are delayed by that amount of time. I think that a "correct" fix of this issue is to make the TCP connection attempts non-blocking, for instance by running them on "virtual threads" [1] which have low overhead. The idea is to create one virtual thread for each PV.
In addition to the problem of TCP connection attempts being blocking, it seems likely that there is an error in the code that results in "extra" connection attempts being made to IOCs that respond over UDP but not over TCP. This exacerbates the problem of the TCP connection attempts being blocking. Luckily, it seems that this can be fixed easily and separately.
If you have two threads, then one runs into the same problem of TCP connection attempts being blocking, but on each thread instead. If the two threads could share a queue, then this resolves the problem if there is just one IOC that is not responsive over TCP. However, if there are two IOCs that don't respond over TCP, then one again runs into the same issue. I think the solution is to have the same number of (virtual) threads as one has PVs, since then one doesn't run into this problem.
[1] https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html |
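A rough sketch of the one-task-per-connection idea, assuming nothing about the actual core-pva classes: `ThreadPerConnectionSketch` and `connectAll` are hypothetical, and `Thread.sleep` simulates a connect that blocks. The executor is passed in so that the same code can be tried with platform threads or, on Java 21 or later, with `Executors.newVirtualThreadPerTaskExecutor()`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical simulation: every connection attempt runs as its own task */
class ThreadPerConnectionSketch
{
    /** @param connectMs How long each simulated TCP connect blocks
     *  @return Indices of the attempts, in the order in which they completed
     */
    static List<Integer> connectAll(final ExecutorService executor, final long[] connectMs)
        throws InterruptedException
    {
        final List<Integer> completed = Collections.synchronizedList(new ArrayList<>());
        final CountDownLatch done = new CountDownLatch(connectMs.length);
        for (int i = 0; i < connectMs.length; ++i)
        {
            final int index = i;
            // The caller (think: the search thread) does not block here
            executor.execute(() ->
            {
                try { Thread.sleep(connectMs[index]); }
                catch (InterruptedException ex) { Thread.currentThread().interrupt(); }
                completed.add(index);
                done.countDown();
            });
        }
        done.await();
        return completed;
    }

    public static void main(final String[] args) throws Exception
    {
        // Unbounded platform-thread pool; on Java 21+ a virtual-thread-per-task
        // executor could be passed instead
        final ExecutorService executor = Executors.newCachedThreadPool();
        try
        {
            // Attempt 0 simulates the IOC that does not answer over TCP
            System.out.println("Completion order: " + connectAll(executor, new long[] { 400, 0, 0 }));
        }
        finally
        {
            executor.shutdown();
        }
    }
}
```

The slow attempt is submitted first but does not hold up the other two, which is the behavior argued for above.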
Some thoughts:
|
As I understand [1], the idea is to supply an abstraction for threads that is very lightweight and makes constructions such as thread pools unnecessary. For example, [1] states: "the inability to spawn very many platform threads—the only implementation of threads available in Java for many years—has bred practices designed to cope with their high cost. These practices are counterproductive when applied to virtual threads, and must be unlearned. Moreover, the vast difference in cost informs a new way of thinking about threads that may be foreign at first." [1] also states: "Blocking a platform thread is expensive because it holds on to the thread—a relatively scarce resource—while it is not doing much meaningful work. Because virtual threads can be plentiful, blocking them is cheap and encouraged. Therefore, you should write code in the straightforward synchronous style and use blocking I/O APIs." To me, this sounds like an abstraction that can significantly simplify much code, as well as lead to increased correctness. Of course, an implementation using virtual threads must be tested before it is adopted, to see if it really performs well.
In the proposed solution with virtual threads, only the establishment of the TCP connection would be run on separate virtual threads. No new instances of
A future is a primitive used for synchronization, not a means in itself to run code in a non-blocking way. In order for the computation of a future to be computed in a non-blocking way, it still needs to be run on a separate thread.
Probably it could be solved per IP-address, however I think that implementing that correctly is likely to be non-trivial and prone to difficult-to-debug bugs due to the concurrency. [1] https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html |
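On the point about futures, a small sketch: the `Future` returned by `ExecutorService.submit` only lets the caller wait for the result, while the work itself runs on whatever thread the executor provides. The class name `FutureNeedsThread` and the thread name "connector" are made up for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: a Future is a handle for a result, the executor supplies the thread */
class FutureNeedsThread
{
    /** @return Name of the thread that actually ran the submitted task */
    static String runOnExecutor() throws Exception
    {
        final ExecutorService executor = Executors.newSingleThreadExecutor(run ->
        {
            final Thread thread = new Thread(run, "connector");
            thread.setDaemon(true);
            return thread;
        });
        try
        {
            // The task reports which thread it runs on
            final Future<String> future = executor.submit(() -> Thread.currentThread().getName());
            // The caller only synchronizes on the result, with an upper bound on the wait
            return future.get(2, TimeUnit.SECONDS);
        }
        finally
        {
            executor.shutdown();
        }
    }

    public static void main(final String[] args) throws Exception
    {
        System.out.println("Caller thread: " + Thread.currentThread().getName());
        System.out.println("Task ran on:   " + runOnExecutor());
    }
}
```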
Step 1, the socket connection timeout, is the most important step to keep us from blocking for a long time.
|
As per discussion last week, would it make sense to put the entire procedure on a (lightweight) thread, i.e. UDP search + TCP connection? |
No. The search messages are not a problem. They are efficiently merged such that one UDP search contains all the channels that are scheduled for being searched in that time slot. The creation of a new TCP socket connection is usually no problem, either, but moving that to a (LW) thread avoids the occasional delays. |
That's a good point and a good question. I don't know the background to this and I don't know the answer to the question. It should surely be considered, I think.
I think it is correct that it needs to be removed. However, this is not the only issue: when re-loading an OPI, the same PV can occur multiple times in I think it's likely that the invariant should be maintained in the code that channels appear at most once in the data structures representing the search state. If we add assertions asserting this in places where the data structures are modified, we should be able to see if this invariant is maintained. |
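A check for the "at most once" invariant might look like the following sketch. The representation of the search state as a list of buckets holding channel names is an assumption made for the example, as are the names `SearchStateInvariant`, `duplicates` and `check`; the real data structures may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical invariant check: each channel appears at most once in the search state */
class SearchStateInvariant
{
    /** @return Channel names that appear more than once across all buckets */
    static Set<String> duplicates(final List<List<String>> buckets)
    {
        final Set<String> seen = new HashSet<>();
        final Set<String> duplicates = new TreeSet<>();
        for (List<String> bucket : buckets)
            for (String channel : bucket)
                if (! seen.add(channel))
                    duplicates.add(channel);
        return duplicates;
    }

    /** Call wherever the search state is modified; fails loudly if the invariant is broken */
    static void check(final List<List<String>> buckets)
    {
        final Set<String> duplicates = duplicates(buckets);
        if (! duplicates.isEmpty())
            throw new IllegalStateException("Channels scheduled more than once: " + duplicates);
    }

    public static void main(final String[] args)
    {
        final List<List<String>> buckets = Arrays.asList(
            Arrays.asList("pv1", "pv2"),
            Arrays.asList("pv2", "pv3"));
        System.out.println("Scheduled more than once: " + duplicates(buckets));
    }
}
```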
Regarding core-pva building with JDK 8: I made an issue a while ago to update, #2776. Matlab supports Java up to 17 now: https://www.mathworks.com/support/requirements/language-interfaces.html I don't really know why it's a requirement to load the core-pva library from Matlab, though. Doesn't p4p solve the problem of pvAccess from Matlab? And Matlab seems to keep that API more up to date (understandably). Isn't this thread getting away from the issue, though? @kasemir can't you make some PRs for your suggestions? I think it's better to make some changes and test them rather than circling around the most optimal solution. Ideally, I think we should have a unit test that reproduces the issue, which we can then create solutions against. |
Thanks for the information. Unfortunately, Virtual Threads were introduced in Java 21 (https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html), but Java 17 is at least much more recent than Java 8.
I don't think it's getting away from the issue: we're still determining the set of issues present in the code.
It's very good to have unit tests, but full correctness of concurrent code is often difficult to test using unit tests. In this case, it seems that connections are eventually made (or not made in the case of unreachable IOCs). We don't want the code to be only functionally correct (i.e., eventually establish connections); we also want it to implement the intended algorithm correctly, so that it achieves the desired performance characteristics in all cases. |
As for accommodating Matlab, looks like we don't need to worry about it any longer. The site which was most interested in using Matlab and PVAccess has also specifically not been interested in Java, so P4P might be more palatable anyway. |
step1 for fixing tcp connections delays, based on the discussion #3338
@jacomago: You were right that it can be solved per IP address. In fact, it seems it was already and furthermore it seems that the single-threaded version works when running it multi-threaded: #3348 (comment) |
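As an illustration of what "solved per IP address" can mean in general, here is a sketch in which all channels on the same server address share one connection entry, so only one connection is created per address. This is not the code from #3348; `ConnectionsByAddress`, the placeholder `String` connection and the host names are assumptions for the example.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical registry: at most one connection entry per server address */
class ConnectionsByAddress
{
    /** A String stands in for a real TCP connection handler */
    private final ConcurrentHashMap<InetSocketAddress, String> connections = new ConcurrentHashMap<>();
    private final AtomicInteger created = new AtomicInteger();

    /** @return Existing connection for the address, or a newly created one */
    String getConnection(final InetSocketAddress address)
    {
        // computeIfAbsent runs the factory at most once per key, also with concurrent callers
        return connections.computeIfAbsent(address, addr ->
        {
            created.incrementAndGet();
            return "connection to " + addr;
        });
    }

    /** @return Number of connections that were actually created */
    int createdCount()
    {
        return created.get();
    }

    public static void main(final String[] args)
    {
        final ConnectionsByAddress registry = new ConnectionsByAddress();
        // Unresolved addresses avoid DNS lookups in this example
        final InetSocketAddress iocA = InetSocketAddress.createUnresolved("ioc-a", 5075);
        final InetSocketAddress iocB = InetSocketAddress.createUnresolved("ioc-b", 5075);
        registry.getConnection(iocA);   // first channel on ioc-a
        registry.getConnection(iocA);   // second channel on ioc-a reuses the entry
        registry.getConnection(iocB);
        System.out.println("Connections created: " + registry.createdCount());
    }
}
```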
As stated in the initial comment of this pull request, this pull request was intended as a starting point for a discussion. Since the discussion has resulted in the two pull requests #3345 and #3348, and since no discussion has taken place in this pull request for two weeks, I conclude that this pull request has served its purpose. I therefore close this pull request. |
We have encountered an issue with an IOC that accepts UDP but drops TCP packets: when loading an OPI that contains a PV on the IOC in question, any PVs that have not been connected to yet show as being disconnected.
It appears that the reason is that the code that connects over TCP after receiving a UDP response to a search query gets stuck waiting for the TCP connection to be established. It appears that this code runs on a single thread, and that as a consequence the TCP connections to subsequent IOCs are not established.
This pull request calls the code to establish the TCP connection on a separate thread. Please note that I do not know this code base, and that this pull request is meant as a starting point for a discussion and perhaps as inspiration for a real fix. In particular, I don't know which parts of the code are thread-safe and in which way. This has to be considered carefully.
I can reproduce the issue by the following steps:
@kasemir, I would like to ask whether you think that this looks like a plausible fix for the error that we have observed?