
CSSTUDIO-3113 PVAClient: Accept TCP connections on a separate thread. #3338


Closed

Conversation

abrahamwolk
Collaborator

We have encountered an issue with an IOC that accepts UDP but drops TCP packets: when loading an OPI that contains a PV on the IOC in question, any PVs that have not yet been connected to show as disconnected.

It appears that the reason is that the code that connects over TCP after receiving a UDP response to a search query gets stuck waiting for the TCP connection to be established. It appears that this code runs on a single thread, and that as a consequence the TCP connections to subsequent IOCs are not established.

This pull request calls the code to establish the TCP connection on a separate thread. Please note that I do not know this code base, and that this pull request is meant as a starting point for a discussion and perhaps as inspiration for a real fix. In particular, I don't know which parts of the code are thread-safe and in which way. This has to be considered carefully.
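For illustration, a minimal, self-contained sketch of the idea (hypothetical addresses and demo class; the actual change wraps the connection setup inside PVAClient): each TCP connection attempt runs on its own thread, so one unresponsive IOC no longer stalls the others.

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ConnectOnThreadDemo
    {
        public static void main(String[] args)
        {
            // Hypothetical addresses: one IOC that drops TCP, one that responds
            final InetSocketAddress[] iocs =
            {
                new InetSocketAddress("192.0.2.1", 5075),
                new InetSocketAddress("localhost", 5075)
            };
            for (InetSocketAddress addr : iocs)
                new Thread(() ->
                {
                    try (Socket socket = new Socket(addr.getAddress(), addr.getPort()))
                    {
                        System.out.println("Connected to " + addr);
                    }
                    catch (Exception ex)
                    {
                        System.out.println("Failed to connect to " + addr + ": " + ex);
                    }
                }, "TCP connect " + addr).start();
        }
    }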

I can reproduce the issue by the following steps:

  1. Running a few soft IOCs locally
  2. Creating an OPI with Text Update widgets displaying them
  3. Configuring the firewall to drop TCP packets on the port of one of the soft IOCs
  4. Refreshing the OPI.

@kasemir, I would like to ask whether you think that this looks like a plausible fix for the error that we have observed?

@kasemir
Collaborator

kasemir commented Mar 24, 2025

Looks like the same problem we had with channel access, epics-base/jca#36

Thoughts from that:

When you have an IOC that replies to a search with "Talk to me via TCP on address:port" but then doesn't respond on that TCP address:port, you have a problem. Can't operate like that, need to fix the IOC (file handle resources, ...) and/or the network (firewall). Still, on the client side we need something better than hanging forever.

In your tests, does it really hang forever? Or do you see a long connection timeout (minutes...)?

For CA, we decided that failing right away is bad because then you enter a tight loop of search, get UDP response, fail to connect via TCP, search, ...
Dropping the channel from the search means that you'll never connect, even after the IOC gets restarted.
So, for CA, the decision was to implement a configurable TCP connection timeout, https://github.com/epics-base/jca/pull/78/files
I suggested a short default for the timeout, like the 1 second timeout used in the caget command line, but the value of 120 secs was chosen to match the original problem. Good to have a fix that looks exactly like the problem?!

Moving the TCP connection to a new thread wouldn't be my first choice. Creates additional threads that we otherwise don't need, plus should still have a connection timeout (with a default of ~5 seconds?).

Can you try that? Add a connection timeout, ~5 sec. The screens would then connect somewhat sluggishly, not perfect, but then checking the log will show the connection errors and provide good pointers to the underlying problem, which IOC is failing to allow connections.

@kasemir
Collaborator

kasemir commented Mar 24, 2025

For example, try this:

--- a/core/pva/src/main/java/org/epics/pva/common/SecureSockets.java
+++ b/core/pva/src/main/java/org/epics/pva/common/SecureSockets.java
@@ -158,12 +158,17 @@ public class SecureSockets
     public static Socket createClientSocket(final InetSocketAddress address, final boolean tls) throws Exception
     {
         initialize();
+        final int timeout_ms = 2000; // TODO get from a new PVASettings.EPICS_PVA_...
         if (! tls)
-            return new Socket(address.getAddress(), address.getPort());
-
+        {
+            Socket s = new Socket();
+            s.connect(address, timeout_ms);
+            return s;
+        }
         if (tls_client_sockets == null)
             throw new Exception("TLS is not supported. Configure EPICS_PVA_TLS_KEYCHAIN");
-        final SSLSocket socket = (SSLSocket) tls_client_sockets.createSocket(address.getAddress(), address.getPort());
+        final SSLSocket socket = (SSLSocket) tls_client_sockets.createSocket();
+        socket.connect(address, timeout_ms);
         socket.setEnabledProtocols(PROTOCOLS);
         // Handshake starts when first writing, but that might delay SSL errors, so force handshake before we use the socket
         socket.startHandshake();
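
With this change, a stalled connection attempt surfaces as a java.net.SocketTimeoutException once timeout_ms elapses (that is the documented behavior of Socket.connect(SocketAddress, int)), so it turns into a loggable error naming the failing address instead of an indefinite hang.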

@abrahamwolk
Collaborator Author

abrahamwolk commented Mar 25, 2025

Looks like the same problem we had with channel access, epics-base/jca#36

Thanks for the link; it seems to be basically the same issue!

Thoughts from that:

When you have an IOC that replies to a search with "Talk to me via TCP on address:port" but then doesn't respond on that TCP address:port, you have a problem. Can't operate like that, need to fix the IOC (file handle resources, ...) and/or the network (firewall). Still, on the client side we need something better than hanging forever.

Yes, I think Phoebus should (correctly) show the faulty IOC as disconnected, but it should be possible to connect to the other IOCs without issue.

In your tests, does it really hang forever? Or do you see a long connection timeout (minutes...)?

I tested this (however, with sample size n=1), and it seems to depend: on the terminal, timeouts are printed approximately every 2m15s; when first starting Phoebus, connections to the other (responsive) PVs were established after some minutes. However, after reloading my test OPI, for over 40 minutes no connections to responsive PVs were established (i.e., all connections are blocked by the PV that is not responsive). Timeouts are still printed to the terminal approximately every 2m15s.

For CA, we decided that failing right away is bad because then you enter a tight loop of search, get UDP response, fail to connect via TCP, search, ... Dropping the channel from the search means that you'll never connect, even after the IOC gets restarted. So, for CA, the decision was to implement a configurable TCP connection timeout, https://github.com/epics-base/jca/pull/78/files I suggested a short default for the timeout, like the 1 second timeout used in the caget command line, but the value of 120 secs was chosen to match the original problem. Good to have a fix that looks exactly like the problem?!

Moving the TCP connection to a new thread wouldn't be my first choice. Creates additional threads that we otherwise don't need, plus should still have a connection timeout (with a default of ~5 seconds?).

There are at least two drawbacks of creating threads:

  1. Overhead in creating threads.
  2. It requires careful consideration of the concurrency.

The main benefit would be that it enables connections to be established independently, without interfering with one another.

Can you try that? Add a connection timeout, ~5 sec. The screens would then connect somewhat sluggishly, not perfect, but then checking the log will show the connection errors and provide good pointers to the underlying problem, which IOC is failing to allow connections.

Thank you for the diff! I tried running with it, and the observed behavior is similar to the behavior I described above, with two exceptions:

  1. As expected, the timeout is now printed to the terminal every two seconds instead of every 2m15s.
  2. When starting Phoebus, connections to the responsive PVs seem to be established quickly after the first timeout has occurred. However, after reloading the OPI, the connections seem to be established more slowly. After reloading the OPI several times, it seems to take longer and longer. Could it be that the faulty IOC has been put into the queue of IOCs to connect to several times?

@georgweiss
Collaborator

Before dismissing threading...

The issue that triggered this investigation was an OPI with a large number of PVs, some of which were running on the crippled IOC. At that point there was no clue, neither in the OPI nor in the log, how to identify the IOC. Connecting on separate threads would in this case have a better chance to reveal which PVs were actually responsive when the OPI was reloaded to trigger new connection attempts.

Agree that threading comes with potential issues, but those can be managed. As for resources... we could consider a common thread pool to support the connection process. Granted, the thread pool would be exhausted if the number of non-responsive PVs is larger than the pool size.

@kasemir
Collaborator

kasemir commented Mar 25, 2025

...diff .. tried running with it, and the observed behavior is similar to the behavior I described above

I think we need to understand this before adding threads and really making it more complicated.

With a connection timeout of about 2 seconds, I would accept that some connections to "good" IOCs are delayed but only by about those 2 seconds.
Maybe the issue here is this:

  1. Sending UDP search
  2. Getting UDP reply
  3. Attempting TCP connection
  4. TCP times out after 2 seconds
  5. Re-schedule the search

.. and then we enter the loop right away, so overall we're constantly in that 2 second wait-for-timeout that blocks everything else.

Can you increase the log level to the point where you see all the search and related details to check if that's what's happening, and if not, what is happening?

If the scenario is like that, we need to modify this part

                search.register(channel, false /* not "now" but eventually */);

to delay the search. Register the search such that the SearchedChannel's search_period is already at say half the MAX_SEARCH_PERIOD.

@abrahamwolk
Collaborator Author

abrahamwolk commented Mar 26, 2025

@kasemir I think it's not so easy to understand what is happening based on the log messages, so I ran it in a debugger to understand what is going on. In the debugger, I can easily see cases where there are already entries in search_buckets at indices equal to or greater than bucket + SEARCH_SOON_DELAY. (There are also entries in search_buckets at indices less than bucket + SEARCH_SOON_DELAY.)

What do you think about modifying this part in register() in the class ChannelSearch

            if (!now)
                bucket = (bucket + SEARCH_SOON_DELAY)  % search_buckets.size();

to

            if (!now)
                bucket = (bucket - 1)  % search_buckets.size();

If I understand correctly, search_buckets is essentially a ring-buffer, and by subtracting 1 from the bucket number, the channel would be put almost as far back as possible (except for other channels in the same bucket that it is placed into, which may be searched afterwards).
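
One wrinkle worth noting: Java's % operator keeps the sign of the dividend, so (bucket - 1) % search_buckets.size() yields -1 when bucket is 0. Stepping backwards in a ring buffer needs Math.floorMod, as this small demo shows (the bucket count of 30 is purely illustrative):

    public class RingIndexDemo
    {
        public static void main(String[] args)
        {
            final int size = 30;  // illustrative bucket count
            int bucket = 0;
            System.out.println((bucket - 1) % size);             // -1: not a valid index
            System.out.println(Math.floorMod(bucket - 1, size)); // 29: last slot of the ring
        }
    }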

Perhaps the argument now should also be more fine-grained: perhaps there should be options for search now, search soon, search after all other currently active searches have finished?

@abrahamwolk
Collaborator Author

I should also add: I think the code eventually does get to search for the other channels; however, it apparently may take some time. In search_buckets, the PV that times out is represented multiple times, while the other PVs are represented only once. Should PVs that cannot be connected to be represented several times in search_buckets, or is this a bug?

@kasemir
Collaborator

kasemir commented Mar 26, 2025

The search bucket business is complicated. I've added about 40 lines of comments to explain it, but when I re-read that now I'm not 100% sure.

the PV that times out is represented multiple times

This comment suggests that we are indeed not removing a channel from all buckets for some reason, which might explain what you see:

    public PVAChannel unregister(final int channel_id)
    {
...
            // NOT removing `searched` from all `search_buckets`.
            // .. reasons...

Maybe that's part of the problem? For failed searches, should we remove the channel from all buckets, then schedule it with a long delay?
Not sure if the longest delay means bucket = (bucket - 1) % search_buckets.size(); or if that requires a "-2", see comments about MAX_SEARCH_PERIOD.

@abrahamwolk
Collaborator Author

abrahamwolk commented Mar 27, 2025

    public PVAChannel unregister(final int channel_id)
    {
...
            // NOT removing `searched` from all `search_buckets`.
            // .. reasons...

I have a very incomplete understanding of this code; however, to me this source code comment looks incorrect: just before register() adds the PV back into search_buckets at index bucket + SEARCH_SOON_DELAY, it also adds the channel back to searched_channels. The logic mentioned under .. reasons .. therefore seems incorrect: .. reasons .. includes the statement that runSearches() will drop the channel from search_buckets because it's no longer listed in searched_channels. However, I think this will not hold for the occurrences of the PV at indices before bucket + SEARCH_SOON_DELAY once the PV has been added back to search_buckets after a failed connection attempt, since register() (which schedules the PV for search again in bucket bucket + SEARCH_SOON_DELAY) will have added the PV back to searched_channels.

I think that it is likely that searched should be removed from search_buckets in unregister().

Maybe that's part of the problem? For failed searches, should we remove the channel from all buckets, then schedule it with a long delay? Not sure if the longest delay means bucket = (bucket - 1) % search_buckets.size(); or if that requires a "-2", see comments about MAX_SEARCH_PERIOD

Good point; the index must be thought about carefully if this approach is adopted. However, after thinking more about this issue, I think that the logic implemented in ChannelSearch has more to do with the UDP searches for PVs than with the establishment of the subsequent TCP connections. The code appears to be written under the assumption that establishing a TCP connection will succeed quickly after a UDP response has been obtained, and when it doesn't, the code doesn't behave well.

Therefore, I believe that a non-blocking approach to establishing the TCP connections is more "correct" than delaying the UDP searches and changing the timeouts on the establishment of TCP connections. One way to achieve this is with threads, as in this pull request. While "normal" OS-level threads are resource-intensive and limited in number, Java 21 introduced "Virtual Threads", which consume far fewer resources. [1] states:

Virtual threads are Java threads that are implemented by the Java runtime rather than the OS. The main difference between virtual threads and the traditional threads—which we've come to call platform threads—is that we can easily have a great many active virtual threads, even millions, running in the same Java process.

What do you think about the following two ideas?

  1. Implement a removal of searched from search_buckets in unregister().
  2. When we have adopted Java 21 for the project, we try to implement the establishment of a TCP connection similarly to the current pull request, but using Virtual Threads instead (see the sketch below).

[1] https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-8AEDDBE6-F783-4D77-8786-AC5A79F517C0
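
As a minimal sketch of idea 2 (Java 21+; hypothetical address, timeout, and demo class, not the actual PVAClient code), the connection attempt moves onto a cheap virtual thread while still keeping a timeout:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class VirtualConnectDemo
    {
        public static void main(String[] args) throws InterruptedException
        {
            final InetSocketAddress address = new InetSocketAddress("localhost", 5075);
            // Virtual threads are cheap enough to create one per connection attempt
            Thread worker = Thread.ofVirtual()
                                  .name("TCP connect " + address)
                                  .start(() ->
            {
                try (Socket socket = new Socket())
                {
                    socket.connect(address, 5000); // keep a timeout nonetheless
                    System.out.println("Connected to " + address);
                }
                catch (Exception ex)
                {
                    System.out.println("Cannot connect to " + address + ": " + ex);
                }
            });
            worker.join();
        }
    }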

@shroffk
Member

shroffk commented Mar 27, 2025

So if I understand correctly, right now we will only do step 1.

Implement a removal of searched from search_buckets in unregister().

Should this be completed before we create the Phoebus 5 release, or is this a unique enough situation that we should not hold up the release?

@kasemir
Collaborator

kasemir commented Mar 27, 2025

I'm for doing two things:
Add a configurable timeout to the connection,
and removal of searched from search_buckets in unregister().

Goal is to not hang for a long time.
A short connection delay from those ~2 sec timeouts is acceptable at times when your network is messed up.

@georgweiss
Collaborator

georgweiss commented Mar 27, 2025

The conclusion from the ESS point of view should be that this is rare. We've used pva extensively, and this was identified only very recently when an IOC got into a deadlock state.

@abrahamwolk
Collaborator Author

I'm for doing two things: Add a configurable timeout to the connection, and removal of searched from search_buckets in unregister().

Goal is to not hang for a long time. A short connection delay from those ~2 sec timeouts is acceptable at times where your network is messed up.

I agree that this is an acceptable fix. However, I think using virtual threads with a longer timeout than 2 seconds is a more correct fix, and I am slightly more in favor of this solution.

@shroffk
Member

shroffk commented Mar 27, 2025

I feel like I have less than an incomplete understanding of the code (but that hasn't stopped me from having an opinion).
A lot of our IOCs have hundreds of PVs which are all usually displayed together... when we open one of those screens and the IOC is not behaving correctly, having a few more threads might not significantly mitigate the problem.

Could we have a setup which works with 2 queues (buckets)?
The priority queue is where all new PV connection requests start off... if they fail, because the name resolution did not result in a reply or the TCP connection failed even after a successful name resolution, then we move those PVs into the secondary queue.
We always try to connect to all the PVs in the primary queue first and only then move to the secondary queue.
Maybe the schedule with a long delay achieves the same.
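
A minimal sketch of that two-queue idea (hypothetical class and method names; PVs stood in for by their name strings only):

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class TwoQueueScheduler
    {
        // New connection requests start in the primary queue
        private final Queue<String> primary = new ArrayDeque<>();
        // PVs whose search or TCP connection failed wait in the secondary queue
        private final Queue<String> secondary = new ArrayDeque<>();

        public void add(final String pv)        { primary.add(pv); }
        public void markFailed(final String pv) { secondary.add(pv); }

        /** @return Next PV to try, draining the primary queue before the secondary one */
        public String next()
        {
            final String pv = primary.poll();
            return pv != null ? pv : secondary.poll();
        }
    }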

@abrahamwolk
Collaborator Author

abrahamwolk commented Mar 27, 2025

@shroffk: I think fundamentally the problem here is due to the call to establish a TCP connection being blocking: if the establishment of a TCP connection takes a long time, then all subsequent UDP broadcasts and TCP connection attempts are delayed by that amount of time.

I think that a "correct" fix of this issue is to make the TCP connection attempts non-blocking, for instance by running them on "virtual threads" [1] which have low overhead. The idea is to create one virtual thread for each PV.

In addition to the problem of TCP connection attempts being blocking, it seems likely that there is an error in the code that results in "extra" connection attempts being made to IOCs that respond over UDP but not over TCP. This exacerbates the problem of the TCP connection attempts being blocking. Luckily, it seems that this can be fixed easily and separately.

If you have two threads, then one runs into the same problem of blocking TCP connection attempts, just on each thread instead. If the two threads share a queue, then this resolves the problem when there is just one IOC that is not responsive over TCP. However, if there are two IOCs that don't respond over TCP, then one again runs into the same issue. I think the solution is to have the same number of (virtual) threads as one has PVs, since then one doesn't run into this problem.

[1] https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html

@jacomago
Contributor

Some thoughts:

  • Archiver Appliance uses a thread pool for the PVs, @kasemir I believe you added it. And Murali recently rewrote it. So it has an interest in the solution to this problem.
  • When testing the phoebus pvAccess code in the archiver, I ran into a problem similar to this one when working with a gateway. So it would be nice if it can be solved in the library. I'm not sure the timeout solution is a good solution in the Archiver; we don't want to stop archiving X PVs while we wait for Y PVs to time out. So I think making this library non-blocking would be better. My only solution at the time had been to create a PVAClient per PV, but on testing that made the resources skyrocket, so I decided to hope that this locking issue was infrequent without a gateway.
  • I'm not sure virtual threads are necessary to make this non-blocking. Can we not use a CompletableFuture for this problem?
  • Does this problem need to be tackled per PV? Or can we solve it per IOC? I don't know if it's possible to have a thread per ip-address being connected to?

@abrahamwolk
Collaborator Author

abrahamwolk commented Mar 28, 2025

Archiver Appliance uses a thread pool for the PVs, @kasemir I believe you added it. And Murali recently rewrote it. So it has an interest in the solution to this problem.

As I understand [1], the idea is to supply an abstraction for threads that is very lightweight and makes constructions such as thread pools unnecessary.

For example, [1] states:

the inability to spawn very many platform threads—the only implementation of threads available in Java for many years—has bred practices designed to cope with their high cost. These practices are counterproductive when applied to virtual threads, and must be unlearned. Moreover, the vast difference in cost informs a new way of thinking about threads that may be foreign at first.

and [1] also states:

Blocking a platform thread is expensive because it holds on to the thread—a relatively scarce resource—while it is not doing much meaningful work. Because virtual threads can be plentiful, blocking them is cheap and encouraged. Therefore, you should write code in the straightforward synchronous style and use blocking I/O APIs.

To me, this sounds like an abstraction that can significantly simplify much code, as well as lead to increased correctness.

Of course, an implementation using virtual threads must be tested first before it is adopted, to see if it really performs well.

When testing the phoebus pvAccess code in the archiver I came to a similar problem as to this when working with a gateway. So would be nice if it can be solved in the library. I'm not sure the time out solution is a good solution in the Archiver, we don't want to stop archiving X PVs while we wait for Y PVs to timeout. So I think making this library non-blocking would be better. My only solution at the time had been to create a PVAClient per PV, but on testing that made the resources skyrocket so I decided to hope that this locking issue was infrequent without a gateway.

In the proposed solution with virtual threads, only the establishment of the TCP connection would be run on separate virtual threads. No new instances of PVAClient would be created.

I'm not sure virtual threads are necessary to make this non-blocking. Can we not use a CompletableFuture for this problem?

A future is a synchronization primitive, not in itself a means to run code in a non-blocking way. For the computation behind a future to be performed in a non-blocking way, it still needs to run on a separate thread.
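
For illustration (hypothetical address and demo class, not proposed code): wrapping the connect in CompletableFuture.supplyAsync still blocks a thread, by default one from ForkJoinPool.commonPool(), for as long as the connect is pending:

    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.util.concurrent.CompletableFuture;

    public class FutureConnectDemo
    {
        public static void main(String[] args)
        {
            final InetSocketAddress address = new InetSocketAddress("localhost", 5075);
            final CompletableFuture<Socket> pending = CompletableFuture.supplyAsync(() ->
            {
                try
                {   // This call still blocks -- just a common-pool thread instead of ours
                    final Socket socket = new Socket();
                    socket.connect(address, 2000);
                    return socket;
                }
                catch (Exception ex)
                {
                    throw new RuntimeException(ex);
                }
            });
            try
            {
                System.out.println("Connected: " + pending.join());
            }
            catch (Exception ex)
            {
                System.out.println("Failed: " + ex.getMessage());
            }
        }
    }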

Does this problem need to be tackled per PV? Or can we solve it per IOC? I don't know if it's possible to have a thread per ip-address being connected to?

Probably it could be solved per IP address; however, I think that implementing that correctly is likely to be non-trivial and prone to difficult-to-debug bugs due to the concurrency.

[1] https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html

@kasemir
Collaborator

kasemir commented Apr 1, 2025

Step 1, the socket connection timeout, is the most important step to keep us from blocking for a long time.
Once that's merged, the next step should be twofold:
In PVAClient.handleSearchResponse, move the code from tcp_handlers.computeIfAbsent.. on into a thread, as @abrahamwolk suggested in this PR. Instead of plain threads, lightweight threads do make sense, except they're a new concept. We have for now allowed building core-pva with JDK 8, because that's what Matlab required. Maybe that no longer matters?
The other part of the second step would be in ChannelSearch.unregister. It really needs to remove the channel from all buckets, because otherwise, after a TCP connection failure, the re-search will happen too frequently:

public PVAChannel unregister(final int channel_id)
{
    final SearchedChannel searched = searched_channels.remove(channel_id);
    if (searched != null)
    {
        logger.log(Level.FINE, () -> "Unregister search for " + searched.channel.getName() + " " + channel_id);
        synchronized (search_buckets)
        {   // Remove from all buckets!!
            for (LinkedList<SearchedChannel> bucket : search_buckets)
                bucket.remove(searched);
        } 
        return searched.channel;
    }
    return null;
}

@georgweiss
Collaborator

georgweiss commented Apr 1, 2025

As per discussion last week, would it make sense to put the entire procedure on a (lightweight) thread, i.e. UDP search + TCP connection?

@kasemir
Collaborator

kasemir commented Apr 1, 2025

No. The search messages are not a problem. They are efficiently merged such that one UDP search contains all the channels that are scheduled for being searched in that time slot. The creation of a new TCP socket connection is usually no problem, either, but moving that to a (LW) thread avoids the occasional delays.

@abrahamwolk
Collaborator Author

abrahamwolk commented Apr 2, 2025

We have for now allowed building core-pva with JDK 8, because that's what Matlab required. Maybe that no longer matters?

That's a good point and a good question. I don't know the background to this and I don't know the answer to the question. It should surely be considered, I think.

The other part of the second step would be in ChannelSearch.unregister. It really needs to remove the channel from all buckets, because otherwise, after a TCP connection failure, the re-search will happen to frequently:

I think it is correct that it needs to be removed. However, this is not the only issue: when re-loading an OPI, the same PV can occur multiple times in search_buckets, and with different channel IDs. I believe that the issue is that search_buckets and searched_channels are not cleared when re-loading an OPI. Maybe there's another issue though, or maybe there are multiple ones.

I think the code should likely maintain the invariant that channels appear at most once in the data structures representing the search state. If we add assertions checking this in the places where the data structures are modified, we should be able to see whether this invariant is maintained.
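
A hypothetical sketch of such a check (channels stood in for by their names; the real code would use SearchedChannel, and the JVM needs -ea for assertions to fire):

    import java.util.LinkedList;
    import java.util.List;

    public class SearchBucketInvariant
    {
        /** Assert that a channel occurs in at most one search bucket.
         *  Hypothetical helper; call wherever search_buckets is modified.
         */
        static void assertAtMostOnce(final List<LinkedList<String>> search_buckets,
                                     final String channel)
        {
            final long occurrences = search_buckets.stream()
                                                   .filter(bucket -> bucket.contains(channel))
                                                   .count();
            assert occurrences <= 1
                 : channel + " appears " + occurrences + " times in search_buckets";
        }
    }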

@jacomago
Contributor

jacomago commented Apr 2, 2025

In regards to core-pva building with JDK 8: I made an issue a while ago to update, #2776. Matlab supports Java up to 17 now: https://www.mathworks.com/support/requirements/language-interfaces.html

I don't really know why it's a requirement to load the core-pva library from Matlab, though. Doesn't p4p address the need for pvAccess from Matlab? And Matlab seems to keep that API more up to date (understandably).

Isn't this thread getting away from the issue though? @kasemir, can't you make some PRs for your suggestions? I think it's better to make some changes and test them rather than circling around the most optimal solution. Ideally I think we should have a unit test that reproduces the issue, which we can then create solutions against.

@abrahamwolk
Collaborator Author

abrahamwolk commented Apr 2, 2025

Matlab supports java to 17 now

Thanks for the information. Unfortunately, Virtual Threads were introduced in Java 21 (https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html), but Java 17 is at least much more recent than Java 8.

Isn't this thread getting away from the issue though?

I don't think it's getting away from the issue: we're still determining the set of issues present in the code.

Ideally I think we should have a unit test that reproduces the issue, that we can then create solutions against.

It's very good to have unit tests, but full correctness of concurrent code is often difficult to test using unit tests. In this case, it seems that connections are eventually made (or not made, in the case of unreachable IOCs). We don't just want the code to be functionally correct (i.e., eventually establish connections); we want it to implement the intended algorithm correctly, so as to also achieve the desired performance characteristics in all cases.

@kasemir
Collaborator

kasemir commented Apr 2, 2025

As for accommodating Matlab, looks like we don't need to worry about it any longer. The site which was most interested in using Matlab and PVAccess has also specifically not been interested in Java, so P4P might be more palatable anyway.

shroffk added a commit that referenced this pull request Apr 2, 2025
step1 for fixing tcp connections delays, based on the discussion #3338
@abrahamwolk
Collaborator Author

abrahamwolk commented Apr 3, 2025

Does this problem need to be tackled per PV? Or can we solve it per IOC? I don't know if it's possible to have a thread per ip-address being connected to?

Probably it could be solved per IP-address, however I think that implementing that correctly is likely to be non-trivial and prone to difficult-to-debug bugs due to the concurrency.

@jacomago: You were right that it can be solved per IP address. In fact, it seems it already is, and furthermore the single-threaded version appears to work when run multi-threaded: #3348 (comment)

@abrahamwolk
Collaborator Author

As stated in the initial comment of this pull request, this pull request was intended as a starting point for a discussion.

Since the discussion has resulted in the two pull requests #3345 and #3348, and since no discussion has taken place in this pull request for two weeks, I conclude that this pull request has served its purpose. I therefore close this pull request.
