Move TCP connection to thread, fully unregister completed search #3348
Conversation
@abrahamwolk This is basically what you had, moving the connection to a thread. Plus the searches are all cancelled, and we'll have a shorter timeout. If you test this, it might "work". It performs alright when I try it. Assume you have channels A and B both on the same server. Client now handles BOTH replies on separate threads.
Is that fine, the
Playing with a simplified test setup, it looks like Thread 1 calling computeIfAbsent for server A can get stuck in the concurrent hash map, waiting for an eventual timeout. So being stuck inside the computeIfAbsent lambda during a connection issue will only affect those threads that are looking for that server, while threads for other servers can continue.
Very nice! The official documentation [1] also states (I have emphasized by making part of the text bold):

> If the specified key is not already associated with a value, attempts to compute its value using the given mapping function and enters it into this map unless null. The entire method invocation is performed atomically. The supplied function is invoked exactly once per invocation of this method if the key is absent, else not at all. **Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple.** The mapping function must not modify this map during computation.

which sounds correct. (I edited this comment to update to the documentation for Java 17, the version of Java we are using when writing this.)
{
try
final ClientTCPHandler tcp = tcp_handlers.computeIfAbsent(server, addr ->
Reading the documentation [1] of ConcurrentHashMap.computeIfAbsent() more carefully, it seems that other updates to the map are not guaranteed to remain unblocked:

> Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple.
In fact, I seem to get connections that are blocked until a timeout occurs on the establishment of one connection before another connection can be established.
This comment suggests the same: https://stackoverflow.com/a/78230808
I think it's worth thinking about in more detail if we can make the implementation entirely non-blocking.
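A minimal, self-contained sketch of the blocking behavior under discussion. The names (serverA, the sleep-based "connection") are illustrative stand-ins for the real ClientTCPHandler connect; the guaranteed part is that a second computeIfAbsent for the *same* key must wait for the first computation to finish, and per the documentation other updates may be blocked too:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

public class ComputeIfAbsentBlocking
{
    public static void main(String[] args) throws Exception
    {
        final ConcurrentHashMap<String, String> tcp_handlers = new ConcurrentHashMap<>();

        // Thread 1: slow mapping function, simulating a TCP connect that hangs
        final Thread slow = new Thread(() ->
            tcp_handlers.computeIfAbsent("serverA", key ->
            {
                try
                {   TimeUnit.SECONDS.sleep(2);   }
                catch (InterruptedException ex)
                {   /* ignore */   }
                return "connection";
            }));
        slow.start();
        TimeUnit.MILLISECONDS.sleep(100);   // let thread 1 enter the mapping function

        // Thread 2 (main): computeIfAbsent for the same key blocks
        // until thread 1's computation completes, and returns thread 1's value
        final long start = System.nanoTime();
        final String value = tcp_handlers.computeIfAbsent("serverA", key -> "other");
        final long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(value + " after blocking ~" + ms + " ms");
        slow.join();
    }
}
```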
Yes. I'm thinking that tcp_handlers could change from ConcurrentHashMap<server, ClientTCPHandler> to ConcurrentHashMap<server, Future<ClientTCPHandler>>. It would then return the Future<ClientTCPHandler> without delay. The thread might still time out while waiting for that Future to complete, but the computeIfAbsent itself is immediate.
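A sketch of that map-of-futures idea, with plain strings standing in for ClientTCPHandler. Here CompletableFuture.supplyAsync is used just to run the slow "connect" outside the map; the point is that the mapping function only creates a Future, so computeIfAbsent returns at once and only a caller that needs the value blocks:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class FutureMapDemo
{
    public static void main(String[] args) throws Exception
    {
        final ConcurrentHashMap<String, Future<String>> tcp_handlers = new ConcurrentHashMap<>();

        final long start = System.nanoTime();
        // Mapping function only creates the Future; slow work runs outside the map's lock
        final Future<String> future = tcp_handlers.computeIfAbsent("serverA",
            key -> CompletableFuture.supplyAsync(() ->
            {
                try
                {   TimeUnit.SECONDS.sleep(1);   }   // simulate slow TCP connect
                catch (InterruptedException ex)
                {   /* ignore */   }
                return "handler for " + key;
            }));
        final long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("computeIfAbsent immediate=" + (ms < 500));

        // Only the caller that needs the value blocks, and can apply a timeout
        System.out.println(future.get(5, TimeUnit.SECONDS));
    }
}
```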
I think that sounds like a good idea!
Now using ConcurrentHashMap<server, Future<ClientTCPHandler>>, need to test that a little.
synchronized (search_buckets)
{
    for (LinkedList<SearchedChannel> bucket : search_buckets)
        bucket.remove(searched);
LinkedList.remove() only removes the first occurrence of an element. Can an element occur multiple times in a bucket, and if so, should all the occurrences of the element be removed?
boost only adds it once, so it needs to be like this in all places that add?

if (! bucket.contains(searched))
    bucket.add(searched);
Ah, I see. I wasn't aware it was checked when it was added.
Well, it's checked in one place, need to see if it's enforced in all places.
This is perhaps outside the scope of this pull request, but if we change the type of bucket to be a Set [1], then we likely get both:
- More efficient search and removal of elements.
- A guarantee that elements occur at most once.

If the order of insertion is important, an implementation like, e.g., LinkedHashSet could be used.
(Edited to update the link to point to the documentation of Java version 17.)
[1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/Set.html
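A small illustration of both properties with LinkedHashSet (channel names are made up):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class BucketSetDemo
{
    public static void main(String[] args)
    {
        // LinkedHashSet: each element occurs at most once, insertion order is kept
        final Set<String> bucket = new LinkedHashSet<>();
        bucket.add("channelA");
        bucket.add("channelB");
        bucket.add("channelA");   // duplicate, silently ignored
        System.out.println(bucket);

        bucket.remove("channelA");   // removes the single occurrence
        System.out.println(bucket);
    }
}
```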
It's now a Set
@@ -250,32 +250,40 @@ void handleSearchResponse(final int channel_id, final InetSocketAddress server,
channel.setState(ClientChannelState.FOUND);
logger.log(Level.FINE, () -> "Reply for " + channel + " from " + (tls ? "TLS " : "TCP ") + server + " " + guid);

final ClientTCPHandler tcp = tcp_handlers.computeIfAbsent(server, addr ->
// TCP connection can be slow, especially when blocked by firewall, so move to thread
// TODO Lightweight thread? Thread pool?
I would be hesitant to create OS-level threads for each connection attempt. I think it's not unreasonable for an OPI to contain on the order of 100 or even 1000 PVs, and I am not sure creating that many OS-level threads is a good idea. I suggest that we wait until we have adopted Java 21 and then create Virtual Threads instead.
Yes, I think adding a thread pool would be good.
I'm not actually sure it would be 1000 threads per OPI. I believe there is some TCP connection sharing; the case that caused the crash was driven more by the number of IOCs per OPI, which I think is closer to 100 than 1000.
The archiver, however... that's a lot of threads.
Why not upgrade to Java 21? That saves us discussions on thread pools...
It's a thread per TCP connection, and one TCP connection per IP:port.
Now using virtual threads.
Runs with JDK 20 when using --enable-preview, or of course JDK 21.
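A minimal sketch of the virtual-thread creation used here (JDK 21, or JDK 20 with --enable-preview); the Runnable is a trivial stand-in for the actual TCP connect:

```java
public class VirtualThreadDemo
{
    public static void main(String[] args) throws Exception
    {
        final StringBuilder result = new StringBuilder();
        // One virtual thread per (simulated) TCP connection attempt
        final Thread thread = Thread.ofVirtual()
                                    .name("TCP connect demo")
                                    .start(() -> result.append("connected"));
        thread.join();   // join() establishes happens-before for reading 'result'
        System.out.println(result + ", virtual=" + thread.isVirtual());
    }
}
```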
.. but the GitHub build now fails because it's not using JDK 21:
/home/runner/work/phoebus/phoebus/core/pva/src/main/java/org/epics/pva/client/PVAClient.java:[260,14] error: cannot find symbol
Error: symbol: method ofVirtual()
This way, `tcp_handlers` can provide the `Future` without delays, while the slower TCP connection is then awaited when getting the future's value.
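Sketch of the caller side of that scheme: a get() with timeout on a Future for a connection that never completes, for example one blocked by a firewall. Names are illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FutureTimeoutDemo
{
    public static void main(String[] args)
    {
        // A connection attempt that never completes, e.g. blocked by a firewall
        final CompletableFuture<String> stuck_connect = new CompletableFuture<>();
        try
        {
            stuck_connect.get(100, TimeUnit.MILLISECONDS);
            System.out.println("connected");
        }
        catch (TimeoutException ex)
        {
            System.out.println("timed out, giving up on this server");
        }
        catch (Exception ex)
        {
            System.out.println("error: " + ex);
        }
    }
}
```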
@@ -250,32 +256,52 @@ void handleSearchResponse(final int channel_id, final InetSocketAddress server,
channel.setState(ClientChannelState.FOUND);
logger.log(Level.FINE, () -> "Reply for " + channel + " from " + (tls ? "TLS " : "TCP ") + server + " " + guid);

final ClientTCPHandler tcp = tcp_handlers.computeIfAbsent(server, addr ->
// TCP connection can be slow, especially when blocked by firewall, so move to thread
Thread.ofVirtual()
One alternative to this way of implementing the functionality would be to create the thread only for the computation that computes the value of the Future. That way, there will be fewer threads.
I'm thinking of something along the lines of:
final Future<ClientTCPHandler> tcp_future = tcp_handlers.computeIfAbsent(server, addr ->
{
    final CompletableFuture<ClientTCPHandler> create_tcp = new CompletableFuture<>();
    // Attempt the TCP connection on a separate virtual thread:
    Thread.ofVirtual().name("TCP connect " + server)
          .start(() ->
          {
              try
              {
                  create_tcp.complete(new ClientTCPHandler(this, addr, guid, tls));
              }
              catch (Exception ex)
              {
                  logger.log(Level.WARNING, "Cannot connect to TCP " + addr, ex);
                  create_tcp.complete(null);
              }
          });
    return create_tcp;
});
This is just an idea to discuss. (Also, I have not compiled this code, it's just a sketch.)
Good idea. Go ahead and update the branch like that, since this also addresses your other concern about CompletableFuture.completeAsync using OS threads.
I have implemented this idea now and pushed the implementation to the branch of this pull request. (Commit: f982551)
So the conclusion is that it doesn't seem that we can avoid creating many threads this way, but at least we can create virtual threads instead of OS-level threads.
{
try
final CompletableFuture<ClientTCPHandler> create_tcp = new CompletableFuture<>();
create_tcp.completeAsync(() ->
In fact, will this not spawn an OS-level thread? The documentation of CompletableFuture.completeAsync() [1] states:

> Completes this CompletableFuture with the result of the given Supplier function invoked from an asynchronous task using the default executor.
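A quick check (JDK 21) illustrating both the concern and a way around it: the single-argument completeAsync uses the default executor (ForkJoinPool.commonPool(), i.e. OS-level worker threads), while the two-argument overload accepts an explicit executor, here a virtual-thread-per-task executor:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompleteAsyncDemo
{
    public static void main(String[] args) throws Exception
    {
        // Default executor: ForkJoinPool.commonPool(), i.e. OS-level worker threads
        final CompletableFuture<Boolean> on_default = new CompletableFuture<>();
        on_default.completeAsync(() -> Thread.currentThread().isVirtual());
        System.out.println("default executor virtual=" + on_default.get());

        // Explicit executor: run the supplier on a virtual thread instead
        try (ExecutorService virtual = Executors.newVirtualThreadPerTaskExecutor())
        {
            final CompletableFuture<Boolean> on_virtual = new CompletableFuture<>();
            on_virtual.completeAsync(() -> Thread.currentThread().isVirtual(), virtual);
            System.out.println("explicit executor virtual=" + on_virtual.get());
        }
    }
}
```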
…en calling tcp_future.get().
You end up with several instances of "LocalPV2", each with a different client ID? Maybe try to trap that by adding something like this to
One thing I'm doing is re-loading the OPI many times, so maybe it's related to that. Are the data-structures for searching for PVs cleared when reloading OPIs? I added
Yes. I will continue to debug this.
…nizing on 'search_buckets'.
…ashMap<> to HashMap<>.
I am able to reproduce the bug by:
In order to trigger the bug, it is helpful to merge in the
While I have not managed to determine the exact cause of the bug, I no longer encounter the bug if I add the keyword
With the changes I added in 31ccb76, there were two separate approaches to locking
While it seems that the two approaches can co-exist, I think it's not optimal, as it makes the code harder to read and reason about. In order to facilitate reasoning, I therefore removed the old mechanism (i.e., the mechanism under point 1), so that the code instead uses only mechanism 2, which seems to me to be easier to reason about. I have implemented this in the two commits 971903c and bcbc14f.
In the current pull request, on line 293 of
Should this call be moved into the computation that tries to establish the TCP connection? (I.e., into the "inner" virtual thread that computes the value of the
It seems to me that by being called in the "outer" virtual thread, the call to
EDIT: Since I believe this is correct, I have implemented it in the commit 5df73f6.
…blish the TCP connection.
My leaning would be to avoid wholesale
No, unfortunately it doesn't solve the issue.
OK, where are we now with this?
As far as we can tell from looking at it and tests, this is an improvement, but the virtual threads require JDK 21, causing the automated build to fail. Now what?
To me, this pull request also looks like an improvement. I propose to first upgrade Spring Boot and JDK, and then to merge this pull request.
Next step after #3345