Add PerfdataWriterConnection to handle network requests for Perfdata Writers #10668
jschmidt-icinga wants to merge 7 commits into master from
Conversation
yhabteab
left a comment
Just some nitpicking, nothing special!
yhabteab
left a comment
There was a problem hiding this comment.
Here are only 3 comments for the time being, but I have an idea how we can de-duplicate the Send method using a template and will talk about it next week.
    boost::asio::streambuf buf;
    boost::beast::http::async_read(*stream, buf, parser, yc);
This is definitely a case for our newly introduced IncomingHttpResponse class. The Parse method doesn't exist yet, so you'll need to add it.
Suggested change:

    - boost::asio::streambuf buf;
    - boost::beast::http::async_read(*stream, buf, parser, yc);
    + IncomingHttpResponse response{stream};
    + response.Parse(yc);
    + if (!response.keep_alive()) {
    +     Disconnect(yc);
    + }
    + promise.set_value(response.Parser().release());
Even now I still don't see the point of using these classes. If they don't make the code any simpler and I can't use both of them for symmetry, I'd rather stick with the plain boost::beast types, at least for now. I'll give it one last shot to see if I can use them with a few modifications, now that everything has gotten so much simpler.
yhabteab
left a comment
I'm done looking at the non-test code now and, apart from the following question, I've left some inline comments as well.

Apologies if I'm missing something obvious, but your connection handling class doesn't seem to prevent requests after a disconnect. If I recall correctly, you said that you don't want to reuse the same instance after a disconnect, but I don't see anything in the codebase that would enforce that behavior. Shouldn't the Send method reject requests when called after the Disconnect methods have done their work?

Think of a scenario where a writer has a bunch of items in its queue and currently no working connection. If the writer is paused, it will call StartDisconnectTimeout and then immediately join the work queue, which will wait until all items are processed. In the meantime, StartDisconnectTimeout will wait for the timeout to expire and then call Disconnect, which will close the socket and reset the stream. However, the Send method can still be called while the writer is processing its work queue, and after that there's nothing that would cancel the Send method calls, which will result in the exact same blocking behavior this PR is trying to prevent.
    ReceiveCheckResults(1, ServiceState::ServiceCritical, [](const CheckResult::Ptr& cr) {
        cr->SetOutput(GetRandomString("####", 1024UL * 1024));
    });

    // Accept the connection, but don't read from it to leave the client hanging.
    Accept();
    GetDataUntil("####");

    // Now try to pause.
    PauseWriter();
This test case doesn't actually test one critical aspect of the pause functionality: it should return immediately (give or take a second or two) after the disconnect timeout expires, even if there is still pending work in the queue. After all, this PR is about making sure that we don't block the shutdown/reload process for an indefinite amount of time, so this is the most important aspect to test.
The reasoning is that if a stuck Send() is exited once, it will always return/throw immediately for subsequent queue entries. But I will look into testing that more directly.
> it will always return/throw immediately for subsequent queue entries
Currently, it doesn't immediately reject the requests though, but spawns a new coroutine for each request; the EnsureConnected function then detects whether to proceed or not. This is not ideal, as it spawns unnecessary coroutines and does extra work for each request for nothing. I think it should immediately reject the request if the m_Stopped flag is set, without spawning a new coroutine.
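That early-abort idea can be sketched roughly like this (a minimal, self-contained sketch with assumed names such as m_Stopped and Send; not the PR's actual code, which uses Boost.Asio coroutines on a strand):

```cpp
#include <atomic>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical sketch: reject Send() calls immediately once the connection
// has been stopped, instead of spawning a coroutine that only then discovers
// the stream is gone.
class Connection
{
public:
	std::future<std::string> Send(const std::string& request)
	{
		std::promise<std::string> promise;

		if (m_Stopped.load()) {
			// Early abort: fail the promise without scheduling any async work.
			promise.set_exception(std::make_exception_ptr(
				std::runtime_error("connection is shut down")));
			return promise.get_future();
		}

		// ... normally: spawn a coroutine on the strand that performs the
		// async write/read and fulfills the promise. The sketch fulfills it
		// synchronously to stay self-contained.
		promise.set_value("response for: " + request);
		return promise.get_future();
	}

	void Stop() { m_Stopped.store(true); }

private:
	std::atomic<bool> m_Stopped{false};
};
```

The point of the atomic flag is that the check happens on the caller's thread, so no coroutine (and no strand dispatch) is ever scheduled for a doomed request.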
I've added an early abort to both Send() methods now, but I still haven't found a good way to test a long queue of multiple blocking Send()s without the unit tests taking a very long time. Technically, perfdata_elasticsearchwriter/pause_with_pending_work does do that, because ElasticsearchWriter sends out the same message twice for a state-changing check-result.
You can leave this conversation open for now and I'm going to revisit this again before we merge this.
Should have addressed most of your comments @yhabteab, unless otherwise noted in a review comment.
Also add a Clear() function to clear existing log content.
There's a set of two tests for each perfdatawriter, just to make sure they can connect and send data that looks reasonably correct, and to make sure pausing actually works while the connection is stuck. Then there's a more in-depth suite of tests for PerfdataWriterConnection itself, to verify that connection handling works well in all types of scenarios.
yhabteab
left a comment
I'm still not sure why the Cancel method can't be merged with Disconnect; other than that and the 2 inline comments below, the code looks fine to me now. Though I still have to look at the tests in detail, because I've not done that yet. It would also be nice if we could talk in person about whether it's worth adding a bit more complexity by templating the Send method.
> but I've an idea, how we can de-duplicate the Send method using a template and will talk about it next week.
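As a rough illustration of the template idea (all names here are assumptions, not the PR's actual API): the shared promise/queueing logic lives in one templated Send, and only the per-protocol serialization differs between the HTTP-based and the plain line-based writers:

```cpp
#include <future>
#include <sstream>
#include <string>

// Hypothetical sketch: per-protocol WriteMessage() overloads supply the only
// part that differs, while a single templated Send() holds the shared logic.
struct HttpRequest { std::string target, body; };

// Plain line-based writers (e.g. Graphite-style payloads).
std::string WriteMessage(const std::string& line)
{
	return line + "\n";
}

// HTTP-based writers; a stand-in for a boost::beast request.
std::string WriteMessage(const HttpRequest& req)
{
	std::ostringstream out;
	out << "POST " << req.target << " HTTP/1.1\r\n\r\n" << req.body;
	return out.str();
}

template<class Message>
std::future<std::string> Send(const Message& message)
{
	std::promise<std::string> promise;
	// The shared logic (ensure connected, post onto the strand, retry) would
	// live here exactly once; the sketch just serializes synchronously.
	promise.set_value(WriteMessage(message));
	return promise.get_future();
}
```

The benefit is that the connection/promise plumbing is written once, and adding a new writer only requires a new WriteMessage overload.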
    @@ -140,66 +140,6 @@ void InfluxdbCommonWriter::ExceptionHandler(boost::exception_ptr exp)
        //TODO: Close the connection, if we keep it open.
I would say this TODO can also be removed now?
    IoEngine::SpawnCoroutine(m_Strand, [&](boost::asio::yield_context yc) {
        try {
            Disconnect(std::move(yc));
Just as an idea, you could also change Disconnect to take the yield context by reference instead.
    if (!stream->next_layer().IsVerifyOK()) {
        BOOST_THROW_EXCEPTION(std::runtime_error{"TLS certificate validation failed"});
    }
This now omits the actual TLS handshake error:
icinga2/lib/perfdata/influxdbcommonwriter.cpp
Lines 192 to 195 in 27c954d
    #include <random>
    #include <sstream>
    #include <boost/test/unit_test.hpp>
    #include <boost/random.hpp>
The boost/random.hpp include seems to be a leftover from a previous implementation, since you're using a std random generator instead.
    #define CHECK_JOINS_WITHIN(t, timeout) \
        BOOST_REQUIRE_MESSAGE(t.TryJoinFor(timeout), "Thread not joinable within timeout.")
    #define TEST_JOINS_WITHIN(t, timeout) \
        BOOST_REQUIRE_MESSAGE(t.TryJoinFor(timeout), "Thread not joinable within timeout.")

    #define REQUIRE_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")
    #define CHECK_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")
    #define TEST_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")
These macros don't seem to be used anywhere.
    }

    template<class Rep, class Period>
    bool TryJoinFor(std::chrono::duration<Rep, Period> timeout)
The name TryJoinFor doesn't reflect what the function does. It should be renamed to something like TryJoinAfter or TryJoinWithin to better convey its purpose.
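For reference, a minimal sketch of what such a try-join helper can look like (details assumed, not the PR's actual test utility): completion is signalled through a condition variable, and the thread is only joined if it finished within the timeout.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical sketch of a thread wrapper with a bounded-wait join.
class JoinableThread
{
public:
	template<class F>
	explicit JoinableThread(F f) : m_Thread([this, f] {
		f();
		// Signal completion so TryJoinFor can wake up early.
		std::lock_guard<std::mutex> lock(m_Mutex);
		m_Finished = true;
		m_CondVar.notify_all();
	})
	{ }

	// Returns true and joins the thread if it finished within the timeout,
	// false otherwise (the thread is left running).
	template<class Rep, class Period>
	bool TryJoinFor(std::chrono::duration<Rep, Period> timeout)
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		if (!m_CondVar.wait_for(lock, timeout, [this] { return m_Finished; }))
			return false;
		lock.unlock();
		m_Thread.join();
		return true;
	}

private:
	std::mutex m_Mutex;
	std::condition_variable m_CondVar;
	bool m_Finished = false;
	std::thread m_Thread; // declared last so the sync members exist first
};
```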
        return false;
    }

    bool TryJoin() { return TryJoinFor(std::chrono::milliseconds{0}); }
     * @param state The state the check results should have
     * @param fn A function that will be passed the current check-result
     */
    void ReceiveCheckResults(
There's another implementation of this for the notification component, so instead of duplicating it here I would move it into the utils file and change its arguments to also accept the checkable object.
     */
    BOOST_AUTO_TEST_CASE(connection_refused)
    {
        CloseAcceptor();
This call is superfluous, since you never started the acceptor.
        ResetSocket();
    }

    void ResetSocket()
I would name this ResetStream just like in the PerfdataWriterConnection class.
        return ret;
    }

    std::size_t ReadRemainingData()
        );
    }

    void CloseConnection()
Description
This unifies the connection handling for all perfdata writers into a single class, PerfdataWriterConnection, that provides a blocking interface (using promises) to the underlying asynchronous operations. All in all this is a large code reduction and deduplication (as long as you don't count the added unit tests) and should fix the issues with the work queues being stuck on shutdown.
Fixes #10159, possibly fixes #10629
Connection handling

OTLPMetricsWriter (#10685). The writers obviously still need to handle the HTTP status codes from the response, which the connection class doesn't touch in any way.

Rationale
A simpler solution to the disconnect problem would have been possible. Because a cancelled send or handshake doesn't allow for a graceful shutdown of the TLS connection anyway, especially when the server is unresponsive, a simple close on the stream's socket would be enough to cancel all outstanding operations. However, many writers only keep temporary stream objects in the functions where the messages are sent and currently don't track the state of the connection, so this would also have needed some serious refactoring, different for each writer.
Instead of doing the same thing over and over for each writer, I chose to reduce code duplication and abstract the connection handling out of the individual writers and only fix it in one place. Using async operations and an asio strand was convenient, because now every yield leaves the connection object in a defined state, without needing any atomic variables or mutexes, which makes the disconnect handling much simpler.
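The promise-based blocking interface over a serialized executor can be sketched in plain standard C++ (the asio strand is modeled here as a single worker thread, and all names are illustrative, not the actual PerfdataWriterConnection API):

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical sketch: callers block on a std::future while the actual work
// runs serialized on one worker thread, so the worker's state never needs
// atomics or locking of its own — the same property the strand provides.
class BlockingSender
{
public:
	BlockingSender() : m_Worker([this] { Run(); }) { }

	~BlockingSender()
	{
		{
			std::lock_guard<std::mutex> lock(m_Mutex);
			m_Done = true;
		}
		m_CondVar.notify_one();
		m_Worker.join();
	}

	// Blocking interface: post the send onto the "strand", wait for the result.
	std::string Send(const std::string& message)
	{
		auto task = std::make_shared<std::packaged_task<std::string()>>(
			[message] { return "sent: " + message; });
		auto future = task->get_future();
		{
			std::lock_guard<std::mutex> lock(m_Mutex);
			m_Queue.push([task] { (*task)(); });
		}
		m_CondVar.notify_one();
		return future.get(); // blocks until the worker ran the task
	}

private:
	void Run()
	{
		for (;;) {
			std::function<void()> job;
			{
				std::unique_lock<std::mutex> lock(m_Mutex);
				m_CondVar.wait(lock, [this] { return m_Done || !m_Queue.empty(); });
				if (m_Queue.empty())
					return;
				job = std::move(m_Queue.front());
				m_Queue.pop();
			}
			job(); // fulfills the caller's promise
		}
	}

	std::mutex m_Mutex;
	std::condition_variable m_CondVar;
	std::queue<std::function<void()>> m_Queue;
	bool m_Done = false;
	std::thread m_Worker;
};
```

In the real class the worker thread is replaced by coroutines spawned on a boost::asio::strand, which is what gives every yield point a defined connection state.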
Other changes
In addition to the changes to connection handling some other minor refactoring has been done:
- CheckResultHandler, which meant that if a server was slow or unresponsive it could have blocked check-result processing and slowed down the whole process/cluster.
- Flush(), so it could be called from both outside and inside the work queue. This was changed to always queue the Flush() onto the work queue instead. This makes the behavior more similar to what InfluxdbCommonWriter does. It has been pointed out that, in case of an unresponsive writer, this will queue more and more calls to Flush() onto the queue. That shouldn't be a problem, because the queue is relatively huge (10000000 items), and if a writer is stuck so long that a 10s flush timer fills up this queue, it has since been filled ten times over by unprocessed messages. It would be relatively easy to fix by just stopping and restarting the timer after each flush has gone through the queue.

More refactoring could be done on the HTTP-based writers (InfluxDb and Elasticsearch) in the future. For example, they could make use of the new HttpMessage classes in remote/httpmessage.hpp so they can directly push their objects into the body of a request instead of joining them with new-lines. Both writers could also make use of chunked encoding and stream their ndjson-formatted messages until a timeout expires. I've left that out of this PR because it isn't necessary to fix the underlying issue, but PerfdataWriterConnection could easily be extended in the future to make this possible.

Status
Ready and waiting for reviews. I still have to manually test with the real backends at some point but I don't expect any differences.