Add PerfdatawriterConnection to handle network requests for Perfdata Writers #10668

Open
jschmidt-icinga wants to merge 7 commits into master from perfdata-writers-connection-handling

@jschmidt-icinga (Contributor) commented Dec 10, 2025

Description

This unifies the connection handling for all perfdata writers into a single class PerfdataWriterConnection that provides a blocking interface (using promises) to the underlying asynchronous operations.
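As a rough illustration of what "a blocking interface (using promises)" over an asynchronous operation can look like, here is a minimal sketch using only the standard library. `AsyncSend` and `BlockingSend` are hypothetical names and not the API of this PR; the real class runs its operations as coroutines on an asio strand rather than on a plain thread.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for an asynchronous send: the work happens on another
// thread and completion is reported through a callback.
std::thread AsyncSend(std::string message, std::function<void(std::size_t)> onDone)
{
    return std::thread([msg = std::move(message), done = std::move(onDone)]() {
        done(msg.size()); // pretend that all bytes were written
    });
}

// Blocking facade: starts the asynchronous operation and waits for its result.
std::size_t BlockingSend(const std::string& message)
{
    std::promise<std::size_t> promise;
    std::future<std::size_t> future = promise.get_future();

    std::thread worker = AsyncSend(message, [&promise](std::size_t bytes) {
        promise.set_value(bytes); // hand the result to the waiting caller
    });

    std::size_t bytes = future.get(); // blocks until the callback has run
    worker.join();                    // the promise outlives the worker thread
    return bytes;
}
```

A caller such as a work-queue handler would simply call `BlockingSend("...")` and get the result (or, with `set_exception`, an exception) as if the operation were synchronous.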

All in all this is a huge code reduction and deduplication (as long as you don't count the added unit-tests) and should fix the issues with the work-queues being stuck on shutdown.

Fixes #10159, possibly fixes #10629

Connection handling

  • Connections are now established lazily whenever a message is sent to the server. Some writers already worked that way, while others connected at startup and kept their connections open for as long as they needed them or until the server disconnected.
  • HTTP-based writers disconnect after sending a message and receiving the response unless the server sets the keep-alive flag. Currently we do not request keep-alive on our side, but the writers could easily do that if we want to.
  • A disconnect timeout can be started. When it expires, the connection is closed and enters a state in which no further reconnect attempts are made. Expiry also cancels all outstanding operations, including a slow/unresponsive send or TLS handshake.
  • All system errors are handled internally by the connection class and lead to a retry after an exponentially increasing timeout, similar to the backoff strategy implemented by Add OTLPMetricsWriter #10685. The writers still need to handle the HTTP status codes from the response, which the connection class doesn't touch in any way.
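The retry behavior in the last bullet can be sketched as follows. This is only an illustration of exponential backoff under assumed parameters: `BackoffDelay`, `RetryWithBackoff`, the delay cap and the attempt limit are hypothetical and are not taken from the PR, where retries are ended by the disconnect timeout instead.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <system_error>
#include <thread>

using Millis = std::chrono::milliseconds;

// Delay before retry number `attempt` (0-based): initial * 2^attempt, capped at `max`.
Millis BackoffDelay(unsigned attempt, Millis initial, Millis max)
{
    Millis delay = initial;
    for (unsigned i = 0; i < attempt && delay < max; ++i) {
        delay *= 2;
    }
    return std::min(delay, max);
}

// Calls `op` until it succeeds or `maxAttempts` (an assumption of this sketch)
// is reached. Only std::system_error triggers a retry; anything else propagates.
bool RetryWithBackoff(const std::function<void()>& op, unsigned maxAttempts, Millis initial, Millis max)
{
    for (unsigned attempt = 0; attempt < maxAttempts; ++attempt) {
        try {
            op();
            return true;
        } catch (const std::system_error&) {
            if (attempt + 1 < maxAttempts) {
                std::this_thread::sleep_for(BackoffDelay(attempt, initial, max));
            }
        }
    }
    return false;
}
```

HTTP status codes are deliberately not part of this loop, matching the split described above: transport errors are retried by the connection, while the response is left to the writer.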

Rationale

A simpler solution to the disconnect problem would have been possible. Because a cancelled send or handshake doesn't allow for a graceful shutdown of the TLS connection anyway, especially when the server is unresponsive, a simple close on the stream's socket would be enough to cancel all outstanding operations. However, many writers only keep temporary stream objects in the functions where the messages are sent and currently don't track the state of the connection, so this would also have required serious refactoring, and a different one for each writer.

Instead of doing the same thing over and over for each writer, I chose to reduce code duplication and abstract the connection handling out of the individual writers and only fix it in one place. Using async operations and an asio strand was convenient, because now every yield leaves the connection object in a defined state, without needing any atomic variables or mutexes, which makes the disconnect handling much simpler.

Other changes

In addition to the changes to connection handling some other minor refactoring has been done:

  • OpenTsdbWriter now also uses a work queue like all the other writers. Previously this writer would send its data directly in its CheckResultHandler which meant that if a server was slow or unresponsive it could have blocked check-result processing and slowed down the whole process/cluster.
  • ElasticsearchWriter locked a mutex on each Flush() so it could be called from both outside and inside the work queue. This was changed to always queue the Flush() onto the work queue instead, which makes the behavior more similar to what InfluxDbCommonWriter does. It has been pointed out that with an unresponsive writer this will queue more and more calls to Flush(). That shouldn't be a problem, because the queue is relatively huge (10000000 items), and if a writer is stuck so long that a 10s flush timer fills up this queue, it would already have been filled ten times over by unprocessed messages. It would be relatively easy to fix by stopping the timer and restarting it only after each flush has gone through the queue.
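The timer fix mentioned in the last bullet can be sketched with a single-threaded model. `FlushScheduler` and its members are hypothetical names; a plain deque stands in for the work queue and an explicit `ProcessOne()` call stands in for the queue's worker thread.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

class FlushScheduler
{
public:
    // Timer callback: queue one flush and leave the timer disarmed until it has run.
    void OnTimerExpired()
    {
        if (!m_TimerArmed) {
            return; // a previous flush is still waiting in the queue
        }
        m_TimerArmed = false;
        m_Queue.emplace_back([this] {
            ++m_FlushCount;      // the actual flush would happen here
            m_TimerArmed = true; // re-arm only after the flush went through the queue
        });
    }

    // Stand-in for the work queue processing a single item.
    bool ProcessOne()
    {
        if (m_Queue.empty()) {
            return false;
        }
        std::function<void()> task = std::move(m_Queue.front());
        m_Queue.pop_front();
        task();
        return true;
    }

    std::size_t QueuedItems() const { return m_Queue.size(); }
    unsigned FlushCount() const { return m_FlushCount; }

private:
    std::deque<std::function<void()>> m_Queue;
    bool m_TimerArmed = true;
    unsigned m_FlushCount = 0;
};
```

With this shape, a stuck queue holds at most one pending flush no matter how often the timer would have fired in the meantime.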

More refactoring could be done on the HTTP-based writers (InfluxDb and Elasticsearch) in the future. For example, they could make use of the new HttpMessage classes in remote/httpmessage.hpp so they can directly push their objects into the body of a request instead of joining them with newlines. Both writers could also make use of chunked encoding and stream their ndjson-formatted messages until a timeout expires. I've left that out of this PR because it isn't necessary to fix the underlying issue, but PerfdataWriterConnection could easily be extended in the future to make this possible.

Status

Ready and waiting for reviews. I still have to manually test with the real backends at some point but I don't expect any differences.

@cla-bot cla-bot bot added the cla/signed label Dec 10, 2025
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 9 times, most recently from cb64ef1 to 21c2575 Compare December 12, 2025 07:45
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 21c2575 to 29f91c9 Compare December 15, 2025 11:08
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 2 times, most recently from d018cae to a66f9ed Compare December 17, 2025 14:15
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 4 times, most recently from 6391859 to 8ddd29d Compare January 28, 2026 11:35
@jschmidt-icinga jschmidt-icinga added this to the 2.16.0 milestone Jan 29, 2026
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 8 times, most recently from 40c5481 to 4acf0b3 Compare February 4, 2026 07:50
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 4 times, most recently from df8e74e to 32639d2 Compare February 6, 2026 12:34
@jschmidt-icinga jschmidt-icinga changed the title (WIP) Add PerfdatawriterConnection to handle network requests for Perfdata Writers Add PerfdatawriterConnection to handle network requests for Perfdata Writers Feb 6, 2026
@jschmidt-icinga jschmidt-icinga marked this pull request as ready for review February 6, 2026 12:35
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 32639d2 to a20ccae Compare February 6, 2026 13:24
@yhabteab (Member) left a comment

Just some nitpicking, nothing special!

@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from a20ccae to 7909e90 Compare February 12, 2026 13:19
@jschmidt-icinga
Contributor Author

Updates:

  • Simplified connection state tracking. Most of it was from an earlier iteration and was no longer necessary. I've also removed the special handling of the initial state to skip the timeout. This was added when I didn't have the handlers call a "manual" disconnect after the work queue is finished anyway.
  • Increased the default for the disconnect timeout to 10s, up from the 0.5s it was before (we can change this any time and it is configurable anyway).
  • Made the flush timer on ElasticsearchWriter and InfluxDb*Writer restart itself only when the flush has gone through the work queue.
  • Other improvements suggested by @yhabteab above.

@yhabteab (Member) left a comment

Here are only 3 comments for the time being, but I have an idea for how we can de-duplicate the Send method using a template and will talk about it next week.

Comment on lines 91 to 92
boost::asio::streambuf buf;
boost::beast::http::async_read(*stream, buf, parser, yc);
Member

This is definitely a case for our newly introduced IncomingHttpResponse class. The Parse method doesn't exist yet, so you'll need to add it.

Suggested change
- boost::asio::streambuf buf;
- boost::beast::http::async_read(*stream, buf, parser, yc);
+ IncomingHttpResponse response{stream};
+ response.Parse(yc);
+ if (!response.keep_alive()) {
+     Disconnect(yc);
+ }
+ promise.set_value(response.Parser().release());

Contributor Author

Even now I still don't see the point of using these classes. If it doesn't make the code any simpler and I can't use both of them for the symmetry, I'd rather stick with the plain boost::beast types, at least for now. I'll give it one last shot to see if I can use them with a few modifications and now that everything has gotten so much simpler.

@yhabteab (Member) left a comment

I'm done looking at the non-test code now and, apart from the following question/remark, have left some inline comments as well.

Apologies if I'm missing something obvious, but your connection handling class doesn't seem to prevent any requests after a Disconnect. If I recall correctly, you said that you don't want to reuse the same instance after a disconnect, but I don't see anything in the code that would enforce that behavior. Shouldn't the Send method reject requests when called after the *Disconnect* methods have done their work? Think of a scenario where a writer has a bunch of items in its queue and currently has no working connection. If the writer is paused, it will call StartDisconnectTimeout and then immediately join the work queue, which will wait until all items are processed. In the meantime, StartDisconnectTimeout will wait for the timeout to expire and then call Disconnect, which will close the socket and reset the stream. However, the Send method can still be called while the writer is processing its work queue, and after that there's nothing that would cancel those Send calls, which results in the exact same blocking behavior this PR is trying to prevent.

Comment on lines +39 to +48
ReceiveCheckResults(1, ServiceState::ServiceCritical, [](const CheckResult::Ptr& cr) {
cr->SetOutput(GetRandomString("####", 1024UL * 1024));
});

// Accept the connection, but don't read from it to leave the client hanging.
Accept();
GetDataUntil("####");

// Now try to pause.
PauseWriter();
Member

This test case doesn't actually test one critical aspect of the pause functionality: it should return immediately (give or take 1-2 seconds) after the disconnect timeout expires, even if there is still pending work in the queue. After all, this PR is about making sure that we don't block the shutdown/reload process for an indefinite amount of time, so this should be the most important aspect to test.
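One way to check the timing aspect described in this comment is to measure how long the pause takes against a deliberately stuck operation. The following is a minimal standard-library sketch, not the PR's test fixture: `StuckSender`, `Pause` and `Elapsed` are hypothetical names, and a condition variable stands in for the unresponsive server.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simulates a send that never completes on its own and only returns once cancelled.
class StuckSender
{
public:
    StuckSender() : m_Worker([this] { Run(); }) {}
    ~StuckSender() { Pause(std::chrono::milliseconds{0}); }

    // Waits up to `disconnectTimeout` for the worker to finish, then cancels and joins it.
    void Pause(std::chrono::milliseconds disconnectTimeout)
    {
        {
            std::unique_lock<std::mutex> lock(m_Mutex);
            m_Cv.wait_for(lock, disconnectTimeout, [this] { return m_Finished; });
            m_Cancelled = true;
        }
        m_Cv.notify_all();
        if (m_Worker.joinable()) {
            m_Worker.join();
        }
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_Cv.wait(lock, [this] { return m_Cancelled; }); // "stuck" until cancelled
        m_Finished = true;
    }

    std::mutex m_Mutex;
    std::condition_variable m_Cv;
    bool m_Cancelled = false;
    bool m_Finished = false;
    std::thread m_Worker; // declared last so the state above exists before the thread starts
};

// Runs `fn` and returns how long it took.
template<class Fn>
std::chrono::milliseconds Elapsed(Fn&& fn)
{
    auto start = std::chrono::steady_clock::now();
    fn();
    return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start);
}
```

A test would then assert that `Elapsed([&] { sender.Pause(timeout); })` is at least the timeout and below the timeout plus some slack, for example the 1-2 seconds suggested above.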

Contributor Author

The reasoning is that if a stuck Send() is exited once, it will always return/throw immediately for subsequent queue entries. But I will look into testing that more directly.

Member

"it will always return/throw immediately for subsequent queue entries"

Currently it doesn't reject the requests immediately, though. It spawns a new coroutine for each request, and only then does the EnsureConnected function decide whether to proceed. This is not ideal, as it spawns unnecessary coroutines and does extra work for each request for nothing. I think it should reject the request immediately if the m_Stopped flag is set, without spawning a new coroutine.

Contributor Author

I've added an early abort to both Send() methods now, but I still haven't found a good way to test with a long queue of multiple blocking Send()s without the unit-tests taking a very long time. Technically perfdata_elasticsearchwriter/pause_with_pending_work does do that, because ElasticsearchWriter sends out the same message for a state-changing check-result twice.

You can leave this conversation open for now and I'm going to revisit this again before we merge this.
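The early abort discussed in this thread can be sketched like this. Only the `m_Stopped` flag name comes from the review comment; `StoppableConnection`, `Stop()` and the operation counter are hypothetical, and incrementing the counter stands in for spawning a coroutine.

```cpp
#include <atomic>
#include <cstddef>
#include <stdexcept>
#include <string>

class StoppableConnection
{
public:
    // Called once the disconnect has completed; no further sends are accepted.
    void Stop() { m_Stopped.store(true); }

    std::size_t Send(const std::string& message)
    {
        // Reject before doing any work, so a long queue drains quickly after a stop.
        if (m_Stopped.load()) {
            throw std::runtime_error("connection was stopped");
        }

        ++m_SpawnedOperations; // hypothetical stand-in for spawning a coroutine
        return message.size();
    }

    unsigned SpawnedOperations() const { return m_SpawnedOperations; }

private:
    std::atomic<bool> m_Stopped{false};
    unsigned m_SpawnedOperations = 0;
};
```

Every queue entry processed after `Stop()` then costs one flag check and one exception instead of a coroutine spawn.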

@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 2 times, most recently from c039a10 to 2eff710 Compare February 19, 2026 13:07
@jschmidt-icinga
Contributor Author

Should have addressed most of your comments @yhabteab, unless otherwise noted in a review comment.

@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 2eff710 to 0a4ad3a Compare February 19, 2026 13:32
Also add a Clear() function to clear existing log content.
There's a set of two tests for each perfdatawriter, just
to make sure they can connect and send data that looks reasonably
correct, and to make sure pausing actually works while the connection
is stuck.

Then there's a more in-depth suite of tests for PerfdataWriterConnection
itself, to verify that connection handling works well in all types
of scenarios.
@jschmidt-icinga jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 0a4ad3a to 20890d5 Compare February 20, 2026 10:16
@yhabteab (Member) left a comment

I'm still not sure why the Cancel method can't be merged with Disconnect. Other than that and the two inline comments below, the code looks fine to me now. I still have to look at the tests in detail, though, because I haven't done that yet. It would also be nice if we could talk about this in person, including whether it's worth adding a bit more complexity by templating the Send method.

"but I've an idea, how we can de-duplicate the Send method using a template and will talk about it next week."

@@ -140,66 +140,6 @@ void InfluxdbCommonWriter::ExceptionHandler(boost::exception_ptr exp)
//TODO: Close the connection, if we keep it open.
Member

I would say this TODO can also be removed now?


IoEngine::SpawnCoroutine(m_Strand, [&](boost::asio::yield_context yc) {
try {
Disconnect(std::move(yc));
Member

Just as an idea, you could also change Disconnect to take the yield context by reference instead.

Comment on lines +211 to +213
if (!stream->next_layer().IsVerifyOK()) {
BOOST_THROW_EXCEPTION(std::runtime_error{"TLS certificate validation failed"});
}
Member

This now omits the actual TLS verification error:

if (!tlsStream.IsVerifyOK()) {
    BOOST_THROW_EXCEPTION(std::runtime_error(
        "TLS certificate validation failed: " + std::string(tlsStream.GetVerifyError())
    ));
}

#include <random>
#include <sstream>
#include <boost/test/unit_test.hpp>
#include <boost/random.hpp>
Member

This include seems to be a leftover from a previous implementation, since you now appear to use a std random generator instead.

Comment on lines +13 to +20
#define CHECK_JOINS_WITHIN(t, timeout) \
BOOST_REQUIRE_MESSAGE(t.TryJoinFor(timeout), "Thread not joinable within timeout.")
#define TEST_JOINS_WITHIN(t, timeout) \
BOOST_REQUIRE_MESSAGE(t.TryJoinFor(timeout), "Thread not joinable within timeout.")

#define REQUIRE_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")
#define CHECK_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")
#define TEST_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")
Member

These macros don't seem to be used anywhere.

}

template<class Rep, class Period>
bool TryJoinFor(std::chrono::duration<Rep, Period> timeout)
Member

The name TryJoinFor doesn't reflect what the function does. It should be renamed to something like TryJoinAfter or TryJoinWithin to better convey its purpose.
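For context, a timed join of this kind can be built on a future that becomes ready when the thread function returns. The sketch below uses the suggested name `TryJoinWithin`. The `TimedThread` class is hypothetical and is not the test helper from the PR, whose implementation is only partially visible here.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

class TimedThread
{
public:
    template<class Fn>
    explicit TimedThread(Fn fn)
    {
        // The packaged_task makes the future ready once `fn` has returned.
        std::packaged_task<void()> task(std::move(fn));
        m_Done = task.get_future();
        m_Thread = std::thread(std::move(task));
    }

    ~TimedThread()
    {
        if (m_Thread.joinable()) {
            m_Thread.join();
        }
    }

    // Joins and returns true if the thread finishes within `timeout`;
    // returns false if it is still running or has already been joined.
    template<class Rep, class Period>
    bool TryJoinWithin(std::chrono::duration<Rep, Period> timeout)
    {
        if (!m_Thread.joinable()) {
            return false;
        }
        if (m_Done.wait_for(timeout) != std::future_status::ready) {
            return false;
        }
        m_Thread.join();
        return true;
    }

private:
    std::future<void> m_Done;
    std::thread m_Thread;
};
```

A test can then write `BOOST_REQUIRE(t.TryJoinWithin(std::chrono::seconds{1}))` to fail instead of hanging when the thread is stuck.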

return false;
}

bool TryJoin() { return TryJoinFor(std::chrono::milliseconds{0}); }
Member

This is currently unused.

* @param state The state the check results should have
* @param fn A function that will be passed the current check-result
*/
void ReceiveCheckResults(
Member

There's another implementation of this for the notification component, so instead of duplicating it here I would move it into the utils file and change its arguments to also accept the checkable object.

*/
BOOST_AUTO_TEST_CASE(connection_refused)
{
CloseAcceptor();
Member

This call is superfluous, since you never started the acceptor.

ResetSocket();
}

void ResetSocket()
Member

I would name this ResetStream just like in the PerfdataWriterConnection class.

return ret;
}

std::size_t ReadRemainingData()
Member

This is unused.

);
}

void CloseConnection()
Member

Same here.

Development

Successfully merging this pull request may close these issues.

InfluxDB2 Writer (and quite possibly all other Data Outputs) may inhibit core functionality "systemctl reload icinga2" hangs

2 participants