Add PerfdatawriterConnection to handle network requests for Perfdata Writers by jschmidt-icinga · Pull Request #10668 · Icinga/icinga2

jschmidt-icinga · 2025-12-10T10:00:35Z

Description

This unifies the connection handling for all perfdata writers into a single class PerfdataWriterConnection that provides a blocking interface (using promises) to the underlying asynchronous operations.
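
As a rough illustration of what "a blocking interface (using promises)" over an asynchronous operation can look like, here is a minimal sketch using only the standard library. `AsyncSendSketch` and `BlockingSendSketch` are hypothetical names made up for this example; the real class runs its operations as coroutines on the I/O engine, not on a plain `std::thread`.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for an asynchronous send: the completion handler
// is invoked on another thread once the "I/O" has finished.
inline std::thread AsyncSendSketch(std::string payload, std::function<void(std::size_t)> onDone)
{
	return std::thread([payload = std::move(payload), onDone = std::move(onDone)] {
		onDone(payload.size()); // Pretend all bytes were written.
	});
}

// Blocking wrapper: starts the asynchronous operation, then waits on a
// future until the completion handler has fulfilled the promise.
inline std::size_t BlockingSendSketch(const std::string& payload)
{
	std::promise<std::size_t> promise;
	std::future<std::size_t> future = promise.get_future();

	std::thread io = AsyncSendSketch(payload, [&promise](std::size_t written) {
		promise.set_value(written);
	});

	std::size_t written = future.get(); // Blocks until set_value() was called.
	io.join(); // Make sure the handler is done before the promise goes away.
	return written;
}
```

The caller only ever sees a synchronous function; where and how the work actually runs stays an implementation detail behind the promise.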

All in all this is a huge code reduction and deduplication (as long as you don't count the added unit-tests) and should fix the issues with the work-queues being stuck on shutdown.

Fixes #10159, possibly fixes #10629

Connection handling

Connections are now established lazily whenever a message is being sent to the server. Some writers already worked that way, while others connected at the start and kept their connections around for as long as they needed them or until the server disconnected.
HTTP-based writers will disconnect after sending a message and receiving the response unless the keep-alive flag is set by the server. Currently we do not request keep-alive on our side, but that could easily be done on the side of the writers if we want to.
A disconnect timeout can be started. Once it expires, the connection is closed and enters a state in which no further reconnect attempts are made. Expiry also cancels all outstanding operations, including a slow or unresponsive send or TLS handshake.
All system errors are handled by the connection class internally and lead to a retry after an exponentially increasing timeout, similar to the backoff strategy implemented in #10685 (Add OTLPMetricsWriter). The writers obviously still need to handle the HTTP status codes from the response, which the connection class doesn't touch in any way.
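
To make the "exponentially increasing timeout" concrete, here is a small sketch of how such a retry delay could be computed. The function name, the base delay of 100ms and the cap of 30s are assumptions chosen for illustration; they are not taken from this PR or from #10685.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper: delay to wait before retry number `attempt` (0-based).
// The delay doubles with every failed attempt and is capped at `maxDelay`.
inline std::chrono::milliseconds BackoffDelaySketch(
	unsigned attempt,
	std::chrono::milliseconds base = std::chrono::milliseconds{100},
	std::chrono::milliseconds maxDelay = std::chrono::milliseconds{30000})
{
	std::chrono::milliseconds delay = base;

	// Stop doubling as soon as the cap is reached, so the value can't overflow.
	for (unsigned i = 0; i < attempt && delay < maxDelay; ++i) {
		delay *= 2;
	}

	return std::min(delay, maxDelay);
}
```

With these defaults the first retries wait 100ms, 200ms, 400ms and so on, and every retry from the tenth onwards waits the full 30s.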

Rationale

A simpler solution to the disconnect problem would have been possible. Because a cancelled send or handshake doesn't allow for a graceful shutdown of the TLS connection anyway, especially when the server is unresponsive, a simple close on the stream's socket would be enough to cancel all outstanding operations. However, many writers only keep temporary stream objects in the functions where the messages are sent and currently don't track the state of the connection, so this would also have required serious refactoring, and a different one for each writer.

Instead of doing the same thing over and over for each writer, I chose to reduce code duplication and abstract the connection handling out of the individual writers and only fix it in one place. Using async operations and an asio strand was convenient, because now every yield leaves the connection object in a defined state, without needing any atomic variables or mutexes, which makes the disconnect handling much simpler.

Other changes

In addition to the changes to connection handling some other minor refactoring has been done:

OpenTsdbWriter now also uses a work queue like all the other writers. Previously this writer would send its data directly in its CheckResultHandler which meant that if a server was slow or unresponsive it could have blocked check-result processing and slowed down the whole process/cluster.
ElasticsearchWriter locked a mutex on each Flush() so it could be called from both outside and inside the work queue. This was changed to always queue the Flush() onto the work queue instead, which makes the behavior more similar to what InfluxDbCommonWriter does. It has been pointed out that with an unresponsive writer this will pile up more and more Flush() calls on the queue. That shouldn't be a problem: the queue is relatively huge (10000000 items), and if a writer is stuck long enough for a 10s flush timer to fill it up, it will already have been filled ten times over by unprocessed messages. It would also be relatively easy to fix by stopping the timer and restarting it only after each flush has gone through the queue.
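
The "restart the timer only after the flush has gone through the queue" idea can be sketched in a single-threaded model like the one below. `FlushSchedulerSketch` and all of its members are hypothetical and only model the control flow; the real writers use Icinga's Timer and WorkQueue classes.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical model of a flush timer feeding a work queue.
class FlushSchedulerSketch
{
public:
	// Called whenever the (hypothetical) flush timer fires.
	void OnTimerExpired()
	{
		if (!m_TimerArmed) {
			return; // A flush is still pending, don't queue another one.
		}

		m_TimerArmed = false; // Stays disarmed until the queued flush has run.
		m_Queue.emplace_back([this] {
			++m_Flushes;         // Stand-in for the real Flush().
			m_TimerArmed = true; // Re-arm only after the flush went through the queue.
		});
	}

	// Runs everything that is currently queued, like the queue's worker would.
	void Drain()
	{
		while (!m_Queue.empty()) {
			auto task = std::move(m_Queue.front());
			m_Queue.pop_front();
			task();
		}
	}

	std::size_t PendingTasks() const { return m_Queue.size(); }
	int Flushes() const { return m_Flushes; }

private:
	std::deque<std::function<void()>> m_Queue;
	bool m_TimerArmed = true;
	int m_Flushes = 0;
};
```

However often the timer fires while the queue is stuck, at most one flush is ever waiting in it.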

More refactoring could be done on the HTTP-based writers (InfluxDb and Elasticsearch) in the future. For example they could make use of the new HttpMessage classes in remote/httpmessage.hpp so they can directly push their objects into the body of a request instead of joining them with new-lines. Both writers could also make use of chunked encoding and stream their ndjson formatted messages until a timeout expires. I've left that out of this PR because it isn't necessary to fix the underlying issue, but PerfdataWriterConnection could easily be extended in the future to make this possible.

Status

Ready and waiting for reviews. I still have to manually test with the real backends at some point but I don't expect any differences.

yhabteab

Just some nitpicking, nothing special!

lib/perfdata/elasticsearchwriter.cpp

lib/perfdata/CMakeLists.txt

lib/perfdata/opentsdbwriter.cpp

lib/perfdata/perfdatawriterconnection.cpp

jschmidt-icinga · 2026-02-12T13:30:11Z

Updates:

Simplified connection state tracking. Most of it was from an earlier iteration and was no longer necessary. I've also removed the special handling of the initial state to skip the timeout. This was added when I didn't have the handlers call a "manual" disconnect after the work queue is finished anyway.
Increased the default for the disconnect timeout to 10s, up from the 0.5s it was before (we can change this any time and it is configurable anyway).
Made the flush timer on ElasticsearchWriter and InfluxDb*Writer restart itself only when the flush has gone through the work queue.
Other improvements suggested by @yhabteab above.

yhabteab

Only 3 comments for the time being, but I have an idea for how we can de-duplicate the Send method using a template and will talk about it next week.

lib/perfdata/elasticsearchwriter.cpp

yhabteab · 2026-02-13T16:23:23Z

lib/perfdata/perfdatawriterconnection.cpp

+						boost::asio::streambuf buf;
+						boost::beast::http::async_read(*stream, buf, parser, yc);


This is definitely a case for our newly introduced IncomingHttpResponse class. The Parse method doesn't exist yet, so you'll need to add it.

Suggested change

-						boost::asio::streambuf buf;
-						boost::beast::http::async_read(*stream, buf, parser, yc);
+						IncomingHttpResponse response{stream};
+						response.Parse(yc);
+						if (!response.keep_alive()) {
+							Disconnect(yc);
+						}
+						promise.set_value(response.Parser().release());

Even now I still don't see the point of using these classes. If it doesn't make the code any simpler and I can't use both of them for the symmetry, I'd rather stick with the plain boost::beast types, at least for now. I'll give it one last shot to see if I can use them with a few modifications and now that everything has gotten so much simpler.

yhabteab

I'm done looking at the non-test code now and, apart from the following question/notice, have left some inline comments as well.

Apologies if I'm missing something obvious, but your connection handling class doesn't seem to prevent any requests after a Disconnect. If I recall correctly, you said that you don't want to reuse the same instance after a disconnect, but I don't see anything in the codebase that would enforce that behavior. Shouldn't the Send method reject requests when called after the *Disconnect* methods have done their work? Think of a scenario where a writer has a bunch of items in its queue and currently has no working connection. If the writer is paused, it will call StartDisconnectTimeout and then immediately join the work queue, which will wait until all items are processed. In the meantime, StartDisconnectTimeout will wait for the timeout to expire and then call Disconnect, which will close the socket and reset the stream to its initial state. However, the Send method can still be called while the writer is processing its work queue, and after that there's nothing that would cancel those Send calls, which results in the exact same blocking behavior this PR is trying to prevent.

lib/perfdata/perfdatawriterconnection.cpp

test/perfdata-elasticsearchwriter.cpp

yhabteab · 2026-02-16T15:24:50Z

test/perfdata-elasticsearchwriter.cpp

+	ReceiveCheckResults(1, ServiceState::ServiceCritical, [](const CheckResult::Ptr& cr) {
+		cr->SetOutput(GetRandomString("####", 1024UL * 1024));
+	});
+
+	// Accept the connection, but don't read from it to leave the client hanging.
+	Accept();
+	GetDataUntil("####");
+
+	// Now try to pause.
+	PauseWriter();


This test case doesn't actually test one critical aspect of the pause functionality, which is that it should return immediately (give or take 1-2 seconds) after the disconnect timeout expires, even if there is still pending work in the queue. After all, this PR is about making sure that we don't block the shutdown/reload process for an indefinite amount of time, so this should be the most important aspect to test.

The reasoning is that if a stuck Send() is exited once, it will always return/throw immediately for subsequent queue entries. But I will look into testing that more directly.

it will always return/throw immediately for subsequent queue entries

Currently, it doesn't immediately reject the requests though. Instead it spawns a new coroutine for each request, and only then does the EnsureConnected function detect whether to proceed or not. This is not ideal, as it spawns unnecessary coroutines and does extra work for each request for nothing. I think it should immediately reject the request if the m_Stopped flag is set, without spawning a new coroutine.

I've added an early abort to both Send() methods now, but I still haven't found a good way to test with a long queue of multiple blocking Send()s without the unit-tests taking a very long time. Technically perfdata_elasticsearchwriter/pause_with_pending_work does do that, because ElasticsearchWriter sends out the same message for a state-changing check-result twice.
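
For readers following along, the early abort discussed here boils down to checking the stopped flag before any work is started. The class below is a hypothetical, heavily simplified sketch: only the `m_Stopped` name comes from the discussion above, everything else (class name, `Stop()`, the counter, the exception type) is an assumption made for this example.

```cpp
#include <atomic>
#include <stdexcept>
#include <string>

// Hypothetical, simplified connection that rejects sends once stopped.
class StoppableConnectionSketch
{
public:
	void Stop() { m_Stopped.store(true); }

	void Send(const std::string& payload)
	{
		// Reject up front, before a coroutine or any I/O would be started.
		if (m_Stopped.load()) {
			throw std::runtime_error("connection is stopped, not sending");
		}

		(void)payload;  // The real class would hand this to an async operation.
		++m_StartedSends; // Stand-in for spawning the coroutine and sending.
	}

	int StartedSends() const { return m_StartedSends; }

private:
	std::atomic<bool> m_Stopped{false};
	int m_StartedSends = 0;
};
```

Once one stuck send has been cancelled and the flag is set, every remaining queue entry fails in constant time instead of paying for a coroutine first.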

You can leave this conversation open for now and I'm going to revisit this again before we merge this.

lib/perfdata/perfdatawriterconnection.cpp

jschmidt-icinga · 2026-02-19T13:15:47Z

Should have addressed most of your comments @yhabteab, unless otherwise noted in a review comment.

lib/perfdata/gelfwriter.cpp

lib/perfdata/opentsdbwriter.cpp

lib/perfdata/perfdatawriterconnection.cpp

test/test-thread.hpp

Also add a Clear() function to clear existing log content.

There's a set of two tests for each perfdatawriter, just to make sure they can connect and send data that looks reasonably correct, and to make sure pausing actually works while the connection is stuck. Then there's a more in-depth suite of tests for PerfdataWriterConnection itself, to verify that connection handling works well in all types of scenarios.

lib/perfdata/gelfwriter.cpp

yhabteab

I'm still not sure why the Cancel method can't be merged with Disconnect. Other than that and the 2 inline comments below, the code looks fine to me now. I still have to look at the tests in detail though, because I haven't done that yet. It would also be nice if we could talk about this in person, and about whether it's worth adding a bit more complexity by templating the Send method.

but I've an idea, how we can de-duplicate the Send method using a template and will talk about it next week.

yhabteab · 2026-02-20T11:42:12Z

lib/perfdata/influxdbcommonwriter.cpp

@@ -140,66 +140,6 @@ void InfluxdbCommonWriter::ExceptionHandler(boost::exception_ptr exp)
 	//TODO: Close the connection, if we keep it open.


I would say, this TODO can also be removed now??

yhabteab · 2026-02-20T11:50:16Z

lib/perfdata/perfdatawriterconnection.cpp

+
+	IoEngine::SpawnCoroutine(m_Strand, [&](boost::asio::yield_context yc) {
+		try {
+			Disconnect(std::move(yc));


Just as an idea, you could also change Disconnect to take the yield context by reference instead.

yhabteab · 2026-02-20T12:08:34Z

lib/perfdata/perfdatawriterconnection.cpp

+							if (!stream->next_layer().IsVerifyOK()) {
+								BOOST_THROW_EXCEPTION(std::runtime_error{"TLS certificate validation failed"});
+							}


This now omitted the actual TLS handshake error:

icinga2/lib/perfdata/influxdbcommonwriter.cpp

Lines 192 to 195 in 27c954d

	if (!tlsStream.IsVerifyOK()) {
		BOOST_THROW_EXCEPTION(std::runtime_error(
			"TLS certificate validation failed: " + std::string(tlsStream.GetVerifyError())
		));

yhabteab · 2026-02-20T15:29:56Z

test/utils.cpp

+#include <random>
 #include <sstream>
 #include <boost/test/unit_test.hpp>
+#include <boost/random.hpp>


This include seems to be a leftover from a previous implementation, since you seem to be using a std random generator instead.

yhabteab · 2026-02-20T15:37:01Z

test/test-thread.hpp

+#define CHECK_JOINS_WITHIN(t, timeout)                                                  \
+	BOOST_REQUIRE_MESSAGE(t.TryJoinFor(timeout), "Thread not joinable within timeout.")
+#define TEST_JOINS_WITHIN(t, timeout)                                                   \
+	BOOST_REQUIRE_MESSAGE(t.TryJoinFor(timeout), "Thread not joinable within timeout.")
+
+#define REQUIRE_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")
+#define CHECK_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")
+#define TEST_JOINABLE(t) BOOST_REQUIRE_MESSAGE(t.Joinable(), "Thread not joinable.")


These macros don't seem to be used anywhere.

yhabteab · 2026-02-20T15:40:01Z

test/test-thread.hpp

+	}
+
+	template<class Rep, class Period>
+	bool TryJoinFor(std::chrono::duration<Rep, Period> timeout)


The name TryJoinFor doesn't reflect what the function does. It should be renamed to something like TryJoinAfter or TryJoinWithin to better convey its purpose.
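
For context, a join with a timeout of this shape can be built on `std::future::wait_for`, since `std::thread` itself offers no timed join. The sketch below is not the TestThread class from this PR; `TimedJoinThreadSketch` is a hypothetical name, and the destructor simply joins, which a real test helper would probably want to avoid.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <utility>

// Hypothetical thread wrapper offering a join with a timeout.
class TimedJoinThreadSketch
{
public:
	template<class Fn>
	explicit TimedJoinThreadSketch(Fn fn)
	{
		// The packaged_task makes m_Done ready as soon as fn has returned.
		std::packaged_task<void()> task{std::move(fn)};
		m_Done = task.get_future();
		m_Thread = std::thread{std::move(task)};
	}

	~TimedJoinThreadSketch()
	{
		if (m_Thread.joinable()) {
			m_Thread.join();
		}
	}

	// Waits up to `timeout` for the thread function to finish and joins the
	// thread if it did. Returns false if the thread is still running.
	template<class Rep, class Period>
	bool TryJoinFor(std::chrono::duration<Rep, Period> timeout)
	{
		if (!m_Thread.joinable()) {
			return true; // Already joined earlier.
		}

		if (m_Done.wait_for(timeout) != std::future_status::ready) {
			return false;
		}

		m_Thread.join();
		return true;
	}

private:
	std::future<void> m_Done;
	std::thread m_Thread;
};
```

A test can then assert that a pause or shutdown finished within a bound instead of hanging forever in `join()`.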

yhabteab · 2026-02-20T15:40:48Z

test/test-thread.hpp

+		return false;
+	}
+
+	bool TryJoin() { return TryJoinFor(std::chrono::milliseconds{0}); }


This is currently unused.

yhabteab · 2026-02-20T15:56:53Z

test/perfdata-perfdatawriterfixture.hpp

+	 * @param state The state the check results should have
+	 * @param fn A function that will be passed the current check-result
+	 */
+	void ReceiveCheckResults(


There's another implementation of this for the notification component, so instead of duplicating it here I would outsource this into the utils file and change its arguments to also accept the checkable object.

yhabteab · 2026-02-20T16:00:55Z

test/perfdata-perfdatawriterconnection.cpp

+ */
+BOOST_AUTO_TEST_CASE(connection_refused)
+{
+	CloseAcceptor();


This is a superfluous call, since you never started the acceptor.

yhabteab · 2026-02-20T16:13:06Z

test/perfdata-perfdatatargetfixture.hpp

+		ResetSocket();
+	}
+
+	void ResetSocket()


I would name this ResetStream just like in the PerfdataWriterConnection class.

yhabteab · 2026-02-20T16:19:57Z

test/perfdata-perfdatatargetfixture.hpp

+		return ret;
+	}
+
+	std::size_t ReadRemainingData()


This is unused.

yhabteab · 2026-02-20T16:20:32Z

test/perfdata-perfdatatargetfixture.hpp

+		);
+	}
+
+	void CloseConnection()


cla-bot bot added the cla/signed label Dec 10, 2025

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 9 times, most recently from cb64ef1 to 21c2575 Compare December 12, 2025 07:45

jschmidt-icinga mentioned this pull request Dec 15, 2025

Create a new ElasticsearchDatastreamWriter to more efficiently store data in elasticsearch. #10577

Merged

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 21c2575 to 29f91c9 Compare December 15, 2025 11:08

jschmidt-icinga mentioned this pull request Dec 15, 2025

(WIP) Add ElasticsearchDatastreamWriter (Follow-up to #10577) #10677

Draft

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 2 times, most recently from d018cae to a66f9ed Compare December 17, 2025 14:15

This was referenced Jan 22, 2026

Add OTLPMetricsWriter #10685

Open

Refactor HttpMessage into generalized templated types #10692

Merged

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 4 times, most recently from 6391859 to 8ddd29d Compare January 28, 2026 11:35

jschmidt-icinga added this to the 2.16.0 milestone Jan 29, 2026

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 8 times, most recently from 40c5481 to 4acf0b3 Compare February 4, 2026 07:50

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 4 times, most recently from df8e74e to 32639d2 Compare February 6, 2026 12:34

jschmidt-icinga changed the title ~~(WIP) Add PerfdatawriterConnection to handle network requests for Perfdata Writers~~ Add PerfdatawriterConnection to handle network requests for Perfdata Writers Feb 6, 2026

jschmidt-icinga marked this pull request as ready for review February 6, 2026 12:35

jschmidt-icinga requested a review from yhabteab February 6, 2026 12:35

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 32639d2 to a20ccae Compare February 6, 2026 13:24

yhabteab reviewed Feb 10, 2026

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from a20ccae to 7909e90 Compare February 12, 2026 13:19

yhabteab reviewed Feb 13, 2026

yhabteab reviewed Feb 16, 2026

yhabteab reviewed Feb 16, 2026

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch 2 times, most recently from c039a10 to 2eff710 Compare February 19, 2026 13:07

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 2eff710 to 0a4ad3a Compare February 19, 2026 13:32

yhabteab reviewed Feb 19, 2026

jschmidt-icinga added 7 commits February 20, 2026 09:52

Import std::chrono_literals into icinga namespace

f4d40c7

Refactor OpenTsdbWriter to use a WorkQueue

8351ff1

Add TestThread class to not get unit-tests stuck in join()s

2cb74d4

Add Assert-Macros for the TestLogger

2923d4a

Also add a Clear() function to clear existing log content.

Add PerfdataWriterConnection class

aae0af5

Use PerfdataWriterConnection in perfdata writers

4219e31

jschmidt-icinga force-pushed the perfdata-writers-connection-handling branch from 0a4ad3a to 20890d5 Compare February 20, 2026 10:16

jschmidt-icinga commented Feb 20, 2026

yhabteab reviewed Feb 20, 2026
