
CpuBoundWork#CpuBoundWork(): don't spin on atomic int to acquire slot #9990

Open
wants to merge 8 commits into
base: master
from

Conversation

Al2Klimov
Member

@Al2Klimov Al2Klimov commented Feb 7, 2024

Spinning on the atomic int is inefficient and can cause surprisingly
long waiting durations on busy nodes. Instead,
use AsioConditionVariable#Wait() if there are no free slots. It's notified
by others' CpuBoundWork#~CpuBoundWork() once finished.
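
The hand-over scheme described above can be sketched without ASIO. This is a minimal single-threaded model, not the code of this PR: `SlotSemaphore`, `Acquire()` and `Release()` are hypothetical names, and a stored callback stands in for the coroutine that `AsioConditionVariable#Wait()` would suspend and later resume.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical stand-in for the CPU-bound semaphore: a slot counter plus a
// FIFO queue of waiters. A waiter is modelled as a callback; in the PR a
// coroutine would be suspended and resumed instead.
class SlotSemaphore
{
public:
	explicit SlotSemaphore(std::size_t slots) : m_FreeSlots(slots) {}

	// Runs onAcquired right away if a slot is free, otherwise queues it.
	void Acquire(std::function<void()> onAcquired)
	{
		{
			std::unique_lock<std::mutex> lock (m_Mutex);

			if (!m_FreeSlots) {
				m_Waiting.push(std::move(onAcquired));
				return;
			}

			--m_FreeSlots;
		}

		onAcquired();
	}

	// Hands the slot to the oldest waiter if any, otherwise frees it.
	void Release()
	{
		std::function<void()> next;

		{
			std::unique_lock<std::mutex> lock (m_Mutex);

			if (m_Waiting.empty()) {
				++m_FreeSlots;
				return;
			}

			next = std::move(m_Waiting.front());
			m_Waiting.pop();
		}

		next(); // the free counter is untouched: the slot changes owner
	}

	std::size_t FreeSlots()
	{
		std::unique_lock<std::mutex> lock (m_Mutex);
		return m_FreeSlots;
	}

private:
	std::mutex m_Mutex;
	std::size_t m_FreeSlots;
	std::queue<std::function<void()>> m_Waiting;
};
```

With two slots and three requests, the third callback runs only once the first `Release()` hands its slot over, and the free counter stays at 0 during that hand-over. Nobody polls the counter while waiting.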

fixes #9988

Also, the current implementation is a spin-lock. 🙈 #10117 (comment)

@Al2Klimov Al2Klimov requested a review from yhabteab February 7, 2024 10:42
@Al2Klimov Al2Klimov self-assigned this Feb 7, 2024
@cla-bot cla-bot bot added the cla/signed label Feb 7, 2024
@icinga-probot icinga-probot bot added area/api REST API area/distributed Distributed monitoring (master, satellites, clients) core/quality Improve code, libraries, algorithms, inline docs ref/IP labels Feb 7, 2024
@Al2Klimov
Member Author

Low load test

[2024-02-07 12:16:44 +0100] information/ApiListener: New client connection from [::ffff:127.0.0.1]:51732 (no client certificate)
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12
[2024-02-07 12:16:44 +0100] information/HttpServerConnection: Request: GET /v1/objects/services/3d722963ae43!4272 (from [::ffff:127.0.0.1]:51732), user: root, agent: curl/8.4.0, status: Not Found).
[2024-02-07 12:16:44 +0100] information/HttpServerConnection: HTTP client disconnected (from [::ffff:127.0.0.1]:51732)
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Using one free slot, free: 12 => 11
[2024-02-07 12:16:44 +0100] information/CpuBoundWork: Releasing one used slot, free: 11 => 12

Just a few increments/decrements. 👍

@Al2Klimov
Member Author

High load test

If I literally DoS Icinga with https://github.com/Al2Klimov/i2all.tf/tree/master/i2dos, I get a few of these:

[2024-02-07 12:19:37 +0100] information/CpuBoundWork: Handing over one used slot, free: 0 => 0

After I stop that program and fire one curl as in my low load test above, I get the same picture: still 12 free slots. 👍

Logs

--- lib/base/io-engine.cpp
+++ lib/base/io-engine.cpp
@@ -24,6 +24,7 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc)
        std::unique_lock<std::mutex> lock (sem.Mutex);

        if (sem.FreeSlots) {
+               Log(LogInformation, "CpuBoundWork") << "Using one free slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots - 1u;
                --sem.FreeSlots;
                return;
        }
@@ -32,7 +33,9 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc)

        sem.Waiting.emplace(&cv);
        lock.unlock();
+       Log(LogInformation, "CpuBoundWork") << "Waiting...";
        cv.Wait(yc);
+       Log(LogInformation, "CpuBoundWork") << "Waited!";
 }

 void CpuBoundWork::Done()
@@ -42,8 +45,10 @@ void CpuBoundWork::Done()
                std::unique_lock<std::mutex> lock (sem.Mutex);

                if (sem.Waiting.empty()) {
+                       Log(LogInformation, "CpuBoundWork") << "Releasing one used slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots + 1u;
                        ++sem.FreeSlots;
                } else {
+                       Log(LogInformation, "CpuBoundWork") << "Handing over one used slot, free: " << sem.FreeSlots << " => " << sem.FreeSlots;
                        sem.Waiting.front()->Set();
                        sem.Waiting.pop();
                }

@Al2Klimov Al2Klimov removed their assignment Feb 7, 2024
@Al2Klimov Al2Klimov marked this pull request as ready for review February 7, 2024 11:24
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from c11989f to 8d24525 Compare February 9, 2024 13:03
@julianbrost
Contributor

Is the way boost::asio::deadline_timer is used by AsioConditionVariable here actually safe? Its documentation says that shared objects are not thread-safe.

@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 8d24525 to bf74280 Compare February 20, 2024 12:21
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from bf74280 to 9062934 Compare February 21, 2024 11:13
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 9062934 to a00262f Compare February 22, 2024 09:58
@Al2Klimov Al2Klimov added this to the 2.15.0 milestone Apr 5, 2024
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from a00262f to 26ef66e Compare September 27, 2024 09:51
@Al2Klimov
Member Author

In addition, v2.14.2 could theoretically misbehave once the free slot amount falls "temporarily" noticeably below zero. Like, three requestors achieve an ioEngine.m_CpuBoundSemaphore.fetch_sub(1) while it's zero (0 - 3 x 1 = -3). Now, requestor A realizes that it's not allowed to take that slot and adds 1 again (-2). And so does requestor B (-1). But A subtracts again (-2) before C also adds 1 again (-1). And so on.

https://github.com/Icinga/icinga2/blob/v2.14.2/lib/base/io-engine.cpp#L24-L31

So that spinlock blocks not only CPU time, but also slots from legit requestors. The father of all spinlocks, so to say. 🙈 #10117 (comment)
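
The interleaving described above can be replayed deterministically on a single thread. The sketch below assumes the old acquire attempt is a `fetch_sub(1)` followed by a `fetch_add(1)` on failure, as described in this comment; `OptimisticTake()`, `UndoTake()` and `ReplayInterleaving()` are hypothetical helper names, not functions from Icinga 2.

```cpp
#include <atomic>
#include <cassert>

// Assumed shape of the old acquire attempt, split into its two steps so the
// interleaving can be replayed deterministically on one thread.
inline bool OptimisticTake(std::atomic<int>& sem)
{
	return sem.fetch_sub(1) >= 1; // true only if a slot was really free
}

inline void UndoTake(std::atomic<int>& sem)
{
	sem.fetch_add(1);
}

struct ReplayResult
{
	bool dGotSlot;  // did the late requestor D get the released slot?
	int finalValue; // counter after everyone has undone their failed take
};

inline ReplayResult ReplayInterleaving()
{
	std::atomic<int> sem (0);  // all slots in use

	OptimisticTake(sem);       // A fails: 0 -> -1
	OptimisticTake(sem);       // B fails: -1 -> -2
	OptimisticTake(sem);       // C fails: -2 -> -3
	UndoTake(sem);             // A: -3 -> -2
	UndoTake(sem);             // B: -2 -> -1
	OptimisticTake(sem);       // A retries and fails: -1 -> -2
	UndoTake(sem);             // C: -2 -> -1

	sem.fetch_add(1);          // a slot holder finishes: -1 -> 0

	bool dGotSlot = OptimisticTake(sem); // D sees 0 and fails: 0 -> -1
	UndoTake(sem);             // D: -1 -> 0
	UndoTake(sem);             // A: 0 -> 1, only now the slot is visible

	return {dGotSlot, sem.load()};
}
```

In this replay a slot is released while A's failed take is still pending, so requestor D observes 0 and backs off although a slot is actually free.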

@@ -105,7 +106,7 @@ bool EventsHandler::HandleRequest(
response.result(http::status::ok);
response.set(http::field::content_type, "application/json");

IoBoundWorkSlot dontLockTheIoThread (yc);
handlingRequest.Done();
Member Author

@Al2Klimov Al2Klimov Nov 28, 2024

I've figured out and pushed 75271fe how to get rid of this and the 38 changed files.

$ git diff --stat 74009f0fc..a663f98b2
 lib/base/io-engine.cpp              | 77 +++++++++++++++++++++++++++++++++++++++--------------------------------------
 lib/base/io-engine.hpp              | 64 +++++++++++++++++++++++++++++++++-------------------------------
 lib/remote/eventshandler.cpp        |  2 --
 lib/remote/httpserverconnection.cpp | 19 +++++++++++++------
 lib/remote/httpserverconnection.hpp |  2 ++
 lib/remote/jsonrpcconnection.cpp    |  2 +-
 6 files changed, 88 insertions(+), 78 deletions(-)
$

Please let me know whether you consider this better.

Member Author

Contributor

I've figured out and pushed 75271fe how to get rid of this and the 38 changed files.

424e1bc (#9990) is an annoying commit, but not something that would have to block this PR.

On the change suggested in 75271fe: that sounds like moving into the direction I suggested in #10142:

Related: when doing bigger changes to the interface there, one other improvement that comes to mind is how HttpServerConnection::StartStreaming() works: currently, to take control over the whole connection, this has to be called, but the underlying ASIO stream is still passed to every handler but it must not be used without calling StartStreaming(), otherwise, there's a good chance the connection ends up in a broken state. This could be improved by only exposing the underlying stream as a return value of the StartStreaming() method, similar to how it works in Go's net/http package.

So if the point of StartStreaming() is to transfer ownership of the connection to the caller, for me it makes sense to release other resources related to the connection like the CpuBoundWork.

Thus, overall I'd say this is a sane change for StartStreaming(), so feel free to keep it in (but the commit history should be cleaned up, that doesn't need that revert commit in there).
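
A minimal sketch of the interface idea discussed above, under stated assumptions: `Connection`, `Stream` and `HandleEvents()` are hypothetical stand-ins rather than the real `HttpServerConnection` API, and the slot is reduced to a boolean flag. It only illustrates that the stream is reachable solely through `StartStreaming()` and that taking over the connection gives up the CPU-bound slot.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the underlying ASIO stream.
struct Stream
{
	std::string Sent;
	void Write(const std::string& data) { Sent += data; }
};

// Hypothetical connection: handlers never see m_Stream directly.
class Connection
{
public:
	// Transfers control of the connection to the caller. The CPU-bound
	// slot is given up at the same moment, since long-lived streaming
	// should not occupy one.
	Stream& StartStreaming()
	{
		m_HoldsCpuSlot = false;
		m_Streaming = true;
		return m_Stream;
	}

	bool HoldsCpuSlot() const { return m_HoldsCpuSlot; }
	bool IsStreaming() const { return m_Streaming; }

private:
	Stream m_Stream;
	bool m_HoldsCpuSlot = true; // assumed: acquired before the handler runs
	bool m_Streaming = false;
};

// Example handler in the style of /v1/events: it can only write after
// explicitly taking over the connection. The payload is illustrative.
inline void HandleEvents(Connection& conn)
{
	Stream& stream = conn.StartStreaming();
	stream.Write("{\"type\":\"CheckResult\"}\n");
}
```

Because a handler has no other way to reach the stream, it cannot write to the connection without having called `StartStreaming()` first.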

}
try {
cv->Wait(yc);
} catch (...) {
Member Author

Unfortunately, yes – this can throw.

--- test/base-shellescape.cpp
+++ test/base-shellescape.cpp
@@ -1,6 +1,8 @@
 /* Icinga 2 | (c) 2012 Icinga GmbH | GPLv2+ */

 #include "base/utility.hpp"
+#include <boost/asio.hpp>
+#include <boost/asio/spawn.hpp>
 #include <BoostTestTargetConfig.h>
 #include <iostream>

@@ -16,6 +18,26 @@ BOOST_AUTO_TEST_CASE(escape_basic)
        BOOST_CHECK(Utility::EscapeShellCmd("$PATH") == "\\$PATH");
        BOOST_CHECK(Utility::EscapeShellCmd("\\$PATH") == "\\\\\\$PATH");
 #endif /* _WIN32 */
+
+       auto io = new boost::asio::io_context;
+       boost::asio::spawn(*io, [io](boost::asio::yield_context yc) {
+               boost::asio::deadline_timer timer(*io, boost::posix_time::seconds(3));
+               boost::system::error_code ec;
+
+               try {
+                       timer.async_wait(yc[ec]);
+               } catch (const boost::coroutines::detail::forced_unwind&) {
+                       BOOST_CHECK(false); // error: in "base_shellescape/escape_basic": check false has failed
+                       throw;
+               }
+       });
+       boost::asio::deadline_timer timer(*io, boost::posix_time::seconds(1));
+       timer.async_wait([io](boost::system::error_code ec) {
+               if (!ec) {
+                       delete io;
+               }
+       });
+       io->run();
 }

 BOOST_AUTO_TEST_CASE(escape_quoted)

Member

Unfortunately, yes – this can throw.

You can't be serious :), you deliberately deleted the I/O object with delete io;, what do you expect to happen in that case then? This is not a realistic test case, I mean where do we perform a questionable operation like this in Icinga 2 code? @julianbrost and I tried to trigger this exception last week but weren't able to, and reading the detailed 🙃 boost docs about it didn't help to understand this either.

If for some reason the I/O context gets deleted in Icinga 2, do you think you can ever recover from it? If something like this happens in Icinga 2, then you have far more severe problems than unreleased CPU semaphores.

Member Author

If something like this happens in Icinga 2, then you have far more severe problems than unreleased CPU semaphores.

So you'd not catch it at all?

Member

So you'd not catch it at all?

No, I didn't say that! I just want to understand under what normal circumstances such an exception would be triggered and not by deleting the global io object. For instance, how can you explicitly trigger a stack unwinding of a coroutine? If one can force the destruction of a coroutine, then you can also verify that it enters into that new catch-all handler as intended, but none of us is able to do that, and that is the puzzle that needs to be solved.

Member Author

the puzzle that needs to be solved

Does it need to be resolved? To me, the bare lack of noexcept in async_wait is enough. Yes, I'm serious! If something could be thrown, I catch it and clean up after myself. As I said to Julian, if you want to test it, add a throw 1;. The most often used function that theoretically can throw an exception is operator new, but I don't know yet how to temporarily override malloc(3) locally. But I think I don't even need that, as #9990 (comment) already shows enough. (Also, I didn't test what happens across fork(2), I hope on_before_fork() (or whatever it's called in ASIO) preserves coroutine stacks.)

Member

Though if even deleting the io object triggers this exception (also tested with IoEngine::SpawnCoroutine()), I guess that indicates that if the coroutine gets destroyed for whatever reason, we'll be able to intercept it and handle it accordingly.

Member

@yhabteab yhabteab Nov 29, 2024

Didn't see your intermediate comment above while sending my previous comment!

As I said to Julian, if you want to test it, add a throw 1;.

Where should I put that throw expression? Throwing an exception within the coroutine handler (the provided callback that is called within the coroutine) does not cause the coroutine to be destroyed.

Does it need to be resolved? To me, the bare lack of noexcept in async_wait is enough

If you don't want to make decisions just based on assumptions, then yes you need to understand when this exception could be triggered. I'm not talking about just any exception, but the specific forced_unwind exception. As I said before, if the user-supplied callback throws an exception, it will never hit that new catch-all handler here, nor will it destroy the coroutine.

Member Author

As I said to Julian, if you want to test it, add a throw 1;.

Where should I put that throw expression?

Just inside the try catch instead of cv->Wait(yc);. Seriously. Actually I don't worry especially about a particular exception type. I just easily got forced_unwind in my comment above. Who knows, maybe a malloc(3) fails? It's just: theoretically, the method could throw – I handle it. Especially in this case the code above in CpuBoundWork#CpuBoundWork() has already deployed some pointers to stack variables to IoEngine#m_CpuBoundSemaphore.Waiting. If our super unlikely exception hits someone w/o my try/catch once in 10y, happy debugging!
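
The concern about pointers to stack variables can be sketched as follows. This is a simplified model, not the PR's code: `Waiter`, `WaitQueue` and `AcquireOrWait()` are hypothetical names, and the `wait` callback stands in for `cv->Wait(yc)`. On normal return the notifier is assumed to have already removed the waiter from the queue.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <functional>
#include <stdexcept>

// Hypothetical waiter living on the caller's stack.
struct Waiter
{
	bool Notified = false;
};

using WaitQueue = std::deque<Waiter*>;

// Registers a pointer to a stack variable, then waits. If the wait throws,
// the pointer is removed before the stack frame (and the waiter) is gone,
// so that the queue never holds a dangling pointer.
inline void AcquireOrWait(WaitQueue& waiting, const std::function<void(Waiter&)>& wait)
{
	Waiter self;
	waiting.push_back(&self);

	try {
		wait(self); // stands in for cv->Wait(yc)
	} catch (...) {
		waiting.erase(std::remove(waiting.begin(), waiting.end(), &self), waiting.end());
		throw; // clean up only, never swallow the exception
	}
}
```

Replacing the wait with something that throws, as suggested above with `throw 1;`, exercises the catch-all path regardless of the exception type.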

Member Author

Now "better", colleagues?

--- a/lib/base/io-engine.cpp
+++ b/lib/base/io-engine.cpp
@@ -30,2 +30,5 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc, boost::asio::io_contex
                        if (ie.m_CpuBoundSemaphore.compare_exchange_weak(freeSlots, freeSlots - 1)) {
+                               if (cv) {
+                                       Log(LogCritical, "LOLCAT", "Got slot via ASIO!");
+                               }
                                return;
@@ -36,3 +39,3 @@ CpuBoundWork::CpuBoundWork(boost::asio::yield_context yc, boost::asio::io_contex
                        cv = Shared<AsioConditionVariable>::Make(ie.GetIoContext());
-                       continue;
+                       //continue;
                }
[2024-12-05 17:37:35 +0100] critical/LOLCAT: Got slot via ASIO!
[2024-12-05 17:37:35 +0100] critical/LOLCAT: Got slot via ASIO!
[2024-12-05 17:37:35 +0100] critical/LOLCAT: Got slot via ASIO!

Member Author

I mean, it works well, a DoS even triggers the above logs, but the fair scheduling is unsurprisingly gone. :(

Also, FWIW, I gave up on df63a78 (boost::asio::async_result).

@Al2Klimov Al2Klimov requested review from yhabteab and oxzi November 29, 2024 09:19
@@ -329,7 +331,8 @@ bool EnsureValidBody(
ApiUser::Ptr& authenticatedUser,
boost::beast::http::response<boost::beast::http::string_body>& response,
bool& shuttingDown,
-boost::asio::yield_context& yc
+boost::asio::yield_context& yc,
+boost::asio::io_context::strand& strand
Contributor

The strand parameter is added but never used?

Comment on lines 93 to 90
// But even if it happens before (and gets lost), it's not a problem because now the ongoing subscriber
// will lock the mutex and re-check the semaphore which is already >0 (fast path) due to fetch_add().
Contributor

What I wrote in the quote block is supposed to describe everything that can happen inside if (ie.m_CpuBoundSemaphore.fetch_add(1) < 1) (in combination with others trying to acquire a slot).

You mean, in short, if multiple ones try to claim the last remaining slot

Not just the last remaining slot, it's also about the interactions that happen when multiple ones try to claim the only slot that is currently in the process of being released.
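
The interaction between the atomic fast path and the notification can be modelled with the standard library alone. This is a sketch under assumptions, not the PR's implementation: `AtomicSlots` is a hypothetical class, and `std::condition_variable` replaces `AsioConditionVariable`. The releaser notifies only if the previous value was below 1, and a waiter always re-checks the counter while holding the mutex, so a notification sent before the waiter goes to sleep is not a problem.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

class AtomicSlots
{
public:
	explicit AtomicSlots(int slots) : m_Free(slots) {}

	// Fast path: lock-free CAS loop, never takes the counter below zero.
	bool TryAcquire()
	{
		int freeSlots = m_Free.load();

		while (freeSlots > 0) {
			// On failure freeSlots is reloaded and the loop re-checks it.
			if (m_Free.compare_exchange_weak(freeSlots, freeSlots - 1)) {
				return true;
			}
		}

		return false;
	}

	// Slow path: re-check under the mutex, then sleep until notified.
	void Acquire()
	{
		if (TryAcquire()) {
			return;
		}

		std::unique_lock<std::mutex> lock (m_Mutex);
		m_Wakeup.wait(lock, [this] { return TryAcquire(); });
	}

	// Returns whether waiters had to be notified (previous value < 1).
	bool Release()
	{
		if (m_Free.fetch_add(1) < 1) {
			// Taking the mutex orders this after a waiter's re-check.
			std::unique_lock<std::mutex> lock (m_Mutex);
			m_Wakeup.notify_all();
			return true;
		}

		return false;
	}

	int Free() const { return m_Free.load(); }

private:
	std::atomic<int> m_Free;
	std::mutex m_Mutex;
	std::condition_variable m_Wakeup;
};
```

Unlike a FIFO queue of waiters, this variant gives no ordering guarantee: whoever wins the compare-and-swap gets the slot.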

@Al2Klimov Al2Klimov requested a review from julianbrost January 15, 2025 11:14
Contributor

@julianbrost julianbrost left a comment

As you've asked for an explicit review: Please update code comments (#9990 (comment)), address the comments from the previous review (#9990 (review)). Apart from that, I also agree with the comments in Yonas' review (#9990 (review)).

@Al2Klimov Al2Klimov removed this from the 2.15.0 milestone Jan 21, 2025
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 36edc5a to 824211b Compare January 24, 2025 17:01
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 824211b to 8e73397 Compare January 27, 2025 15:31
@Al2Klimov

This comment was marked as resolved.

@Al2Klimov
Member Author

I'm done with all of that (I think).

@Al2Klimov Al2Klimov added this to the 2.15.0 milestone Mar 18, 2025
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 8e73397 to df4f626 Compare April 30, 2025 15:33
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch 2 times, most recently from 65b9526 to 1cccf84 Compare May 14, 2025 13:35
yhabteab
yhabteab previously approved these changes May 15, 2025
Member

@yhabteab yhabteab left a comment

Apart from the horrible interface design of the HttpServerConnection class, looks fine to me now! Tested it with this script!

$ cat post-checkresults.sh
#!/bin/zsh
for x in $(seq "$1"); do
	curl -sSk \
		-u root:icinga \
		-o /dev/null \
		-H 'Content-Type: application/json' \
		-H 'Accept: application/json' \
		-X POST 'https://localhost:5667/v1/actions/process-check-result' -d@- <<EOF &
{
	"type": "Service",
	"filter": "host.name==\"big-switch-server-155\" || host.name==\"big-switch-server-156\"",
	"exit_status": $(( RANDOM % 4 )),
	"plugin_output": "$(( RANDOM ))",
	"pretty": true
}
EOF
done

wait $(jobs -p)
---
$ while :; do ./post-checkresults.sh 100; printf .; done

Al2Klimov added 8 commits July 8, 2025 17:25
so that /v1/events doesn't have to use IoBoundWorkSlot. IoBoundWorkSlot#~IoBoundWorkSlot() will wait for a free semaphore slot which will be almost immediately released by CpuBoundWork#~CpuBoundWork(). Just releasing the already acquired slot in HttpServerConnection#StartStreaming() is more efficient.
This is inefficient and involves possible bad surprises regarding waiting durations on busy nodes. Instead, use AsioConditionVariable#Wait() if there are no free slots. It's notified by others' CpuBoundWork#~CpuBoundWork() once finished.
@Al2Klimov Al2Klimov force-pushed the re-think-cpuboundwork-implementation-9988 branch from 1cccf84 to e4b73f3 Compare July 8, 2025 15:55
Labels
area/api REST API area/distributed Distributed monitoring (master, satellites, clients) cla/signed core/quality Improve code, libraries, algorithms, inline docs ref/IP
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Re-think CpuBoundWork implementation and usage
3 participants