
Conversation

@michaeltandy commented Nov 25, 2025

Description

Fixes #1452

This one-line change should improve the performance of MultiThreadedExecutor.

  • _wait_for_ready_callbacks yields, among other things, tasks that are in progress, via the condition if (not task.executing() and not task.done() and ...).
  • In a SingleThreadedExecutor this logic works fine, because yielded tasks are always completed before the next call to _wait_for_ready_callbacks.
  • But in a MultiThreadedExecutor, yielded tasks are merely submitted to a thread pool before the next call to _wait_for_ready_callbacks; there's no guarantee they'll have started executing. A task that's still waiting in the thread pool's queue can be returned and enqueued again.
  • What's more, as long as there's a task that hasn't started executing yet, _wait_for_ready_callbacks returns quickly rather than waiting.
  • However, because task.py's __call__ method includes the check if (not self._pending() or self._executing): return, the task doesn't actually run more than once; each duplicate enqueue just hits that check and returns.
  • The problem seems to be worse than you'd expect by chance; I speculate that as long as the spin() thread holds the GIL, the thread pool's workers can't even begin to make progress. (See the sketch after this list for where the proposed yield lands.)
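
For reference, here is a rough sketch of where the one-line change lands. It paraphrases the tail of MultiThreadedExecutor._spin_once_impl (the same region quoted in the review below); wait_for_ready_callbacks, pool and futures are stand-ins, and locking and error handling are omitted:

import time
from concurrent.futures import ThreadPoolExecutor

def spin_once_sketch(wait_for_ready_callbacks, pool: ThreadPoolExecutor, futures: list) -> None:
    # Paraphrased, heavily simplified sketch; not the real rclpy code.
    # Ask the wait loop for a ready handler and hand it to the thread pool.
    handler, entity, node = wait_for_ready_callbacks()
    futures.append(pool.submit(handler))
    # Reap completed handlers, re-raising any exceptions they produced.
    for future in list(futures):
        if future.done():
            futures.remove(future)
            future.result()  # raise any exceptions
    # The one-line change: give up the GIL here so a worker thread can actually
    # start the handler before the spin thread loops straight back into
    # _wait_for_ready_callbacks and re-yields the same still-queued task.
    time.sleep(0)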

Running Zachary's example from #1452 yields the following difference:

Before

root@54d707e498b4:~/loopback# time timeout 30s python3 bug1452-clocksubscriber.py
[INFO] [1764116020.181174240] [clock_sub_with_timer]: Node started with multithreaded executor
[INFO] [1764116020.654835033] [clock_sub_with_timer]: [clock] freq avg: 10.34 Hz | period avg: 96.7 ms ± 60.9 ms
[INFO] [1764116020.671057864] [clock_sub_with_timer]: [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 5.2 ms
[INFO] [1764116025.740914997] [clock_sub_with_timer]: [clock] freq avg: 11.21 Hz | period avg: 89.2 ms ± 99.1 ms
[INFO] [1764116025.742426391] [clock_sub_with_timer]: [timer] freq avg: 5.98 Hz | period avg: 167.2 ms ± 97.2 ms
[INFO] [1764116030.780169855] [clock_sub_with_timer]: [clock] freq avg: 12.34 Hz | period avg: 81.1 ms ± 159.8 ms
[INFO] [1764116030.781428037] [clock_sub_with_timer]: [timer] freq avg: 3.56 Hz | period avg: 281.1 ms ± 284.3 ms
[INFO] [1764116035.954285504] [clock_sub_with_timer]: [clock] freq avg: 6.63 Hz | period avg: 150.9 ms ± 224.4 ms
[INFO] [1764116035.957947068] [clock_sub_with_timer]: [timer] freq avg: 4.11 Hz | period avg: 243.1 ms ± 200.5 ms
[INFO] [1764116041.241116952] [clock_sub_with_timer]: [clock] freq avg: 7.62 Hz | period avg: 131.2 ms ± 196.3 ms
[INFO] [1764116041.242946988] [clock_sub_with_timer]: [timer] freq avg: 3.67 Hz | period avg: 272.8 ms ± 287.1 ms
[INFO] [1764116046.396494238] [clock_sub_with_timer]: [clock] freq avg: 6.29 Hz | period avg: 158.9 ms ± 176.0 ms
[INFO] [1764116046.398510055] [clock_sub_with_timer]: [timer] freq avg: 5.19 Hz | period avg: 192.8 ms ± 143.3 ms

real	0m30.127s
user	0m22.503s
sys	0m0.486s

22.503s/30.127s = 74.7% of a CPU core used, on average.

After

root@54d707e498b4:~/loopback# time timeout 30s python3 bug1452-clocksubscriber.py
[INFO] [1764116169.923305987] [clock_sub_with_timer]: Node started with multithreaded executor
[INFO] [1764116170.344815876] [clock_sub_with_timer]: [clock] freq avg: 11.57 Hz | period avg: 86.4 ms ± 60.4 ms
[INFO] [1764116170.412870650] [clock_sub_with_timer]: [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 0.4 ms
[INFO] [1764116175.444432305] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [1764116175.512819612] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [1764116180.444488508] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [1764116180.513344804] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [1764116185.444916813] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [1764116185.612434555] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [1764116190.544733712] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [1764116190.612827796] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [1764116195.613330672] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [1764116195.644247121] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms

real	0m30.127s
user	0m0.842s
sys	0m0.184s

0.842s/30.127s = 2.8% of a CPU core used, on average.

Is this a user-facing behavior change?

No change.

Did you use Generative AI?

No generative AI was used.

Additional Information

Here's a diff adding the logging that revealed this issue to me:

diff --git a/rclpy/rclpy/executors.py b/rclpy/rclpy/executors.py
index 7d78e4e..00ae330 100644
--- a/rclpy/rclpy/executors.py
+++ b/rclpy/rclpy/executors.py
@@ -717,6 +717,7 @@ class Executor(ContextManager['Executor']):
                     if (not task.executing() and not task.done() and
                             (node is None or node in nodes_to_use)):
                         yielded_work = True
+                        print(f"Yielding queued task {task}")
                         yield task, entity, node
                 with self._tasks_lock:
                     # Get rid of any tasks that are done
@@ -877,6 +878,7 @@ class Executor(ContextManager['Executor']):
                         if sub.callback_group.can_execute(sub):
                             handler = self._make_handler(sub, node, self._take_subscription)
                             yielded_work = True
+                            print(f"Yielding subscription {handler}")
                             yield handler, sub, node
 
                 for gc in node.guards:
@@ -1065,6 +1067,7 @@ class MultiThreadedExecutor(Executor):
                     if future.done():
                         self._futures.remove(future)
                         future.result()  # raise any exceptions
+            print(f"Executor length {self._executor._work_queue.qsize()}")
 
     def spin_once(self, timeout_sec: Optional[float] = None) -> None:
         # Mark executor as spinning to prevent concurrent spins

As a test, I ran the basic publisher and subscriber from the documentation, but with the publisher using timer_period = 0.02 and the subscriber using a MultiThreadedExecutor:

def main(args=None):
    rclpy.init(args=args)
    minimal_publisher = MinimalPublisher()
    rclpy.spin(minimal_publisher)
    minimal_publisher.destroy_node()
    rclpy.shutdown()
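
The subscriber side isn't shown above; a minimal sketch of a subscriber main using a MultiThreadedExecutor, assuming the MinimalSubscriber class from the same tutorial, would look like this:

import rclpy
from rclpy.executors import MultiThreadedExecutor

def main(args=None):
    rclpy.init(args=args)
    # MinimalSubscriber is assumed to be the subscriber node from the tutorial.
    minimal_subscriber = MinimalSubscriber()
    executor = MultiThreadedExecutor()
    executor.add_node(minimal_subscriber)
    try:
        executor.spin()
    finally:
        executor.shutdown()
        minimal_subscriber.destroy_node()
        rclpy.shutdown()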

Here's an excerpt of the logs, before the fix is applied:

Yielding subscription <rclpy.task.Task object at 0x705366eee990>
Executor length 1
I heard: "Hello World: 1327"
Yielding subscription <rclpy.task.Task object at 0x705366f3e9c0>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x705366f3e9c0>
I heard: "Hello World: 1328"
Executor length 1
Yielding subscription <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 2
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 3
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 4
I heard: "Hello World: 1329"
Yielding subscription <rclpy.task.Task object at 0x70536c6c2420>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x70536c6c2420>
Executor length 2
Yielding queued task <rclpy.task.Task object at 0x70536c6c2420>
Executor length 3
Yielding queued task <rclpy.task.Task object at 0x70536c6c2420>
Executor length 4
Yielding queued task <rclpy.task.Task object at 0x70536c6c2420>
Executor length 5
I heard: "Hello World: 1330"
Yielding subscription <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 2
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 3
I heard: "Hello World: 1331"
Yielding subscription <rclpy.task.Task object at 0x705366e26f90>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 2
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 3
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 4
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 5
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 6
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 7
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 8
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 9
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 10
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 11
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 12
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
I heard: "Hello World: 1332"
Executor length 1
Yielding subscription <rclpy.task.Task object at 0x705366eee990>
Executor length 1

@michaeltandy force-pushed the MultiThreadedExecutor-performance-bug-fix branch from 1e061ef to 6b15da7 on November 25, 2025 at 23:55
@michaeltandy force-pushed the MultiThreadedExecutor-performance-bug-fix branch from 6b15da7 to 0fb9834 on November 25, 2025 at 23:56
@wjwwood (Member) commented Dec 5, 2025

From issue triage meeting: I put this on the agenda for the next Client Library Working Group meeting.

@LSchi-nexat commented:

We had to use 0.01 instead of 0 for this fix to work.

@wjwwood
This is extremely important. It reduces the CPU load of some of our nodes by a factor of 10!
It also applies to the single-threaded executor.

Not sure what the underlying issue is, but the CPU performance boost is incredible.

@michaeltandy (Author) commented:

@LSchi-nexat Interesting, glad it's worked for you! For me, the sleep of zero worked OK. Can you tell me how many topics you're subscribed to, the message rates of the various topics, and whether you're doing any CPU-intensive processing of any of the messages?

Using 0.01 (i.e. 10 milliseconds) of course means any messages arriving faster than 100 Hz might get dropped. So if you've got 1000 Hz topics, some of your CPU saving might be coming from simply not processing a bunch of messages.

@mjcarroll (Member) commented:

Discussion from the Client Library Working Group.

Based on research during the meeting, we are fairly sure that @michaeltandy's analysis is correct. Since this PR was opened, we have also landed some updates to the logic in the Executor base class (#1469) which could potentially have an impact on this behavior.

It may be that this thread yield is still needed, but before we land it, we are going to check whether #1469 has any impact.

Actions:

  • I was unable to reproduce the performance issue on my system, so @michaeltandy is going to try to reproduce it on his system by building rclpy from source.
  • I will release rclpy so that we can also test the binaries.

@michaeltandy (Author) commented:

I've tested, and @mjcarroll is right: the current rolling branch has resolved the yield-the-same-task-repeatedly bug in 9695271e, providing a big boost in performance. However, there's still a further (modest) performance improvement available from adding the time.sleep(0).

As of 10.0.2

[INFO] Node started with multithreaded executor
[INFO] [clock] freq avg: 10.34 Hz | period avg: 96.7 ms ± 60.9 ms
[INFO] [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 5.2 ms
[INFO] [clock] freq avg: 11.21 Hz | period avg: 89.2 ms ± 99.1 ms
[INFO] [timer] freq avg: 5.98 Hz | period avg: 167.2 ms ± 97.2 ms
[INFO] [clock] freq avg: 12.34 Hz | period avg: 81.1 ms ± 159.8 ms
[INFO] [timer] freq avg: 3.56 Hz | period avg: 281.1 ms ± 284.3 ms
[INFO] [clock] freq avg: 6.63 Hz | period avg: 150.9 ms ± 224.4 ms
[INFO] [timer] freq avg: 4.11 Hz | period avg: 243.1 ms ± 200.5 ms
[INFO] [clock] freq avg: 7.62 Hz | period avg: 131.2 ms ± 196.3 ms
[INFO] [timer] freq avg: 3.67 Hz | period avg: 272.8 ms ± 287.1 ms
[INFO] [clock] freq avg: 6.29 Hz | period avg: 158.9 ms ± 176.0 ms
[INFO] [timer] freq avg: 5.19 Hz | period avg: 192.8 ms ± 143.3 ms

Mostly fixed by 9695271e

$ /usr/bin/time timeout 30s python3 bug1452-clocksubscriber.py
[INFO] Node started with multithreaded executor
[INFO] [clock] freq avg: 11.12 Hz | period avg: 89.9 ms ± 64.5 ms
[INFO] [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 0.3 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [clock] freq avg: 9.98 Hz | period avg: 100.2 ms ± 5.5 ms
[INFO] [clock] freq avg: 9.93 Hz | period avg: 100.7 ms ± 3.1 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [clock] freq avg: 10.04 Hz | period avg: 99.6 ms ± 14.7 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [clock] freq avg: 9.91 Hz | period avg: 100.9 ms ± 5.5 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 0.3 ms
[INFO] [clock] freq avg: 10.17 Hz | period avg: 98.4 ms ± 12.6 ms
Command exited with non-zero status 124
2.06user 0.29system 0:30.12elapsed 7%CPU (0avgtext+0avgdata 69504maxresident)k
0inputs+32outputs (0major+9436minor)pagefaults 0swaps

And slightly better again with the addition of the sleep(0):

$ /usr/bin/time timeout 30s python3 bug1452-clocksubscriber.py
[INFO] Node started with multithreaded executor
[INFO] [clock] freq avg: 11.88 Hz | period avg: 84.1 ms ± 61.2 ms
[INFO] [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 0.5 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
Command exited with non-zero status 124
0.84user 0.29system 0:30.12elapsed 3%CPU (0avgtext+0avgdata 69120maxresident)k
0inputs+32outputs (0major+9438minor)pagefaults 0swaps

Still an improvement, but not the three orders of magnitude I claimed in the pull request.

@fujitatomoya (Collaborator) left a comment

@michaeltandy thanks a lot for sharing the information. I believe your analysis is correct, and this PR fixes that problematic situation in the MultiThreadedExecutor.

future.result() # raise any exceptions

# Yield GIL so executor threads have a chance to run.
time.sleep(0)

How about this? It can be more efficient: it directly tells the OS scheduler to yield the CPU to another thread, without any sleep/timer overhead, and it communicates the intent (yield the CPU) more clearly than sleeping.

Suggested change:

-    time.sleep(0)
+    os.sched_yield() if hasattr(os, 'sched_yield') else time.sleep(0)
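
For illustration only, here is a self-contained version of the suggested fallback pattern; _yield_to_worker_threads is a hypothetical helper name (not part of rclpy), and the hasattr guard is needed because os.sched_yield is only available on Unix-like platforms:

import os
import time

def _yield_to_worker_threads() -> None:
    # Hypothetical helper: let another thread get a turn on the CPU.
    if hasattr(os, 'sched_yield'):
        # Ask the OS scheduler to move this thread to the back of its run queue.
        os.sched_yield()
    else:
        # Fallback: a zero-length sleep, which still releases the GIL.
        time.sleep(0)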
