
Conversation

@michaeltandy commented Nov 25, 2025

Description

Fixes #1452

This one-line change should improve the performance of MultiThreadedExecutor.

  • _wait_for_ready_callbacks yields, among other things, tasks that are in progress, via the condition if (not task.executing() and not task.done() and ...).
  • In a SingleThreadedExecutor this logic works fine, because yielded tasks are always completed before the next call to _wait_for_ready_callbacks.
  • But in a MultiThreadedExecutor, yielded tasks are merely submitted to a thread pool before the next call to _wait_for_ready_callbacks; there's no guarantee they'll have started executing. A task that's still waiting in the thread pool's queue can be returned and enqueued again.
  • What's more, as long as there's a task that hasn't started executing yet, _wait_for_ready_callbacks returns quickly rather than waiting.
  • However, because task.py's __call__ method includes the check if (not self._pending() or self._executing): return, the task doesn't actually run more than once; each duplicate enqueue just hits that check and returns.
  • The problem seems to be worse than you'd expect by chance; I speculate that as long as the spin() thread holds the GIL, the thread pool's workers can't even begin to make progress. (See the sketch after this list for where the proposed yield lands.)
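
For reference, here is a rough sketch of where the one-line change lands. It paraphrases the tail of MultiThreadedExecutor._spin_once_impl (the same region quoted in the review below); wait_for_ready_callbacks, pool and futures are stand-ins, and locking and error handling are omitted:

import time
from concurrent.futures import ThreadPoolExecutor

def spin_once_sketch(wait_for_ready_callbacks, pool: ThreadPoolExecutor, futures: list) -> None:
    # Paraphrased, heavily simplified sketch; not the real rclpy code.
    # Ask the wait loop for a ready handler and hand it to the thread pool.
    handler, entity, node = wait_for_ready_callbacks()
    futures.append(pool.submit(handler))
    # Reap completed handlers, re-raising any exceptions they produced.
    for future in list(futures):
        if future.done():
            futures.remove(future)
            future.result()  # raise any exceptions
    # The one-line change: give up the GIL here so a worker thread can actually
    # start the handler before the spin thread loops straight back into
    # _wait_for_ready_callbacks and re-yields the same still-queued task.
    time.sleep(0)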

Running Zachary's example from #1452 yields the following difference:

Before

root@54d707e498b4:~/loopback# time timeout 30s python3 bug1452-clocksubscriber.py
[INFO] [1764116020.181174240] [clock_sub_with_timer]: Node started with multithreaded executor
[INFO] [1764116020.654835033] [clock_sub_with_timer]: [clock] freq avg: 10.34 Hz | period avg: 96.7 ms ± 60.9 ms
[INFO] [1764116020.671057864] [clock_sub_with_timer]: [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 5.2 ms
[INFO] [1764116025.740914997] [clock_sub_with_timer]: [clock] freq avg: 11.21 Hz | period avg: 89.2 ms ± 99.1 ms
[INFO] [1764116025.742426391] [clock_sub_with_timer]: [timer] freq avg: 5.98 Hz | period avg: 167.2 ms ± 97.2 ms
[INFO] [1764116030.780169855] [clock_sub_with_timer]: [clock] freq avg: 12.34 Hz | period avg: 81.1 ms ± 159.8 ms
[INFO] [1764116030.781428037] [clock_sub_with_timer]: [timer] freq avg: 3.56 Hz | period avg: 281.1 ms ± 284.3 ms
[INFO] [1764116035.954285504] [clock_sub_with_timer]: [clock] freq avg: 6.63 Hz | period avg: 150.9 ms ± 224.4 ms
[INFO] [1764116035.957947068] [clock_sub_with_timer]: [timer] freq avg: 4.11 Hz | period avg: 243.1 ms ± 200.5 ms
[INFO] [1764116041.241116952] [clock_sub_with_timer]: [clock] freq avg: 7.62 Hz | period avg: 131.2 ms ± 196.3 ms
[INFO] [1764116041.242946988] [clock_sub_with_timer]: [timer] freq avg: 3.67 Hz | period avg: 272.8 ms ± 287.1 ms
[INFO] [1764116046.396494238] [clock_sub_with_timer]: [clock] freq avg: 6.29 Hz | period avg: 158.9 ms ± 176.0 ms
[INFO] [1764116046.398510055] [clock_sub_with_timer]: [timer] freq avg: 5.19 Hz | period avg: 192.8 ms ± 143.3 ms

real	0m30.127s
user	0m22.503s
sys	0m0.486s

22.503s/30.127s = 74.7% of a CPU core used, on average.

After

root@54d707e498b4:~/loopback# time timeout 30s python3 bug1452-clocksubscriber.py
[INFO] [1764116169.923305987] [clock_sub_with_timer]: Node started with multithreaded executor
[INFO] [1764116170.344815876] [clock_sub_with_timer]: [clock] freq avg: 11.57 Hz | period avg: 86.4 ms ± 60.4 ms
[INFO] [1764116170.412870650] [clock_sub_with_timer]: [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 0.4 ms
[INFO] [1764116175.444432305] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [1764116175.512819612] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [1764116180.444488508] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [1764116180.513344804] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [1764116185.444916813] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [1764116185.612434555] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [1764116190.544733712] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [1764116190.612827796] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [1764116195.613330672] [clock_sub_with_timer]: [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [1764116195.644247121] [clock_sub_with_timer]: [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms

real	0m30.127s
user	0m0.842s
sys	0m0.184s

0.842s/30.127s = 2.8% of a CPU core used, on average.

Is this a user-facing behavior change?

No change.

Did you use Generative AI?

No generative AI was used.

Additional Information

Here's a diff adding the logging that revealed this issue to me:

diff --git a/rclpy/rclpy/executors.py b/rclpy/rclpy/executors.py
index 7d78e4e..00ae330 100644
--- a/rclpy/rclpy/executors.py
+++ b/rclpy/rclpy/executors.py
@@ -717,6 +717,7 @@ class Executor(ContextManager['Executor']):
                     if (not task.executing() and not task.done() and
                             (node is None or node in nodes_to_use)):
                         yielded_work = True
+                        print(f"Yielding queued task {task}")
                         yield task, entity, node
                 with self._tasks_lock:
                     # Get rid of any tasks that are done
@@ -877,6 +878,7 @@ class Executor(ContextManager['Executor']):
                         if sub.callback_group.can_execute(sub):
                             handler = self._make_handler(sub, node, self._take_subscription)
                             yielded_work = True
+                            print(f"Yielding subscription {handler}")
                             yield handler, sub, node
 
                 for gc in node.guards:
@@ -1065,6 +1067,7 @@ class MultiThreadedExecutor(Executor):
                     if future.done():
                         self._futures.remove(future)
                         future.result()  # raise any exceptions
+            print(f"Executor length {self._executor._work_queue.qsize()}")
 
     def spin_once(self, timeout_sec: Optional[float] = None) -> None:
         # Mark executor as spinning to prevent concurrent spins

As a test, I ran the basic publisher and subscriber from the documentation, but with the publisher using timer_period = 0.02 and the subscriber using a MultiThreadedExecutor:

def main(args=None):
    rclpy.init(args=args)
    minimal_publisher = MinimalPublisher()
    rclpy.spin(minimal_publisher)
    minimal_publisher.destroy_node()
    rclpy.shutdown()
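
The subscriber side isn't shown above; a minimal sketch of a subscriber main using a MultiThreadedExecutor, assuming the MinimalSubscriber class from the same tutorial, would look like this:

import rclpy
from rclpy.executors import MultiThreadedExecutor

def main(args=None):
    rclpy.init(args=args)
    # MinimalSubscriber is assumed to be the subscriber node from the tutorial.
    minimal_subscriber = MinimalSubscriber()
    executor = MultiThreadedExecutor()
    executor.add_node(minimal_subscriber)
    try:
        executor.spin()
    finally:
        executor.shutdown()
        minimal_subscriber.destroy_node()
        rclpy.shutdown()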

Here's an excerpt of the logs, before the fix is applied:

Yielding subscription <rclpy.task.Task object at 0x705366eee990>
Executor length 1
I heard: "Hello World: 1327"
Yielding subscription <rclpy.task.Task object at 0x705366f3e9c0>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x705366f3e9c0>
I heard: "Hello World: 1328"
Executor length 1
Yielding subscription <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 2
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 3
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 4
I heard: "Hello World: 1329"
Yielding subscription <rclpy.task.Task object at 0x70536c6c2420>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x70536c6c2420>
Executor length 2
Yielding queued task <rclpy.task.Task object at 0x70536c6c2420>
Executor length 3
Yielding queued task <rclpy.task.Task object at 0x70536c6c2420>
Executor length 4
Yielding queued task <rclpy.task.Task object at 0x70536c6c2420>
Executor length 5
I heard: "Hello World: 1330"
Yielding subscription <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 2
Yielding queued task <rclpy.task.Task object at 0x7053672fc8c0>
Executor length 3
I heard: "Hello World: 1331"
Yielding subscription <rclpy.task.Task object at 0x705366e26f90>
Executor length 1
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 2
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 3
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 4
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 5
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 6
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 7
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 8
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 9
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 10
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 11
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
Executor length 12
Yielding queued task <rclpy.task.Task object at 0x705366e26f90>
I heard: "Hello World: 1332"
Executor length 1
Yielding subscription <rclpy.task.Task object at 0x705366eee990>
Executor length 1

@michaeltandy force-pushed the MultiThreadedExecutor-performance-bug-fix branch from 1e061ef to 6b15da7 on November 25, 2025 at 23:55
@michaeltandy force-pushed the MultiThreadedExecutor-performance-bug-fix branch from 6b15da7 to 0fb9834 on November 25, 2025 at 23:56
@wjwwood (Member) commented Dec 5, 2025

From issue triage meeting: I put this on the agenda for the next Client Library Working Group meeting.

@LSchi-nexat commented:

We had to use 0.01 instead of 0 for this fix to work.

@wjwwood
This is extremely important. It reduces the CPU load of some of our nodes by a factor of 10!
It also applies to the single-threaded executor.

Not sure what the underlying issue is, but the CPU performance boost is incredible.

@michaeltandy (Author) commented:

@LSchi-nexat Interesting, glad it's worked for you! For me, the sleep of zero worked OK. Can you tell me how many topics you're subscribed to, the message rates of the various topics, and whether you're doing any CPU-intensive processing of any of the messages?

Using 0.01 (i.e. 10 milliseconds) of course means any messages arriving faster than 100 Hz might get dropped. So if you've got 1000 Hz topics, some of your CPU saving might be coming from simply not processing a bunch of messages.

@mjcarroll (Member) commented:

Discussion from the Client Library Working Group.

Based on research during the meeting, we are fairly sure that @michaeltandy's analysis is correct. Since this PR was opened, we have also landed some updates to the logic in the Executor base class (#1469) which could potentially have an impact on this behavior.

It may be that this thread yield is still needed, but before we land it, we are going to check whether #1469 has any impact.

Actions:

  • I was unable to reproduce the performance issue on my system, so @michaeltandy is going to try to reproduce it on his system by building rclpy from source.
  • I will release rclpy so that we can also test the binaries.

@michaeltandy (Author) commented:

I've tested, and @mjcarroll is right: the current rolling branch has resolved the yield-the-same-task-repeatedly bug in 9695271e, providing a big boost in performance. However, there's still a further (modest) performance improvement available from adding the time.sleep(0).

As of 10.0.2

[INFO] Node started with multithreaded executor
[INFO] [clock] freq avg: 10.34 Hz | period avg: 96.7 ms ± 60.9 ms
[INFO] [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 5.2 ms
[INFO] [clock] freq avg: 11.21 Hz | period avg: 89.2 ms ± 99.1 ms
[INFO] [timer] freq avg: 5.98 Hz | period avg: 167.2 ms ± 97.2 ms
[INFO] [clock] freq avg: 12.34 Hz | period avg: 81.1 ms ± 159.8 ms
[INFO] [timer] freq avg: 3.56 Hz | period avg: 281.1 ms ± 284.3 ms
[INFO] [clock] freq avg: 6.63 Hz | period avg: 150.9 ms ± 224.4 ms
[INFO] [timer] freq avg: 4.11 Hz | period avg: 243.1 ms ± 200.5 ms
[INFO] [clock] freq avg: 7.62 Hz | period avg: 131.2 ms ± 196.3 ms
[INFO] [timer] freq avg: 3.67 Hz | period avg: 272.8 ms ± 287.1 ms
[INFO] [clock] freq avg: 6.29 Hz | period avg: 158.9 ms ± 176.0 ms
[INFO] [timer] freq avg: 5.19 Hz | period avg: 192.8 ms ± 143.3 ms

Mostly fixed by 9695271e

$ /usr/bin/time timeout 30s python3 bug1452-clocksubscriber.py
[INFO] Node started with multithreaded executor
[INFO] [clock] freq avg: 11.12 Hz | period avg: 89.9 ms ± 64.5 ms
[INFO] [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 0.3 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [clock] freq avg: 9.98 Hz | period avg: 100.2 ms ± 5.5 ms
[INFO] [clock] freq avg: 9.93 Hz | period avg: 100.7 ms ± 3.1 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [clock] freq avg: 10.04 Hz | period avg: 99.6 ms ± 14.7 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [clock] freq avg: 9.91 Hz | period avg: 100.9 ms ± 5.5 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 0.3 ms
[INFO] [clock] freq avg: 10.17 Hz | period avg: 98.4 ms ± 12.6 ms
Command exited with non-zero status 124
2.06user 0.29system 0:30.12elapsed 7%CPU (0avgtext+0avgdata 69504maxresident)k
0inputs+32outputs (0major+9436minor)pagefaults 0swaps

And slightly better again with the addition of the sleep(0):

$ /usr/bin/time timeout 30s python3 bug1452-clocksubscriber.py
[INFO] Node started with multithreaded executor
[INFO] [clock] freq avg: 11.88 Hz | period avg: 84.1 ms ± 61.2 ms
[INFO] [timer] freq avg: 9.99 Hz | period avg: 100.1 ms ± 0.5 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.1 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
[INFO] [timer] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.3 ms
[INFO] [clock] freq avg: 10.00 Hz | period avg: 100.0 ms ± 0.2 ms
Command exited with non-zero status 124
0.84user 0.29system 0:30.12elapsed 3%CPU (0avgtext+0avgdata 69120maxresident)k
0inputs+32outputs (0major+9438minor)pagefaults 0swaps

Still an improvement, but not the three orders of magnitude I claimed in the pull request.

@fujitatomoya (Collaborator) left a comment

@michaeltandy thanks a lot for sharing the information. I believe your analysis is correct, and this PR fixes that problematic situation in the MultiThreadedExecutor.

future.result() # raise any exceptions

# Yield GIL so executor threads have a chance to run.
time.sleep(0)

How about this? It can be more efficient: it directly tells the OS scheduler to yield the CPU to another thread, without any sleep/timer overhead, and it communicates the intent (yield the CPU) more clearly than sleeping.

Suggested change:

-    time.sleep(0)
+    os.sched_yield() if hasattr(os, 'sched_yield') else time.sleep(0)
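
For illustration only, here is a self-contained version of the suggested fallback pattern; _yield_to_worker_threads is a hypothetical helper name (not part of rclpy), and the hasattr guard is needed because os.sched_yield is only available on Unix-like platforms:

import os
import time

def _yield_to_worker_threads() -> None:
    # Hypothetical helper: let another thread get a turn on the CPU.
    if hasattr(os, 'sched_yield'):
        # Ask the OS scheduler to move this thread to the back of its run queue.
        os.sched_yield()
    else:
        # Fallback: a zero-length sleep, which still releases the GIL.
        time.sleep(0)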
