fix(cie): upgrade executor when agnocast entities arrive after callback group registration by atsushi421 · Pull Request #1267 · autowarefoundation/agnocast

atsushi421 · 2026-04-18T13:10:46Z

Description

In CallbackIsolatedAgnocastExecutor::spawn_child_executor and both ComponentManagerCallbackIsolated::start_executor_for_callback_group implementations, the executor type (SingleThreadedExecutor vs SingleThreadedAgnocastExecutor) is decided at the moment a callback group is first detected. If agnocast entities are added to a callback group after the executor has already been spawned, the SingleThreadedExecutor is never replaced, and the agnocast entities silently fail to execute.

This PR adds a monitoring-time upgrade mechanism:

- The monitoring loop periodically re-checks callback groups that were assigned a SingleThreadedExecutor (identified using dynamic_pointer_cast).
- When get_agnocast_topics_by_group() returns non-empty for such a group, the old executor is cancelled, its thread is joined, and the group is re-spawned with a SingleThreadedAgnocastExecutor.
- This avoids using SingleThreadedAgnocastExecutor unconditionally, which would add ~50ms latency per spin iteration for ROS-only groups because get_next_agnocast_executable blocks.
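The upgrade condition described above can be sketched in isolation. This is a minimal illustration, not the PR's code: the three executor structs are hypothetical stand-ins for the rclcpp and agnocast classes (modelled here as sibling types under a common base, which is an assumption), and `needs_upgrade` is a name invented for this example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for rclcpp::Executor and the two executor types.
struct Executor
{
  virtual ~Executor() = default;
};
struct SingleThreadedExecutor : Executor
{
};
struct SingleThreadedAgnocastExecutor : Executor
{
};

// Hypothetical helper: true when a group that is currently served by a plain
// SingleThreadedExecutor has gained at least one agnocast topic.
bool needs_upgrade(
  const std::shared_ptr<Executor> & executor, const std::vector<std::string> & agnocast_topics)
{
  // dynamic_pointer_cast yields nullptr unless the dynamic type matches.
  const bool ros_only = std::dynamic_pointer_cast<SingleThreadedExecutor>(executor) != nullptr;
  return ros_only && !agnocast_topics.empty();
}
```

Groups already served by the agnocast-aware type fail the cast and are skipped, so the check is only paid for ROS-only groups.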

Key design decisions:

- Deadlock prevention: In CallbackIsolatedAgnocastExecutor, the upgrade loop only checks pre-existing entries (recorded before spawning new groups). Newly spawned executors may not have entered spin() yet, so cancelling them could deadlock because cancel() sets spinning=false but spin() resets it via spinning.exchange(true).
- Thread join ordering: In CallbackIsolatedAgnocastExecutor, threads are joined outside child_resources_mutex_ to avoid deadlock (child threads' callbacks may acquire this mutex). In the component containers, the existing cancel_executor pattern (join under lock) is reused since the child threads don't acquire executor_wrappers_mutex_.
- WeakPtr for ownership: ExecutorWrapper fields (callback_group_, node_) use WeakPtr to avoid extending object lifetimes, consistent with CallbackIsolatedAgnocastExecutor's child_nodes_ pattern.
- Explicit flag clearing: get_associated_with_executor_atomic().store(false) is called explicitly after the old executor is destroyed, as a defensive measure (the rclcpp::Executor destructor also clears this flag).
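The lost-cancel hazard behind the deadlock-prevention point can be reproduced with a single atomic flag. The sketch below is an assumption-laden miniature: `MiniExecutor`, `enter_spin`, and the two helper functions are hypothetical names, and only the flag handling named in the description (cancel() storing false, spin() calling exchange(true)) is modelled.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical miniature of an executor's spinning flag.
struct MiniExecutor
{
  std::atomic_bool spinning{false};

  // Models cancel(): request that the spin loop stop.
  void cancel() { spinning.store(false); }

  // Models only the entry of spin(): mark the executor as spinning and
  // return the previous value of the flag.
  bool enter_spin() { return spinning.exchange(true); }
};

// cancel() issued before the thread reaches spin() is overwritten, so the
// spin loop would keep running and a later join() would never return.
bool cancel_before_spin_is_lost()
{
  MiniExecutor e;
  e.cancel();
  e.enter_spin();
  return e.spinning.load();
}

// cancel() issued after spin() has started is observed by the loop.
bool cancel_after_spin_is_observed()
{
  MiniExecutor e;
  e.enter_spin();
  e.cancel();
  return !e.spinning.load();
}
```

Restricting the upgrade loop to entries that existed before the current iteration gives those executors at least one monitoring period to enter spin() before they can be cancelled.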

How was this PR tested?

Autoware (required)
bash scripts/test/e2e_test_1to1.bash (required)
bash scripts/test/e2e_test_2to2.bash (required)
kunit tests (required when modifying the kernel module)
bash scripts/test/run_requires_kernel_module_tests.bash (required)
sample application

Notes for reviewers

Version Update Label (Required)

Please add exactly one of the following labels to this PR:

need-major-update: User API breaking changes
need-minor-update: Internal API breaking changes (heaphook/kmod/agnocastlib compatibility)
need-patch-update: Bug fixes and other changes

Important notes:

If you need need-major-update or need-minor-update, please include this in the PR title as well.
- Example: fix(foo)[needs major version update]: bar or feat(baz)[needs minor version update]: qux
After receiving approval from reviewers, add the run-build-test label. The PR can only be merged after the build tests pass.

See CONTRIBUTING.md for detailed versioning rules.

…ck group registration

When a callback group is detected by the monitoring loop before any agnocast entities are added to it, a plain SingleThreadedExecutor is spawned. If agnocast subscriptions are subsequently added, they never execute because the wrong executor type was selected.

Fix this by periodically re-checking callback groups that were assigned a SingleThreadedExecutor. When agnocast topics appear, the old executor is stopped and replaced with a SingleThreadedAgnocastExecutor. This avoids always using SingleThreadedAgnocastExecutor (which adds 50ms latency per spin for ROS-only groups) while ensuring late-arriving agnocast entities are handled correctly.

Closes #1263

Signed-off-by: atsushi421 <atsushi.yano.2@tier4.jp>

- Prevent potential deadlock by limiting upgrade loop to pre-existing entries (skip just-spawned executors that may not have entered spin())
- Change ExecutorWrapper fields to WeakPtr for consistent ownership model
- Add RCLCPP_WARN when node expires during executor upgrade
- Unify log severity to RCLCPP_WARN across both component container copies

Signed-off-by: atsushi421 <atsushi.yano.2@tier4.jp>

Copilot

Pull request overview

This PR fixes a race in callback-isolated execution where callback groups that initially look “ROS-only” can later receive agnocast entities, requiring a switch to SingleThreadedAgnocastExecutor without paying the agnocast spin penalty for all groups.

Changes:

Track node ownership per child executor / wrapper so groups can be re-spawned with the correct executor type later.
Add monitoring-time “upgrade” logic: if agnocast topics appear for a group currently served by a plain SingleThreadedExecutor, cancel/join that executor thread and replace it with SingleThreadedAgnocastExecutor.
Apply the same upgrade behavior in both agnocast_components and deprecated agnocastlib component-container implementations.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/agnocastlib/src/agnocast_component_container_cie.cpp | Adds callback-group/node tracking to executor wrappers and upgrades ROS-only executors to agnocast-aware ones when topics appear. |
| src/agnocastlib/src/agnocast_callback_isolated_executor.cpp | Tracks child nodes and adds upgrade flow in the monitoring loop to replace ROS-only child executors when agnocast entities appear later. |
| src/agnocastlib/include/agnocast/agnocast_callback_isolated_executor.hpp | Adds `child_nodes_` to keep node↔group↔executor vectors aligned for upgrade/stop operations. |
| src/agnocast_components/src/agnocast_component_container_cie.cpp | Same upgrade mechanism as the deprecated copy, with wrapper fields for callback group and node. |


atsushi421 · 2026-04-18T14:09:52Z

+
+      if (node) {
+        start_executor_for_callback_group(node_id, callback_group, node);


Fixed in c2f9c76. Same fix as the non-deprecated variant: added explicit callback_group->get_associated_with_executor_atomic().store(false) after erase and before re-spawning.

atsushi421 · 2026-04-18T14:09:46Z

+        auto agnocast_topics = agnocast::get_agnocast_topics_by_group(group);
+        if (agnocast_topics.empty()) {
+          ++i;
+          continue;
+        }
+
+        // Agnocast entities appeared — schedule upgrade
+        RCLCPP_INFO(
+          logger,
+          "Agnocast topics detected in callback group previously assigned a ROS-only executor. "
+          "Upgrading to SingleThreadedAgnocastExecutor.");
+
+        auto node = child_nodes_[i].lock();
+        executor->cancel();
+
+        UpgradeInfo info;
+        info.group = group;
+        info.node = node;
+        info.thread = std::move(child_threads_[i]);
+        upgrades.push_back(std::move(info));
+
+        auto idx = static_cast<std::ptrdiff_t>(i);
+        child_callback_groups_.erase(child_callback_groups_.begin() + idx);
+        child_nodes_.erase(child_nodes_.begin() + idx);
+        weak_child_executors_.erase(weak_child_executors_.begin() + idx);
+        child_threads_.erase(child_threads_.begin() + idx);
+        --pre_existing_count;
+        // Don't increment i; the next element shifted into this position
+      }
    }

-    std::lock_guard<std::mutex> guard{child_resources_mutex_};
-    if (!spinning.load() || !rclcpp::ok()) {
-      break;
+    // Join old threads outside the lock to avoid deadlock
+    for (auto & upgrade : upgrades) {
+      if (upgrade.thread.joinable()) {
+        upgrade.thread.join();
+      }
    }
-    for (auto & [group, node] : new_groups) {
-      if (group->get_associated_with_executor_atomic().load()) {
-        continue;
+
+    // Re-spawn upgraded groups with SingleThreadedAgnocastExecutor
+    if (!upgrades.empty()) {
+      std::lock_guard<std::mutex> guard{child_resources_mutex_};
+      if (!spinning.load() || !rclcpp::ok()) {
+        break;
+      }
+      for (auto & upgrade : upgrades) {
+        if (upgrade.node) {
+          spawn_child_executor(upgrade.group, upgrade.node);
+        } else {
+          RCLCPP_WARN(
+            logger,
+            "Node expired during executor upgrade; callback group will no longer be served.");
+        }


Fixed in c2f9c76. Added explicit get_associated_with_executor_atomic().store(false) after thread join and before re-spawning. The rclcpp::Executor destructor already clears this flag when the old executor is destroyed, but the explicit clear serves as defensive programming and makes the intent self-documenting.
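The join-outside-the-lock ordering used in the hunk above can be illustrated without rclcpp. In this sketch `Registry`, `spawn`, and `stop_all` are hypothetical names; the worker deliberately takes the same mutex as the owner, standing in for a child callback that may acquire child_resources_mutex_.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical owner of worker threads guarded by a single mutex.
struct Registry
{
  std::mutex mutex;
  std::vector<std::thread> threads;
  std::atomic<int> finished{0};

  void spawn()
  {
    std::lock_guard<std::mutex> guard{mutex};
    threads.emplace_back([this] {
      // The worker needs the registry mutex before it can finish.
      std::lock_guard<std::mutex> inner{mutex};
      ++finished;
    });
  }

  // Moves the threads out while holding the lock, then joins them after the
  // lock is released. Joining while still holding `mutex` could block forever,
  // because a worker may be waiting for that same mutex.
  int stop_all()
  {
    std::vector<std::thread> to_join;
    {
      std::lock_guard<std::mutex> guard{mutex};
      to_join = std::move(threads);
      threads.clear();
    }
    for (auto & t : to_join) {
      if (t.joinable()) {
        t.join();
      }
    }
    return finished.load();
  }
};
```

The same shape appears in the diff: thread handles are moved into `upgrades` under the lock, joined with the lock released, and the lock is re-acquired only for the re-spawn.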

atsushi421 · 2026-04-18T14:09:49Z

+    // Upgrade ROS-only executors that now have agnocast topics
+    struct UpgradeInfo
+    {
+      rclcpp::CallbackGroup::SharedPtr group;
+      rclcpp::node_interfaces::NodeBaseInterface::SharedPtr node;
+      std::thread thread;
+    };
+    std::vector<UpgradeInfo> upgrades;
+
+    {
+      std::lock_guard<std::mutex> guard{child_resources_mutex_};
+      if (!spinning.load() || !rclcpp::ok()) {
+        break;
+      }
+
+      // Record the count before spawning so the upgrade loop only checks pre-existing entries.
+      // Newly spawned executors may not have entered spin() yet, so cancelling them could
+      // deadlock (cancel sets spinning=false, but spin() resets it to true).
+      auto pre_existing_count = child_callback_groups_.size();
+
+      // Spawn executors for newly discovered callback groups
+      for (auto & [group, node] : new_groups) {
+        if (group->get_associated_with_executor_atomic().load()) {
+          continue;
+        }
+        spawn_child_executor(group, node);
+      }
+
+      // Check existing ROS-only executors for late-arriving agnocast entities
+      for (size_t i = 0; i < pre_existing_count;) {
+        auto group = child_callback_groups_[i].lock();
+        if (!group) {
+          ++i;
+          continue;
+        }
+
+        auto executor = weak_child_executors_[i].lock();
+        if (!executor) {
+          ++i;
+          continue;
+        }
+
+        // Only check groups running under a plain SingleThreadedExecutor
+        if (!std::dynamic_pointer_cast<rclcpp::executors::SingleThreadedExecutor>(executor)) {
+          ++i;
+          continue;
+        }
+
+        auto agnocast_topics = agnocast::get_agnocast_topics_by_group(group);
+        if (agnocast_topics.empty()) {
+          ++i;
+          continue;
+        }
+
+        // Agnocast entities appeared — schedule upgrade
+        RCLCPP_INFO(
+          logger,
+          "Agnocast topics detected in callback group previously assigned a ROS-only executor. "
+          "Upgrading to SingleThreadedAgnocastExecutor.");
+
+        auto node = child_nodes_[i].lock();
+        executor->cancel();
+
+        UpgradeInfo info;
+        info.group = group;
+        info.node = node;
+        info.thread = std::move(child_threads_[i]);
+        upgrades.push_back(std::move(info));
+
+        auto idx = static_cast<std::ptrdiff_t>(i);
+        child_callback_groups_.erase(child_callback_groups_.begin() + idx);
+        child_nodes_.erase(child_nodes_.begin() + idx);
+        weak_child_executors_.erase(weak_child_executors_.begin() + idx);
+        child_threads_.erase(child_threads_.begin() + idx);
+        --pre_existing_count;
+        // Don't increment i; the next element shifted into this position
+      }
    }

-    std::lock_guard<std::mutex> guard{child_resources_mutex_};
-    if (!spinning.load() || !rclcpp::ok()) {
-      break;
+    // Join old threads outside the lock to avoid deadlock
+    for (auto & upgrade : upgrades) {
+      if (upgrade.thread.joinable()) {
+        upgrade.thread.join();
+      }
    }
-    for (auto & [group, node] : new_groups) {
-      if (group->get_associated_with_executor_atomic().load()) {
-        continue;
+
+    // Re-spawn upgraded groups with SingleThreadedAgnocastExecutor
+    if (!upgrades.empty()) {
+      std::lock_guard<std::mutex> guard{child_resources_mutex_};
+      if (!spinning.load() || !rclcpp::ok()) {
+        break;
+      }
+      for (auto & upgrade : upgrades) {
+        if (upgrade.node) {
+          spawn_child_executor(upgrade.group, upgrade.node);


The upgrade path requires the agnocast kernel module to register agnocast entities at runtime, which is not available in the unit test environment. The existing integration tests (test_agnocast_only_callback_isolated_executor) also require the kernel module. The fix has been verified through code review and build testing.
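The index bookkeeping of the upgrade loop does not depend on the kernel module and can be examined in isolation. The sketch below is a simplified model under stated assumptions: `extract_flagged` is a hypothetical function, plain `int` ids replace the parallel vectors of groups, nodes, executors, and threads, and a `flagged` vector replaces the "is ROS-only and has agnocast topics" check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Removes flagged entries among the first `pre_existing_count` elements of two
// parallel vectors and returns the removed ids. Elements at or beyond the
// original bound (the just-spawned ones) are never inspected.
std::vector<int> extract_flagged(
  std::vector<int> & ids, std::vector<bool> & flagged, std::size_t pre_existing_count)
{
  std::vector<int> removed;
  for (std::size_t i = 0; i < pre_existing_count;) {
    if (!flagged[i]) {
      ++i;
      continue;
    }
    removed.push_back(ids[i]);
    const auto idx = static_cast<std::ptrdiff_t>(i);
    ids.erase(ids.begin() + idx);
    flagged.erase(flagged.begin() + idx);
    // One pre-existing entry is gone, so the bound shrinks with it.
    --pre_existing_count;
    // i is not incremented: the next element shifted into position i.
  }
  return removed;
}
```

With ids {10, 11, 12, 13}, flags {true, false, true, true}, and a bound of 3, the entries 10 and 12 are extracted while 13 stays in place even though it is flagged, which mirrors how just-spawned executors are skipped.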

atsushi421 · 2026-04-18T14:09:51Z

+      cancel_executor(wrapper);
+      it = executor_wrappers.erase(it);
+
+      if (node) {
+        start_executor_for_callback_group(node_id, callback_group, node);


Fixed in c2f9c76. Added explicit callback_group->get_associated_with_executor_atomic().store(false) after erase(it) destroys the old executor wrapper and before calling start_executor_for_callback_group().
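Why the explicit store(false) is harmless when the destructor has already cleared the flag can be shown with a small model. Everything here is a hypothetical stand-in: `CallbackGroup` and `Executor` only imitate the association flag, under the assumption that attaching a group fails when the flag is still set, and `respawn_succeeds` is a name invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in exposing only the association flag.
struct CallbackGroup
{
  std::atomic_bool associated{false};
  std::atomic_bool & get_associated_with_executor_atomic() { return associated; }
};

// Hypothetical executor: claims the group on construction and releases it on
// destruction. Claiming an already-claimed group is an error.
struct Executor
{
  std::shared_ptr<CallbackGroup> group;

  explicit Executor(std::shared_ptr<CallbackGroup> g) : group(std::move(g))
  {
    if (group->get_associated_with_executor_atomic().exchange(true)) {
      throw std::runtime_error("callback group already has an executor");
    }
  }
  ~Executor() { group->get_associated_with_executor_atomic().store(false); }
};

// Destroys the old executor, clears the flag defensively, then re-attaches.
bool respawn_succeeds(const std::shared_ptr<CallbackGroup> & group)
{
  auto old_executor = std::make_unique<Executor>(group);
  old_executor.reset();                                        // destructor clears the flag
  group->get_associated_with_executor_atomic().store(false);  // idempotent defensive clear
  try {
    Executor fresh{group};
    return group->get_associated_with_executor_atomic().load();
  } catch (const std::runtime_error &) {
    return false;
  }
}
```

The second store is a no-op in this model; it only matters if some other owner keeps the old executor alive longer than expected, which is the lifetime-ordering edge case the reply refers to.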

The rclcpp::Executor destructor already clears this flag, but adding explicit store(false) makes the intent self-documenting and guards against edge cases in executor lifetime ordering.

Signed-off-by: atsushi421 <atsushi.yano.2@tier4.jp>

github-actions Bot assigned atsushi421 Apr 18, 2026

atsushi421 requested a review from Copilot April 18, 2026 13:21

Copilot AI reviewed Apr 18, 2026

atsushi421 wants to merge 3 commits into main from fix-executor-race
