fix(cie): upgrade executor when agnocast entities arrive after callback group registration#1267

Draft
atsushi421 wants to merge 3 commits into main from fix-executor-race

@atsushi421 atsushi421 commented Apr 18, 2026

Description

In CallbackIsolatedAgnocastExecutor::spawn_child_executor and both ComponentManagerCallbackIsolated::start_executor_for_callback_group implementations, the executor type (SingleThreadedExecutor vs SingleThreadedAgnocastExecutor) is decided at the moment a callback group is first detected. If agnocast entities are added to a callback group after the executor has already been spawned, the SingleThreadedExecutor is never replaced, and the agnocast entities silently fail to execute.

This PR adds a monitoring-time upgrade mechanism:

  • The monitoring loop periodically re-checks callback groups that are currently served by a plain SingleThreadedExecutor; such executors are identified with dynamic_pointer_cast.
  • When get_agnocast_topics_by_group() returns non-empty for such a group, the old executor is cancelled, its thread is joined, and the group is re-spawned with a SingleThreadedAgnocastExecutor.
  • This avoids always using SingleThreadedAgnocastExecutor (which adds ~50ms latency per spin iteration for ROS-only groups due to get_next_agnocast_executable blocking).
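
The upgrade decision described in the list above can be sketched in isolation. This is a minimal sketch, not the PR's code: ExecutorBase, RosOnlyExecutor, AgnocastExecutor, and needs_upgrade are hypothetical stand-ins, and it assumes the two executor types are siblings so that the cast fails for the agnocast-aware one.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the real executor classes (not the rclcpp/agnocast types).
struct ExecutorBase
{
  virtual ~ExecutorBase() = default;  // polymorphic base so dynamic_pointer_cast works
};
struct RosOnlyExecutor : ExecutorBase {};   // models SingleThreadedExecutor
struct AgnocastExecutor : ExecutorBase {};  // models SingleThreadedAgnocastExecutor

// Returns true when a group served by a ROS-only executor has gained agnocast topics.
// Assumption: the two executor types are siblings, so the cast fails for AgnocastExecutor.
bool needs_upgrade(const std::shared_ptr<ExecutorBase> & executor, bool has_agnocast_topics)
{
  if (!executor) {
    return false;  // executor already gone; nothing to upgrade
  }
  const bool is_ros_only = std::dynamic_pointer_cast<RosOnlyExecutor>(executor) != nullptr;
  return is_ros_only && has_agnocast_topics;
}
```

Groups that already run under the agnocast-aware executor, or that still have no agnocast topics, are left alone, which is what keeps ROS-only groups on the cheaper executor.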

Key design decisions:

  • Deadlock prevention: In CallbackIsolatedAgnocastExecutor, the upgrade loop only checks pre-existing entries (recorded before spawning new groups). Newly spawned executors may not have entered spin() yet, so cancelling them could deadlock because cancel() sets spinning=false but spin() resets it via spinning.exchange(true).
  • Thread join ordering: In CallbackIsolatedAgnocastExecutor, threads are joined outside child_resources_mutex_ to avoid deadlock (child threads' callbacks may acquire this mutex). In the component containers, the existing cancel_executor pattern (join under lock) is reused since the child threads don't acquire executor_wrappers_mutex_.
  • WeakPtr for ownership: ExecutorWrapper fields (callback_group_, node_) use WeakPtr to avoid extending object lifetimes, consistent with CallbackIsolatedAgnocastExecutor's child_nodes_ pattern.
  • Explicit flag clearing: get_associated_with_executor_atomic().store(false) is called explicitly after the old executor is destroyed, as a defensive measure (the rclcpp::Executor destructor also clears this flag).
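
The deadlock-prevention point above hinges on the order of cancel() and spin(). The toy model below is a sketch under stated assumptions, not rclcpp's implementation: ToyExecutor and its methods are hypothetical, and only the flag behaviour described in the bullet (cancel() stores false, spin() does exchange(true)) is modelled.

```cpp
#include <atomic>
#include <cassert>

// Toy model of the spinning flag described above; not the real rclcpp::Executor.
class ToyExecutor
{
public:
  // Models the start of spin(): claims the flag, overwriting any earlier cancel().
  void enter_spin() { spinning_.exchange(true); }

  // Models cancel(): only clears the flag.
  void cancel() { spinning_.store(false); }

  bool is_spinning() const { return spinning_.load(); }

private:
  std::atomic<bool> spinning_{false};
};

// Returns true if a cancel() issued before spin() has started is lost.
bool early_cancel_is_lost()
{
  ToyExecutor executor;
  executor.cancel();              // monitoring thread cancels too early
  executor.enter_spin();          // child thread then enters spin()
  return executor.is_spinning();  // still spinning, so a later join() would block
}

// Returns true if a cancel() issued after spin() has started takes effect.
bool late_cancel_takes_effect()
{
  ToyExecutor executor;
  executor.enter_spin();
  executor.cancel();
  return !executor.is_spinning();
}
```

In this model an early cancel leaves the executor spinning, which is why the upgrade loop restricts itself to entries that existed before the current iteration spawned new executors.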

Related links

How was this PR tested?

  • Autoware (required)
  • bash scripts/test/e2e_test_1to1.bash (required)
  • bash scripts/test/e2e_test_2to2.bash (required)
  • kunit tests (required when modifying the kernel module)
  • bash scripts/test/run_requires_kernel_module_tests.bash (required)
  • sample application

Notes for reviewers

Version Update Label (Required)

Please add exactly one of the following labels to this PR:

  • need-major-update: User API breaking changes
  • need-minor-update: Internal API breaking changes (heaphook/kmod/agnocastlib compatibility)
  • need-patch-update: Bug fixes and other changes

Important notes:

  • If you need need-major-update or need-minor-update, please include this in the PR title as well.
    • Example: fix(foo)[needs major version update]: bar or feat(baz)[needs minor version update]: qux
  • After receiving approval from reviewers, add the run-build-test label. The PR can only be merged after the build tests pass.

See CONTRIBUTING.md for detailed versioning rules.

…ck group registration

When a callback group is detected by the monitoring loop before any
agnocast entities are added to it, a plain SingleThreadedExecutor is
spawned. If agnocast subscriptions are subsequently added, they never
execute because the wrong executor type was selected.

Fix this by periodically re-checking callback groups that were assigned a
SingleThreadedExecutor. When agnocast topics appear, the old executor is
stopped and replaced with a SingleThreadedAgnocastExecutor. This avoids
always using SingleThreadedAgnocastExecutor (which adds 50ms latency per
spin for ROS-only groups) while ensuring late-arriving agnocast entities
are handled correctly.

Closes #1263

Signed-off-by: atsushi421 <atsushi.yano.2@tier4.jp>
- Prevent potential deadlock by limiting upgrade loop to pre-existing
  entries (skip just-spawned executors that may not have entered spin())
- Change ExecutorWrapper fields to WeakPtr for consistent ownership model
- Add RCLCPP_WARN when node expires during executor upgrade
- Unify log severity to RCLCPP_WARN across both component container copies

Signed-off-by: atsushi421 <atsushi.yano.2@tier4.jp>
@atsushi421 atsushi421 requested a review from Copilot April 18, 2026 13:21

Copilot AI left a comment

Pull request overview

This PR fixes a race in callback-isolated execution where callback groups that initially look “ROS-only” can later receive agnocast entities, requiring a switch to SingleThreadedAgnocastExecutor without paying the agnocast spin penalty for all groups.

Changes:

  • Track node ownership per child executor / wrapper so groups can be re-spawned with the correct executor type later.
  • Add monitoring-time “upgrade” logic: if agnocast topics appear for a group currently served by a plain SingleThreadedExecutor, cancel/join that executor thread and replace it with SingleThreadedAgnocastExecutor.
  • Apply the same upgrade behavior in both agnocast_components and deprecated agnocastlib component-container implementations.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File | Description
src/agnocastlib/src/agnocast_component_container_cie.cpp | Adds callback-group/node tracking to executor wrappers and upgrades ROS-only executors to agnocast-aware ones when topics appear.
src/agnocastlib/src/agnocast_callback_isolated_executor.cpp | Tracks child nodes and adds upgrade flow in the monitoring loop to replace ROS-only child executors when agnocast entities appear later.
src/agnocastlib/include/agnocast/agnocast_callback_isolated_executor.hpp | Adds child_nodes_ to keep node↔group↔executor vectors aligned for upgrade/stop operations.
src/agnocast_components/src/agnocast_component_container_cie.cpp | Same upgrade mechanism as the deprecated copy, with wrapper fields for callback group and node.
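
Keeping several vectors aligned while erasing during iteration is easy to get wrong, so a small sketch may help. ChildTable, wants_upgrade, and extract_upgrades are hypothetical names; wants_upgrade stands in for the agnocast-topic check, and strings stand in for the weak pointers the real code stores.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical parallel vectors; index i in each one describes the same child.
struct ChildTable
{
  std::vector<std::string> groups;
  std::vector<std::string> nodes;
  std::vector<bool> wants_upgrade;  // stands in for the agnocast-topic check
};

// Removes flagged children from all vectors at once and returns their group names.
// Only the first pre_existing_count entries are examined; later entries are skipped.
std::vector<std::string> extract_upgrades(ChildTable & table, std::size_t pre_existing_count)
{
  std::vector<std::string> removed;
  for (std::size_t i = 0; i < pre_existing_count;) {
    if (!table.wants_upgrade[i]) {
      ++i;
      continue;
    }
    removed.push_back(table.groups[i]);
    const auto idx = static_cast<std::ptrdiff_t>(i);
    table.groups.erase(table.groups.begin() + idx);
    table.nodes.erase(table.nodes.begin() + idx);
    table.wants_upgrade.erase(table.wants_upgrade.begin() + idx);
    --pre_existing_count;  // one fewer pre-existing entry remains
    // i is not incremented: the next element has shifted into position i
  }
  return removed;
}
```

Erasing the same index from every vector in one step is what keeps the entries aligned; decrementing the bound instead of advancing the index avoids skipping the element that slides into the vacated slot.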

Comment on lines +276 to +278

if (node) {
  start_executor_for_callback_group(node_id, callback_group, node);

atsushi421 (Collaborator, Author) commented:

Fixed in c2f9c76. Same fix as the non-deprecated variant: added explicit callback_group->get_associated_with_executor_atomic().store(false) after erase and before re-spawning.

Comment on lines +196 to +247
    auto agnocast_topics = agnocast::get_agnocast_topics_by_group(group);
    if (agnocast_topics.empty()) {
      ++i;
      continue;
    }

    // Agnocast entities appeared - schedule upgrade
    RCLCPP_INFO(
      logger,
      "Agnocast topics detected in callback group previously assigned a ROS-only executor. "
      "Upgrading to SingleThreadedAgnocastExecutor.");

    auto node = child_nodes_[i].lock();
    executor->cancel();

    UpgradeInfo info;
    info.group = group;
    info.node = node;
    info.thread = std::move(child_threads_[i]);
    upgrades.push_back(std::move(info));

    auto idx = static_cast<std::ptrdiff_t>(i);
    child_callback_groups_.erase(child_callback_groups_.begin() + idx);
    child_nodes_.erase(child_nodes_.begin() + idx);
    weak_child_executors_.erase(weak_child_executors_.begin() + idx);
    child_threads_.erase(child_threads_.begin() + idx);
    --pre_existing_count;
    // Don't increment i; the next element shifted into this position
  }
}

// Join old threads outside the lock to avoid deadlock
for (auto & upgrade : upgrades) {
  if (upgrade.thread.joinable()) {
    upgrade.thread.join();
  }
}

// Re-spawn upgraded groups with SingleThreadedAgnocastExecutor
if (!upgrades.empty()) {
  std::lock_guard<std::mutex> guard{child_resources_mutex_};
  if (!spinning.load() || !rclcpp::ok()) {
    break;
  }
  for (auto & upgrade : upgrades) {
    if (upgrade.node) {
      spawn_child_executor(upgrade.group, upgrade.node);
    } else {
      RCLCPP_WARN(
        logger,
        "Node expired during executor upgrade; callback group will no longer be served.");
    }

atsushi421 (Collaborator, Author) commented:

Fixed in c2f9c76. Added explicit get_associated_with_executor_atomic().store(false) after thread join and before re-spawning. The rclcpp::Executor destructor already clears this flag when the old executor is destroyed, but the explicit clear serves as defensive programming and makes the intent self-documenting.
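
The ordering used in this thread (detach the threads under the lock, join them after releasing it) can be sketched with the standard library alone. Supervisor and its members are hypothetical; its mutex plays the role of child_resources_mutex_, and the worker body stands in for a callback that needs that mutex before it can finish.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical owner of worker threads; mutex_ plays the role of child_resources_mutex_.
class Supervisor
{
public:
  // Starts a worker that needs the mutex before it can finish.
  void spawn()
  {
    std::lock_guard<std::mutex> guard{mutex_};
    threads_.emplace_back([this] {
      std::lock_guard<std::mutex> inner{mutex_};  // worker touches shared state
      ++finished_;
    });
  }

  // Moves the threads out under the lock, then joins them after releasing it.
  // Joining while still holding mutex_ would deadlock: the worker above could
  // never acquire the mutex it needs in order to exit.
  void stop_all()
  {
    std::vector<std::thread> to_join;
    {
      std::lock_guard<std::mutex> guard{mutex_};
      to_join = std::move(threads_);
      threads_.clear();  // leave the member in a known empty state
    }
    for (auto & thread : to_join) {
      if (thread.joinable()) {
        thread.join();
      }
    }
  }

  int finished()
  {
    std::lock_guard<std::mutex> guard{mutex_};
    return finished_;
  }

private:
  std::mutex mutex_;
  std::vector<std::thread> threads_;
  int finished_{0};
};
```

Only after every old thread has been joined is it safe to re-acquire the lock and spawn replacements, which matches the join, clear, re-spawn order described in the comment above.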

Comment on lines +148 to +242
// Upgrade ROS-only executors that now have agnocast topics
struct UpgradeInfo
{
  rclcpp::CallbackGroup::SharedPtr group;
  rclcpp::node_interfaces::NodeBaseInterface::SharedPtr node;
  std::thread thread;
};
std::vector<UpgradeInfo> upgrades;

{
  std::lock_guard<std::mutex> guard{child_resources_mutex_};
  if (!spinning.load() || !rclcpp::ok()) {
    break;
  }

  // Record the count before spawning so the upgrade loop only checks pre-existing entries.
  // Newly spawned executors may not have entered spin() yet, so cancelling them could
  // deadlock (cancel sets spinning=false, but spin() resets it to true).
  auto pre_existing_count = child_callback_groups_.size();

  // Spawn executors for newly discovered callback groups
  for (auto & [group, node] : new_groups) {
    if (group->get_associated_with_executor_atomic().load()) {
      continue;
    }
    spawn_child_executor(group, node);
  }

  // Check existing ROS-only executors for late-arriving agnocast entities
  for (size_t i = 0; i < pre_existing_count;) {
    auto group = child_callback_groups_[i].lock();
    if (!group) {
      ++i;
      continue;
    }

    auto executor = weak_child_executors_[i].lock();
    if (!executor) {
      ++i;
      continue;
    }

    // Only check groups running under a plain SingleThreadedExecutor
    if (!std::dynamic_pointer_cast<rclcpp::executors::SingleThreadedExecutor>(executor)) {
      ++i;
      continue;
    }

    auto agnocast_topics = agnocast::get_agnocast_topics_by_group(group);
    if (agnocast_topics.empty()) {
      ++i;
      continue;
    }

    // Agnocast entities appeared - schedule upgrade
    RCLCPP_INFO(
      logger,
      "Agnocast topics detected in callback group previously assigned a ROS-only executor. "
      "Upgrading to SingleThreadedAgnocastExecutor.");

    auto node = child_nodes_[i].lock();
    executor->cancel();

    UpgradeInfo info;
    info.group = group;
    info.node = node;
    info.thread = std::move(child_threads_[i]);
    upgrades.push_back(std::move(info));

    auto idx = static_cast<std::ptrdiff_t>(i);
    child_callback_groups_.erase(child_callback_groups_.begin() + idx);
    child_nodes_.erase(child_nodes_.begin() + idx);
    weak_child_executors_.erase(weak_child_executors_.begin() + idx);
    child_threads_.erase(child_threads_.begin() + idx);
    --pre_existing_count;
    // Don't increment i; the next element shifted into this position
  }
}

// Join old threads outside the lock to avoid deadlock
for (auto & upgrade : upgrades) {
  if (upgrade.thread.joinable()) {
    upgrade.thread.join();
  }
}

// Re-spawn upgraded groups with SingleThreadedAgnocastExecutor
if (!upgrades.empty()) {
  std::lock_guard<std::mutex> guard{child_resources_mutex_};
  if (!spinning.load() || !rclcpp::ok()) {
    break;
  }
  for (auto & upgrade : upgrades) {
    if (upgrade.node) {
      spawn_child_executor(upgrade.group, upgrade.node);

atsushi421 (Collaborator, Author) commented:

The upgrade path requires the agnocast kernel module to register agnocast entities at runtime, which is not available in the unit test environment. The existing integration tests (test_agnocast_only_callback_isolated_executor) also require the kernel module. The fix has been verified through code review and build testing.

Comment on lines +275 to +279
cancel_executor(wrapper);
it = executor_wrappers.erase(it);

if (node) {
  start_executor_for_callback_group(node_id, callback_group, node);

atsushi421 (Collaborator, Author) commented:

Fixed in c2f9c76. Added explicit callback_group->get_associated_with_executor_atomic().store(false) after erase(it) destroys the old executor wrapper and before calling start_executor_for_callback_group().
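
Why the flag matters between teardown and re-spawn can be shown with a toy model. ToyCallbackGroup, try_attach, and release are hypothetical; the sketch assumes that an executor refuses a callback group whose "associated with executor" flag is already set, which is how the flag is treated in this PR's monitoring loop.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical callback group holding only the "associated with executor" flag.
struct ToyCallbackGroup
{
  std::atomic<bool> associated_with_executor{false};
};

// Models an executor claiming a group: succeeds only if no executor owned it.
bool try_attach(ToyCallbackGroup & group)
{
  return !group.associated_with_executor.exchange(true);
}

// Models the explicit clear performed between tearing down the old executor
// and starting the new one.
void release(ToyCallbackGroup & group)
{
  group.associated_with_executor.store(false);
}
```

If the flag were still set when the replacement executor is started, the group would look as if it were already served and would be skipped; clearing it explicitly removes any dependence on exactly when the old executor's destructor runs.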

The rclcpp::Executor destructor already clears this flag, but adding
explicit store(false) makes the intent self-documenting and guards
against edge cases in executor lifetime ordering.

Signed-off-by: atsushi421 <atsushi.yano.2@tier4.jp>