Checklist
Description
The autoware_behavior_path_planner node sometimes crashes when a new planning goal is published. In everyday operations we hit this issue roughly once a day. It happens at random and is very disruptive for operators, because the only way to recover is to restart the whole stack.
Expected behavior
The autoware_behavior_path_planner node does not crash after a new planning goal is published.
Actual behavior
The autoware_behavior_path_planner node sometimes crashes when a new planning goal is published.
Logs of the crash:
[component_container_mt-62] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[component_container_mt-62] what(): failed to add guard condition to wait set: guard condition implementation is invalid, at ./src/rcl/guard_condition.c:172, at ./src/rcl/wait.c:460
[component_container_mt-62] *** Aborted at 1775119751 (unix time) try "date -d @1775119751" if you are using GNU date ***
[component_container_mt-62] PC: @ 0x0 (unknown)
[component_container_mt-62] *** SIGABRT (@0x3e800000383) received by PID 899 (TID 0x7ed64f7fe640) from PID 899; stack trace: ***
[component_container_mt-62] @ 0x7ed65f7644d6 google::(anonymous namespace)::FailureSignalHandler()
[component_container_mt-62] @ 0x7ed6622df520 (unknown)
[component_container_mt-62] @ 0x7ed6623339fc pthread_kill
[component_container_mt-62] @ 0x7ed6622df476 raise
[component_container_mt-62] @ 0x7ed6622c57f3 abort
[component_container_mt-62] @ 0x7ed66258ab9e (unknown)
[component_container_mt-62] @ 0x7ed66259620c (unknown)
[component_container_mt-62] @ 0x7ed662596277 std::terminate()
[component_container_mt-62] @ 0x7ed6625961fe std::rethrow_exception()
[component_container_mt-62] @ 0x7ed66282e519 rclcpp::exceptions::throw_from_rcl_error()
[component_container_mt-62] @ 0x7ed66283231c rclcpp::detail::add_guard_condition_to_rcl_wait_set()
[component_container_mt-62] @ 0x7ed662844e27 (unknown)
[component_container_mt-62] @ 0x7ed662831b05 rclcpp::Executor::wait_for_work()
[component_container_mt-62] @ 0x7ed662834883 rclcpp::Executor::get_next_executable()
[component_container_mt-62] @ 0x7ed66283b7e2 rclcpp::executors::MultiThreadedExecutor::run()
[component_container_mt-62] @ 0x7ed6625c4253 (unknown)
[component_container_mt-62] @ 0x7ed662331ac3 (unknown)
[component_container_mt-62] @ 0x7ed6623c38c0 (unknown)
[ERROR] [component_container_mt-62]: process has died [pid 899, exit code -6, cmd '/opt/ros/humble/lib/rclcpp_components/component_container_mt --ros-args -r __node:=behavior_planning_container -r __ns:=/planning/scenario_planning/lane_driving/behavior_planning -p use_sim_time:=False -p wheel_radius:=0.305 -p wheel_width:=0.185 -p wheel_base:=2.755 -p wheel_tread:=1.469 -p front_overhang:=0.721 -p rear_overhang:=0.666 -p left_overhang:=0.128 -p right_overhang:=0.128 -p vehicle_height:=2.44 -p max_steer_angle:=0.7 -p use_sim_time:=False'].
Steps to reproduce
- Launch the planning_simulator.
- Set the vehicle initial pose.
- Launch the script that publishes planning goals (the planning_goal_abuse_runner tool below).
- Wait until the behavior_path_planner dies. It always happens eventually.
Limiting the Docker container's CPU resources (for example with docker update --cpus) makes the crash easier to reproduce.
The planning_goal_abuse_runner test tool:
#!/usr/bin/env python3
"""
Standalone runner for the planning goal abuse scenario.

Procedure:
1. Wait for behavior_path_planner and mission_planner to come online.
2. Publish the vehicle initial pose once.
3. Continuously publish planning goals at random intervals.
4. Monitor required nodes; exit(1) if any node disappears (crash detected).
"""

import random
import sys
import threading
import time

import rclpy
from geometry_msgs.msg import PoseStamped, PoseWithCovarianceStamped

STARTUP_TIMEOUT = 120.0   # seconds to wait for required nodes after launch
CHECK_INTERVAL = 0.2      # seconds between liveness checks
GOAL_MIN_INTERVAL = 0.25  # minimum random delay between goal publications (s)
GOAL_MAX_INTERVAL = 1.5   # maximum random delay between goal publications (s)

INITIAL_POSE_TOPIC = "/initialpose"
GOAL_TOPIC = "/planning/mission_planning/goal"
MAP_FRAME = "map"

INITIAL_POSE = dict(
    x=219.064,
    y=579.993,
    z=20.311,
    ox=0.0,
    oy=0.0,
    oz=0.838,
    ow=0.545,
)

GOAL_POSE = dict(
    x=133.534,
    y=778.934,
    z=0.0,
    ox=0.0,
    oy=0.0,
    oz=0.831,
    ow=0.556,
)

REQUIRED_NODES = [
    "behavior_path_planner",
    "mission_planner",
]


def _wait_for_nodes(node, node_names, timeout):
    """Block until every name in *node_names* appears in the ROS 2 graph."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        alive = node.get_node_names()
        missing = [n for n in node_names if n not in alive]
        if not missing:
            node.get_logger().info(f"All required nodes online: {node_names}")
            return
        node.get_logger().info(
            f"Waiting for nodes: {missing}",
            throttle_duration_sec=5.0,
        )
        time.sleep(1.0)
    alive = node.get_node_names()
    missing = [n for n in node_names if n not in alive]
    raise TimeoutError(
        f"Required nodes not available after {timeout} s. Missing: {missing}"
    )


def _publish_initial_pose(node, pub):
    msg = PoseWithCovarianceStamped()
    msg.header.stamp = node.get_clock().now().to_msg()
    msg.header.frame_id = MAP_FRAME
    msg.pose.pose.position.x = INITIAL_POSE["x"]
    msg.pose.pose.position.y = INITIAL_POSE["y"]
    msg.pose.pose.position.z = INITIAL_POSE["z"]
    msg.pose.pose.orientation.x = INITIAL_POSE["ox"]
    msg.pose.pose.orientation.y = INITIAL_POSE["oy"]
    msg.pose.pose.orientation.z = INITIAL_POSE["oz"]
    msg.pose.pose.orientation.w = INITIAL_POSE["ow"]
    msg.pose.covariance = [0.0] * 36
    msg.pose.covariance[0] = 0.25      # x variance
    msg.pose.covariance[7] = 0.25      # y variance
    msg.pose.covariance[35] = 0.06854  # yaw variance
    pub.publish(msg)
    node.get_logger().info("Published initial pose.")


def _goal_abuse_loop(node, pub, stop_event):
    """Publish planning goals at random intervals until *stop_event* is set."""
    while not stop_event.is_set():
        msg = PoseStamped()
        msg.header.stamp = node.get_clock().now().to_msg()
        msg.header.frame_id = MAP_FRAME
        msg.pose.position.x = GOAL_POSE["x"]
        msg.pose.position.y = GOAL_POSE["y"]
        msg.pose.position.z = GOAL_POSE["z"]
        msg.pose.orientation.x = GOAL_POSE["ox"]
        msg.pose.orientation.y = GOAL_POSE["oy"]
        msg.pose.orientation.z = GOAL_POSE["oz"]
        msg.pose.orientation.w = GOAL_POSE["ow"]
        interval = random.uniform(GOAL_MIN_INTERVAL, GOAL_MAX_INTERVAL)
        node.get_logger().info(
            f"Publishing planning goal (next in {interval:.2f} s)..."
        )
        pub.publish(msg)
        stop_event.wait(timeout=interval)


def main():
    rclpy.init()
    node = rclpy.create_node("planning_goal_abuse_runner")
    initial_pose_pub = node.create_publisher(
        PoseWithCovarianceStamped, INITIAL_POSE_TOPIC, 1
    )
    goal_pub = node.create_publisher(PoseStamped, GOAL_TOPIC, 10)
    spin_thread = threading.Thread(
        target=rclpy.spin, args=(node,), daemon=True
    )
    spin_thread.start()

    exit_code = 0
    stop_event = threading.Event()
    abuse_thread = None
    try:
        # Step 1 — wait for the planning stack.
        _wait_for_nodes(node, REQUIRED_NODES, timeout=STARTUP_TIMEOUT)

        # Step 2 — publish initial pose once and let the system settle.
        _publish_initial_pose(node, initial_pose_pub)
        time.sleep(1.0)

        # Step 3 — start goal abuse in a background thread.
        abuse_thread = threading.Thread(
            target=_goal_abuse_loop,
            args=(node, goal_pub, stop_event),
            daemon=True,
        )
        abuse_thread.start()

        # Step 4 — monitor required nodes; exit if any disappear.
        node.get_logger().info(f"Monitoring {REQUIRED_NODES} for crashes...")
        monitor_start = time.time()
        while True:
            alive = node.get_node_names()
            missing = [n for n in REQUIRED_NODES if n not in alive]
            if missing:
                uptime = time.time() - monitor_start
                node.get_logger().error(
                    f"Node(s) disappeared (crash detected): {missing} "
                    f"after {uptime:.1f} s of monitoring"
                )
                exit_code = 1
                break
            time.sleep(CHECK_INTERVAL)
    except KeyboardInterrupt:
        node.get_logger().info("Interrupted by user.")
    except Exception as e:
        node.get_logger().error(f"Fatal error: {e}")
        exit_code = 1
    finally:
        stop_event.set()
        if abuse_thread is not None:
            abuse_thread.join(timeout=GOAL_MAX_INTERVAL + 1.0)
        node.destroy_node()
        rclpy.shutdown()
    sys.exit(exit_code)


if __name__ == "__main__":
    main()
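Run the tool in a separate terminal once the planning simulator is up. It publishes the initial pose, then keeps publishing goals at random 0.25–1.5 s intervals, and exits with code 1 as soon as behavior_path_planner or mission_planner disappears from the ROS graph.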
Versions
- Ubuntu 22.04
- ROS 2 Humble
- Autoware version: 0.48, 0.49, 0.50
Possible causes
The crash is caused by the destruction of the lane_parking_timer_cb_group_ and freespace_parking_timer_cb_group_ callback groups. When a new planning goal is published, the GoalPlannerModule instance is deleted together with its callback groups. However, the node's executor still holds weak pointers to these callback groups and may access them at any moment. The callback groups are supposed to stay alive and available for the whole time the node and its executor are running.
The crash happens when the executor tries to lock a timer's callback group through such a weak pointer, fails, and throws an exception. Because the exception is never caught, it takes down the whole executor and, with it, the behavior_path_planner.
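To make the failure mode concrete, here is a minimal, self-contained C++ sketch of this lifetime pattern. It uses plain std::shared_ptr/std::weak_ptr stand-ins rather than real rclcpp types (the Executor and CallbackGroup below are hypothetical), but the sequence is the same: the executor only weakly references the group, the module's destruction invalidates it, and the next wait on the executor throws.

#include <iostream>
#include <memory>
#include <stdexcept>

// Hypothetical stand-ins for the rclcpp types, for illustration only.
struct CallbackGroup {};

struct Executor {
  // The executor tracks callback groups without owning them, so it only
  // holds a weak reference.
  std::weak_ptr<CallbackGroup> group;

  void wait_for_work() {
    // Rebuilding the wait set requires the group to still be alive.
    if (group.lock()) {
      std::cout << "callback group alive, added to wait set\n";
      return;
    }
    // In rclcpp this surfaces as "failed to add guard condition to wait
    // set" and the uncaught RCLError aborts the whole process.
    throw std::runtime_error("failed to add guard condition to wait set");
  }
};

int main() {
  Executor executor;
  {
    // The GoalPlannerModule creates and owns its callback group...
    auto module_group = std::make_shared<CallbackGroup>();
    executor.group = module_group;
    executor.wait_for_work();  // fine while the module is alive
  }  // ...and destroys it when a new goal replaces the module.

  // The next executor iteration finds an expired weak pointer and throws;
  // left uncaught, just like in the real executor, so the process aborts.
  executor.wait_for_work();
  return 0;
}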
Additional context
A callback group in ROS 2 should never be deleted while the node is still running.
There is a quick hack that works around the crash: make freespace_parking_timer_cb_group_ and lane_parking_timer_cb_group_ static class members, so they outlive individual GoalPlannerModule instances. This fixed the issue for me.
Changes in the goal_planner_module.hpp:
static rclcpp::CallbackGroup::SharedPtr lane_parking_timer_cb_group_;
static rclcpp::CallbackGroup::SharedPtr freespace_parking_timer_cb_group_;
Changes in the goal_planner_module.cpp:
rclcpp::CallbackGroup::SharedPtr GoalPlannerModule::lane_parking_timer_cb_group_ = nullptr;
rclcpp::CallbackGroup::SharedPtr GoalPlannerModule::freespace_parking_timer_cb_group_ = nullptr;
// --------------------------------------------------------------------------
// Create each group only once; later GoalPlannerModule instances reuse it,
// so the group is never destroyed while the node is running.
if (!lane_parking_timer_cb_group_) {
  lane_parking_timer_cb_group_ =
    node.create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);
}
// --------------------------------------------------------------------------
if (!freespace_parking_timer_cb_group_) {
  freespace_parking_timer_cb_group_ =
    node.create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive);
}
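Note the trade-off: as static members the two callback groups are shared by every GoalPlannerModule instance in the process and live until it exits, which is exactly what the executor's lifetime requirement demands here. A cleaner long-term fix would presumably be to own the groups at a level that outlives the module (for example, in the planner node itself), but the static members are the smallest change that stops the crash.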