-
Notifications
You must be signed in to change notification settings - Fork 287
Description
Description
We occasionally have an issue where a rosbag2 Recorder node segfaults and dies when stop() is called. This ends up truncating the .mcap file being written, resulting in it being corrupted because it lacks trailing magic bytes.
Expected Behavior
No segfault, the bag is closed and the .mcap has its trailing bytes written.
Actual Behavior
The process dies with code -11. Logs:
[INFO] [1730717198.622841633] [main_bag_rosbag_recorder]: Listening for topics... [rosbag_run_recorder_node-9]
[INFO] [1730717198.622848700] [main_bag_rosbag_recorder]: Event publisher thread: Starting [rosbag_run_recorder_node-9]
[INFO] [1730717198.823090723] [main_bag_rosbag_recorder]: Subscribed to topic '/visualization/machine_interface/target_terrain' [rosbag_run_recorder_node-9]
....
[INFO] [1730717199.253250365] [main_bag_rosbag_recorder]: Overriding subscription profile for /rosout [rosbag_run_recorder_node-9]
[INFO] [1730717199.258032766] [main_bag_rosbag_recorder]: Subscribed to topic '/rosout' [rosbag_run_recorder_node-9]
[INFO] [1730717199.268296194] [main_bag_rosbag_recorder]: Recording... [rosbag_run_recorder_node-9]
...
Error: [1730717288.011213136] [main_bag_rosbag_recorder]: discovery_future_.wait_for(100) return status: timeout [rosbag_run_recorder_node-9]
Error: [rosbag_run_recorder_node-9]: process has died [pid 111, exit code -11, cmd '/home/dev_ws/install/rosbag_run_recorder/lib/rosbag_run_recorder/rosbag_run_recorder_node --ros-args --log-level INFO --ros-args'].
Logging from other nodes and many "subscribed to topic" messages have been excluded.
To Reproduce
We have not been able to consistently reproduce, and this error occurs only once out of every 50 or so calls to stop(). Contributing factors seem to be heavy CPU load and larger bags, and it does seem to be somewhat device-dependent. Unfortunately I cannot share the full bag, but its mcap summary is
library: mcap go v1.4.1; libmcap 0.8.0
profile: ros2
messages: 743886
duration: 1m28.990235383s
start: 2024-11-04T11:46:38.837867611+01:00 (1730717198.837867611)
end: 2024-11-04T11:48:07.828102994+01:00 (1730717287.828102994)
compression:
zstd: [36/36 chunks] [143.28 MiB/26.77 MiB (81.31%)] [308.08 KiB/sec]
and it was recording on 59 topics.
We construct the node roughly as follows:
rosbag2_storage::StorageOptions storage_options = {
bagPath.string(), "mcap", 0, 0, 0,
"zstd_fast",
"", false};
rosbag2_transport::RecordOptions record_options;
record_options.topics = /* our own list of ~100 topics */;
record_options.topic_qos_profile_overrides = {{"/rosout", rclcpp::RosoutQoS()}};
record_options.rmw_serialization_format = std::string(rmw_get_serialization_format());
auto writer = rosbag2_transport::ReaderWriterFactory::make_writer(record_options);
return std::make_shared<rosbag2_transport::Recorder>(std::move(writer), std::make_shared<KeyboardHandler>(false), storage_options, record_options,"main_bag_rosbag_recorder");and execute it with a MultiThreadedExecutor constructed using the default arguments and 4 threads (the executor is also running 2 other Recorder nodes but they record far less data and we haven't had any issues with them).
System (please complete the following information)
- OS: Ubuntu 20.04, specifically from nvidia/cuda:11.4.3-base-ubuntu20.04
- ROS 2 Distro: Humble
- Install Method: source
- Version:
0.15.12(as part of Humble release 2024-07-29)
Additional context
This problem had also occurred when running via ros2 bag record, but there it manifested as the recorder apparently shutting down in response to SIGINT, but for some reason not closing the bag properly.