feat(mcap): add robotics telemetry read planning#7074
Conversation
Greptile SummaryThis PR adds
Confidence Score: 3/5Two real defects in the core implementation need attention before merging: the window-degeneration bug and the unannounced behavioral change to topic_start_time_resolver. The window-planning logic produces incorrect output when all messages in a topic share the same log_time — estimated counts are non-zero for windows that return nothing from read_mcap. Separately, the removal of the try/except around topic_start_time_resolver silently changes a public contract, turning graceful degradation into hard failures for existing callers. Both defects are in the changed code path. daft/io/mcap/_mcap.py — specifically the window-count clamping in plan_mcap_reads and the get_tasks error handling change. Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant inspect_mcap
participant plan_mcap_reads
participant _inspect_mcap_rows
participant read_mcap
participant MCAPSource
participant MCAPSourceTask
User->>inspect_mcap: inspect_mcap(path, topics, start_time, end_time)
inspect_mcap->>_inspect_mcap_rows: _inspect_mcap_rows(path, ...)
_inspect_mcap_rows->>_inspect_mcap_rows: list_files then open each MCAP
_inspect_mcap_rows->>_inspect_mcap_rows: iter_messages (no decoders) then accumulate _MCAPTopicStats
_inspect_mcap_rows-->>inspect_mcap: list of dict rows
inspect_mcap-->>User: DataFrame (topic manifest)
User->>plan_mcap_reads: plan_mcap_reads(path, max_messages_per_task)
plan_mcap_reads->>_inspect_mcap_rows: _inspect_mcap_rows(path, ...)
_inspect_mcap_rows-->>plan_mcap_reads: manifest rows
plan_mcap_reads->>plan_mcap_reads: split each topic span into time windows
plan_mcap_reads-->>User: DataFrame (read plan rows)
User->>read_mcap: read_mcap(file_path, topics, start_time, end_time)
read_mcap->>MCAPSource: MCAPSource(include_record_metadata)
MCAPSource->>MCAPSource: topic_start_time_resolver(file_path) if set
MCAPSource->>MCAPSourceTask: yield MCAPSourceTask per file/topic
MCAPSourceTask->>MCAPSourceTask: iter_messages then buffer rows then yield RecordBatch
MCAPSourceTask-->>User: DataFrame (messages)
Reviews (1): Last reviewed commit: "feat(mcap): add robotics telemetry read ..." | Re-trigger Greptile |
| if self._topic_start_time_resolver is not None: | ||
| try: | ||
| keyframes = self._topic_start_time_resolver(file_path) | ||
| except Exception: | ||
| keyframes = None | ||
| keyframes = self._topic_start_time_resolver(file_path) | ||
|
|
||
| if not keyframes: |
There was a problem hiding this comment.
Silent error swallowing removed — behavior-breaking for existing callers
The previous code wrapped self._topic_start_time_resolver(file_path) in a try/except Exception that fell back to keyframes = None on any failure, so a resolver that threw would degrade gracefully to a full-file read. Removing that guard means any exception from the resolver now propagates and aborts the entire read, which is a change in public contract for users who had resolvers that occasionally fail. If this is intentional, the PR title should carry ! per the project's conventional commit convention (e.g., feat!(mcap): ...).
| Args: | ||
| path: MCAP file or directory path. | ||
| io_config: IO configuration to use for remote storage. | ||
| start_time: Optional lower log-time bound. | ||
| end_time: Optional upper log-time bound. | ||
| topics: Optional topics to plan. | ||
| max_messages_per_task: Target maximum messages per planned read window. | ||
|
|
||
| Returns: | ||
| DataFrame: One row per planned file/topic/time window. | ||
| """ | ||
| import math | ||
|
|
There was a problem hiding this comment.
Degenerate windows when all messages share the same log timestamp
When first_log_time == last_log_time (all messages on a single timestamp), end_bound - start_bound == 1 so window_width is always 1 regardless of how many windows window_count requests. Window 0 covers [T, T+1) and contains every message; all subsequent windows cover [T+1, …) and are empty. Meanwhile, estimated_message_count distributes the total evenly across all windows using divmod, so callers get non-zero estimates for windows that actually return nothing from read_mcap. A guard like if window_count > end_bound - start_bound: window_count = end_bound - start_bound (minimum 1) would avoid emitting unreachable windows.
| def _empty_mcap_inspection() -> DataFrame: | ||
| import pyarrow as pa | ||
|
|
||
| from daft.convert import from_arrow | ||
|
|
||
| schema = pa.schema( | ||
| [ | ||
| pa.field("file_path", pa.string()), | ||
| pa.field("topic", pa.string()), | ||
| pa.field("schema_name", pa.string()), | ||
| pa.field("schema_encoding", pa.string()), | ||
| pa.field("message_encoding", pa.string()), | ||
| pa.field("message_count", pa.int64()), | ||
| pa.field("first_log_time", pa.int64()), | ||
| pa.field("last_log_time", pa.int64()), | ||
| pa.field("first_publish_time", pa.int64()), | ||
| pa.field("last_publish_time", pa.int64()), | ||
| pa.field("min_message_size", pa.int64()), | ||
| pa.field("max_message_size", pa.int64()), | ||
| pa.field("total_message_size", pa.int64()), | ||
| pa.field("avg_message_size", pa.float64()), | ||
| ] | ||
| ) | ||
| return from_arrow(pa.Table.from_pylist([], schema=schema)) | ||
|
|
||
|
|
||
| def _inspect_mcap_rows( | ||
| path: str, | ||
| io_config: IOConfig | None = None, | ||
| start_time: int | None = None, | ||
| end_time: int | None = None, | ||
| topics: list[str] | None = None, | ||
| ) -> list[dict[str, object]]: | ||
| import importlib | ||
|
|
||
| make_reader = importlib.import_module("mcap.reader").make_reader |
There was a problem hiding this comment.
Inline imports violate project import convention
import pyarrow as pa, from daft.convert import from_arrow, import importlib, from daft.convert import from_pylist, and import math are all scattered inside the new functions. The project rule requires import statements at the top of the file. math and importlib are standard-library modules with no optional-dependency concern; the stdlib and internal daft imports should move to the module header.
Rule Used: Import statements should be placed at the top of t... (source)
Learned From
Eventual-Inc/Daft#5078
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| | Column | Type | Description | | ||
| |--------|------|-------------| | ||
| | `file_path` | `string` | MCAP file path to read | | ||
| | `topic` | `string` | Topic to read | | ||
| | `window_index` | `int64` | Zero-based window index for this file/topic | | ||
| | `window_count` | `int64` | Number of windows planned for this file/topic | | ||
| | `start_time` | `int64` | Inclusive log-time lower bound | | ||
| | `end_time` | `int64` | Exclusive log-time upper bound | | ||
| | `estimated_message_count` | `int64` | Approximate number of messages in this window | | ||
| | `estimated_payload_bytes` | `int64` | Approximate raw payload bytes in this window | |
There was a problem hiding this comment.
The
plan_mcap_reads output table is missing three columns that are actually present in every output row: schema_name, schema_encoding, and message_encoding. These come directly from the _empty_mcap_read_plan schema and the plan_rows dict.
| | Column | Type | Description | | |
| |--------|------|-------------| | |
| | `file_path` | `string` | MCAP file path to read | | |
| | `topic` | `string` | Topic to read | | |
| | `window_index` | `int64` | Zero-based window index for this file/topic | | |
| | `window_count` | `int64` | Number of windows planned for this file/topic | | |
| | `start_time` | `int64` | Inclusive log-time lower bound | | |
| | `end_time` | `int64` | Exclusive log-time upper bound | | |
| | `estimated_message_count` | `int64` | Approximate number of messages in this window | | |
| | `estimated_payload_bytes` | `int64` | Approximate raw payload bytes in this window | | |
| | Column | Type | Description | | |
| |--------|------|-------------| | |
| | `file_path` | `string` | MCAP file path to read | | |
| | `topic` | `string` | Topic to read | | |
| | `schema_name` | `string` | MCAP schema name, such as a ROS2 message type | | |
| | `schema_encoding` | `string` | Schema encoding stored in the MCAP file | | |
| | `message_encoding` | `string` | Channel message encoding | | |
| | `window_index` | `int64` | Zero-based window index for this file/topic | | |
| | `window_count` | `int64` | Number of windows planned for this file/topic | | |
| | `start_time` | `int64` | Inclusive log-time lower bound | | |
| | `end_time` | `int64` | Exclusive log-time upper bound | | |
| | `estimated_message_count` | `int64` | Approximate number of messages in this window | | |
| | `estimated_payload_bytes` | `int64` | Approximate raw payload bytes in this window | |
|
Looks like the remaining red checks are from external Hugging Face rate limits rather than this MCAP change. Both native/ray IO jobs are failing on HTTP 429 for Eventual-Inc/sample-parquet / stanfordnlp/imdb, while style and the focused MCAP tests are green locally. I don’t have permission to rerun the failed jobs from the fork. Could someone rerun those IO jobs when convenient? |
|
Quick follow-up on the Greptile review: the two core issues it flagged are covered in the latest PR head. What changed after the reviewed commit c48d270:
The resolver behavior is also back to graceful fallback: topic_start_time_resolver exceptions are caught and the file falls back to the unscoped read path. There is regression coverage in test_mcap_topic_start_time_resolver_errors_fall_back_to_unscoped_reads. Relevant tests now in the PR:
Docs also include the plan output metadata columns schema_name, schema_encoding, and message_encoding in the planning table. I can’t run the focused Daft pytest in this local checkout because the compiled extension cannot be imported here, but git diff --check is clean and the PR branch includes the regression coverage above. A Greptile/CI re-review should be looking at cd4abd7 rather than the earlier c48d270 commit. |
cd4abd7 to
23f5fb6
Compare
Summary
Adds a lightweight planning layer for large MCAP robotics logs, so users can inspect topic/time coverage and plan sharded or resumable reads before decoding payloads.
This adds:
daft.inspect_mcap(...)for topic manifests with schema/message metadata, time bounds, message counts, and payload-size estimatesdaft.plan_mcap_reads(...)for per-topic time windows sized bymax_messages_per_taskread_mcap(..., include_record_metadata=True)columns for schema/message encoding and payload sizeread_mcapWhy
Large robotics logs often mix high-rate IMU data, bursty camera frames, and LiDAR topics in one MCAP file. For distributed processing, replay, or debugging, it helps to cheaply answer: what topics exist, what time ranges do they cover, and how should the work be split by topic/time before reading payloads?
inspect_mcapgives that manifest without decoding messages.plan_mcap_readsturns it into concrete read windows that can be resumed or scheduled independently.Testing
python3 -m compileall daft/io/mcap/_mcap.py daft/io/__init__.py daft/__init__.py tests/io/mcap/test_mcap.pypython3 -m ruff check daft/io/mcap/_mcap.py daft/io/__init__.py tests/io/mcap/test_mcap.pypython3 -m ruff format daft/io/mcap/_mcap.py daft/io/__init__.py daft/__init__.py tests/io/mcap/test_mcap.py --checkgit diff --checkDAFT_RUNNER=native .venv/bin/python -m pytest tests/io/mcap/test_mcap.py -qFocused pytest result:
13 passed, 1 skipped.