Conversation
Pull request overview
Adds a gRPC-based client/server path for running RTC-enabled policy inference remotely (GPU server) while keeping robot control on lightweight hardware, plus several RTC/PI0.5 compile & action-queue fixes and dependency updates.
Changes:
- Introduces remote RTC protocol dataclasses + profiling utilities for per-request timing artifacts.
- Adds remote RTC example server/client/dataset evaluator scripts using the new protocol.
- Fixes RTC action queue delay handling and PI0.5 RTC/torch.compile compatibility; updates torch/torchvision/torchcodec + uv CUDA index config.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/lerobot/policies/rtc/remote.py | Defines pickled dataclasses used as the remote RTC request/response protocol. |
| src/lerobot/policies/rtc/profiling.py | Adds profiling record storage plus parquet export and matplotlib plotting. |
| src/lerobot/policies/rtc/modeling_rtc.py | Makes RTC guidance more torch.compile-friendly and adjusts device/dtype handling. |
| src/lerobot/policies/rtc/action_queue.py | Adds clear() and changes merge() to return the applied delay; fixes delay selection logic. |
| src/lerobot/policies/pi05/modeling_pi05.py | Normalizes RTC inputs pre-compile boundary and makes action queue thread-local. |
| pyproject.toml | Adjusts torch/torchvision/torchcodec/transformers bounds and adds uv cu128 index configuration. |
| examples/rtc/eval_with_real_robot.py | Adds compile warmup + compile caching toggles; refactors preprocessing in the RTC demo. |
| examples/remote_rtc/rtc_policy_server.py | New RTC inference server (gRPC) supporting RTC parameters + optional torch.compile. |
| examples/remote_rtc/eval_with_real_robot.py | New robot-side remote client with action queue management and optional profiling. |
| examples/remote_rtc/eval_dataset.py | New dataset-based remote RTC evaluator + plotting/profiling hooks. |
```python
client_id = context.peer()
policy_specs = pickle.loads(request.data)  # nosec

if not isinstance(policy_specs, RTCRemotePolicyConfig):
    raise TypeError(f"Expected RTCRemotePolicyConfig, got {type(policy_specs)}")
```
This server deserializes client-provided bytes with pickle.loads(). Pickle is unsafe against untrusted input (remote code execution), so running this on an open network is risky. If this is intended beyond trusted LAN/dev usage, consider switching the request/response payloads to protobuf (or another non-executable format) and/or adding explicit authentication + message validation before deserialization.
This is for trusted servers only
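The risk the review describes can be made concrete in a few lines. This is a minimal sketch; `Exploit` is a hypothetical class, and `eval` stands in for any attacker-chosen callable (`os.system` would work just as well):

```python
import pickle

# Hypothetical malicious payload: unpickling invokes an arbitrary callable
# chosen by the sender via __reduce__ -- nothing in pickle prevents this.
class Exploit:
    def __reduce__(self):
        return (eval, ("6 * 7",))

payload = pickle.dumps(Exploit())
# A server calling pickle.loads(request.data) on this payload runs eval():
obj = pickle.loads(payload)  # nosec: demonstration only
# The server never sees an Exploit instance, only the callable's result (42).
```

This is why the trusted-LAN restriction matters: any client that can reach the port can run code on the server.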
```python
self.device = policy_specs.device
self.policy_type = policy_specs.policy_type
self.lerobot_features = policy_specs.lerobot_features

# Load policy
self._unload_policy(reason="replacing_existing_policy")
```
In SendPolicyInstructions, self.device/self.policy_type/self.lerobot_features are set before calling _unload_policy(), but _unload_policy() clears these attributes when a policy is already loaded. This will make get_policy_class(self.policy_type) receive None (and self.policy.to(self.device) use None) on subsequent client connections / reconfiguration. Set new values after unloading, or keep them in locals and only assign to self.* once the previous policy has been unloaded.
Suggested change:

```diff
-self.device = policy_specs.device
-self.policy_type = policy_specs.policy_type
-self.lerobot_features = policy_specs.lerobot_features
 # Load policy
 self._unload_policy(reason="replacing_existing_policy")
+# Set new configuration after unloading any existing policy to avoid it being cleared.
+self.device = policy_specs.device
+self.policy_type = policy_specs.policy_type
+self.lerobot_features = policy_specs.lerobot_features
```
```python
except Exception as e:
    logger.error(f"[GET_ACTIONS] Fatal error: {e}")
    traceback.print_exc()
    sys.exit(1)
```
sys.exit(1) inside a worker thread only terminates that thread (raises SystemExit there) and typically does not stop the whole process. If this is intended to be a fatal error, signal the main loop via self.shutdown_event/ProcessSignalHandler, and let the main thread exit (or use os._exit(1) as a last resort).
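A minimal sketch of both halves of this point: `sys.exit()` in a worker thread raises `SystemExit` only in that thread (the `threading` machinery silently swallows it), while the event-based pattern the comment recommends lets the main thread observe the failure and exit deterministically. The worker names here are illustrative, not the PR's code:

```python
import sys
import threading

def bad_worker():
    sys.exit(1)  # raises SystemExit in this thread only; process survives

t = threading.Thread(target=bad_worker)
t.start()
t.join()
# Execution reaches here: the process was NOT terminated by the worker.

# Recommended pattern: signal the main loop and let it clean up and exit.
shutdown_event = threading.Event()

def worker():
    try:
        raise RuntimeError("fatal error in actor loop")
    except Exception:
        shutdown_event.set()  # main loop polls this and decides the exit code

w = threading.Thread(target=worker)
w.start()
w.join()
```

After `w.join()`, the main thread can check `shutdown_event.is_set()` and perform an orderly shutdown instead of being left in a half-dead state.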
```python
except Exception as e:
    logger.error(f"[ACTOR] Fatal error: {e}")
    traceback.print_exc()
    sys.exit(1)
```
Same issue here: sys.exit(1) in a non-main thread won't necessarily stop the program, so fatal errors in the actor thread may leave the process running in a bad state. Prefer setting a shutdown flag / raising to the main thread so cleanup and termination happen deterministically.
```python
inference_delay: int | None = None,
prev_chunk_left_over: Tensor | None = None,
execution_horizon: int | None = None,
```
PI05Policy.predict_action_chunk converts inference_delay / execution_horizon to torch.Tensor for torch.compile stability, but PI05Pytorch.sample_actions still types these parameters as int | None. Update the annotations (e.g., int | Tensor | None) to reflect actual supported inputs and avoid type-checking / reader confusion.
Suggested change:

```diff
-inference_delay: int | None = None,
+inference_delay: int | Tensor | None = None,
 prev_chunk_left_over: Tensor | None = None,
-execution_horizon: int | None = None,
+execution_horizon: int | Tensor | None = None,
```
feat(rtc): remote inference system + action queue delay fix
Type / Scope
Summary / Motivation
Adds a client-server architecture for running RTC policy inference on a remote GPU server over gRPC while the robot client runs on lightweight hardware. Also fixes several bugs in PI0.5 RTC mode and torch.compile CUDA graph compatibility.
What changed
examples/remote_rtc/— gRPC server (rtc_policy_server.py), robot client (eval_with_real_robot.py), dataset evaluator (eval_dataset.py)src/lerobot/policies/rtc/remote.py— Shared data classes for client-server protocol (RTCObservationData, RTCActionData, RTCTimingData)src/lerobot/policies/rtc/profiling.py— Per-request profiling with parquet export and matplotlib plotssrc/lerobot/policies/rtc/action_queue.py— When no actions consumed during inference, return delay=0 instead of latency estimate (fixes first-movement jerk where 210/300 actions were skipped); merge() now returns int instead of Nonesrc/lerobot/policies/pi05/modeling_pi05.py— Create inference_delay/execution_horizon tensors on model device (fixes CUDA graph warnings); normalize RTC inputs before torch.compile boundary to prevent recompilationsrc/lerobot/policies/rtc/modeling_rtc.py— Rewrite guidance as pure tensor ops (no autograd, no .item()), compile-friendly get_prefix_weights, proper device/dtype propagationpyproject.toml— RTX 5090 cu128 support, version caps for torch/torchcodec compatibility, uv resolver conflict reduction (19→4)How was this tested