
Remote rtc #3125

Open

grach0v wants to merge 7 commits into huggingface:main from grach0v:remote_rtc

Conversation

@grach0v

@grach0v grach0v commented Mar 10, 2026

feat(rtc): remote inference system + action queue delay fix

Type / Scope

  • Type: Feature + Bug
  • Scope: policies/rtc, policies/pi05, examples/remote_rtc

Summary / Motivation

Adds a client-server architecture for running RTC policy inference on a remote GPU server over gRPC while the robot client runs on lightweight hardware. Also fixes several bugs: pi05 RTC mode issues and torch.compile CUDA graph incompatibilities.

What changed

  • New: examples/remote_rtc/ — gRPC server (rtc_policy_server.py), robot client (eval_with_real_robot.py), dataset evaluator (eval_dataset.py)
  • New: src/lerobot/policies/rtc/remote.py — Shared data classes for client-server protocol (RTCObservationData, RTCActionData, RTCTimingData)
  • New: src/lerobot/policies/rtc/profiling.py — Per-request profiling with parquet export and matplotlib plots
  • Fix: src/lerobot/policies/rtc/action_queue.py — When no actions consumed during inference, return delay=0 instead of latency estimate (fixes first-movement jerk where 210/300 actions were skipped); merge() now returns int instead of None
  • Fix: src/lerobot/policies/pi05/modeling_pi05.py — Create inference_delay/execution_horizon tensors on model device (fixes CUDA graph warnings); normalize RTC inputs before torch.compile boundary to prevent recompilation
  • Fix: src/lerobot/policies/rtc/modeling_rtc.py — Rewrite guidance as pure tensor ops (no autograd, no .item()), compile-friendly get_prefix_weights, proper device/dtype propagation
  • Fix: pyproject.toml — RTX 5090 cu128 support, version caps for torch/torchcodec compatibility, uv resolver conflict reduction (19→4)
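The action-queue delay fix above can be illustrated with a minimal sketch. Names here (ActionQueue, merge, latency_estimate_steps) are illustrative stand-ins, not the actual lerobot API:

```python
# Hypothetical sketch of the action_queue.py delay fix: when no actions were
# consumed during inference (e.g. the very first chunk at startup), the merge
# delay should be 0 rather than the latency estimate, so no actions are skipped.
from collections import deque


class ActionQueue:
    def __init__(self, actions):
        self._queue = deque(actions)
        self._consumed_during_inference = 0

    def pop(self):
        # The robot control loop consumes one action per tick.
        self._consumed_during_inference += 1
        return self._queue.popleft()

    def merge(self, new_actions, latency_estimate_steps: int) -> int:
        # Before the fix: the latency estimate was always applied as the delay,
        # so the first chunk dropped actions even though none had been consumed
        # yet (the 210/300 skipped-actions jerk described above).
        if self._consumed_during_inference == 0:
            delay = 0
        else:
            delay = latency_estimate_steps
        self._queue = deque(list(new_actions)[delay:])
        self._consumed_during_inference = 0
        return delay  # now always an int, never None


q = ActionQueue(range(5))
first_delay = q.merge(list(range(300)), latency_estimate_steps=210)
print(first_delay, len(q._queue))  # 0 300 — no actions skipped at startup
```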

How was this tested

  • Real robot evaluation: Pi0.5 on Trossen Mobile (dual arm), 30fps, 40s
  • Smooth first movement confirmed
  • Profiling: 84 requests, server inference P50=184ms, client E2E P50=322ms, steady queue=35
  • torch.compile reduce-overhead mode: 309ms first inference → 184ms steady state
# Start server on GPU machine
python examples/remote_rtc/rtc_policy_server.py --host=0.0.0.0 --port=8080

# Run robot client
python examples/remote_rtc/eval_with_real_robot.py \
    --server_address=<GPU_IP>:8080 \
    --policy_type=pi05 \
    --pretrained_name_or_path=<checkpoint_path> \
    --robot.type=<robot_type> \
    --task="Your task" \
    --rtc.enabled=true \
    --rtc.execution_horizon=20 \
    --use_torch_compile=true \
    --enable_profiling=true

Copilot AI review requested due to automatic review settings March 10, 2026 17:46
@github-actions github-actions bot added policies Items related to robot policies examples Issues related to the examples labels Mar 10, 2026

Copilot AI left a comment


Pull request overview

Adds a gRPC-based client/server path for running RTC-enabled policy inference remotely (GPU server) while keeping robot control on lightweight hardware, plus several RTC/PI0.5 compile & action-queue fixes and dependency updates.

Changes:

  • Introduces remote RTC protocol dataclasses + profiling utilities for per-request timing artifacts.
  • Adds remote RTC example server/client/dataset evaluator scripts using the new protocol.
  • Fixes RTC action queue delay handling and PI0.5 RTC/torch.compile compatibility; updates torch/torchvision/torchcodec + uv CUDA index config.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/lerobot/policies/rtc/remote.py Defines pickled dataclasses used as the remote RTC request/response protocol.
src/lerobot/policies/rtc/profiling.py Adds profiling record storage plus parquet export and matplotlib plotting.
src/lerobot/policies/rtc/modeling_rtc.py Makes RTC guidance more torch.compile-friendly and adjusts device/dtype handling.
src/lerobot/policies/rtc/action_queue.py Adds clear() and changes merge() to return the applied delay; fixes delay selection logic.
src/lerobot/policies/pi05/modeling_pi05.py Normalizes RTC inputs pre-compile boundary and makes action queue thread-local.
pyproject.toml Adjusts torch/torchvision/torchcodec/transformers bounds and adds uv cu128 index configuration.
examples/rtc/eval_with_real_robot.py Adds compile warmup + compile caching toggles; refactors preprocessing in the RTC demo.
examples/remote_rtc/rtc_policy_server.py New RTC inference server (gRPC) supporting RTC parameters + optional torch.compile.
examples/remote_rtc/eval_with_real_robot.py New robot-side remote client with action queue management and optional profiling.
examples/remote_rtc/eval_dataset.py New dataset-based remote RTC evaluator + plotting/profiling hooks.

Comment on lines +216 to +221
client_id = context.peer()
policy_specs = pickle.loads(request.data)  # nosec

if not isinstance(policy_specs, RTCRemotePolicyConfig):
    raise TypeError(f"Expected RTCRemotePolicyConfig, got {type(policy_specs)}")


Copilot AI Mar 10, 2026


This server deserializes client-provided bytes with pickle.loads(). Pickle is unsafe against untrusted input (remote code execution), so running this on an open network is risky. If this is intended beyond trusted LAN/dev usage, consider switching the request/response payloads to protobuf (or another non-executable format) and/or adding explicit authentication + message validation before deserialization.
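One way to harden this boundary without leaving pickle entirely is a restricted unpickler, as a sketch. The module path and RTCRemotePolicyConfig allow-list entry below are assumptions about where the protocol types live:

```python
# Hedged sketch: restrict pickle deserialization to an allow-list of protocol
# types, refusing the globals that pickle RCE gadgets rely on (os.system,
# subprocess.Popen, ...). Not the server's actual implementation.
import io
import pickle

# Assumed module path for the protocol dataclass; adjust to the real one.
ALLOWED = {
    ("lerobot.policies.rtc.remote", "RTCRemotePolicyConfig"),
}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the payload tries to resolve; anything
        # outside the expected protocol types is rejected.
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"forbidden global: {module}.{name}")
        return super().find_class(module, name)


def safe_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers and primitives still round-trip (they never hit find_class), while a payload smuggling a callable like os.system is rejected at load time. Authentication and protobuf payloads, as the review suggests, remain the stronger fix.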

@grach0v (Author)

This is for trusted servers only.

Comment on lines +235 to +240
self.device = policy_specs.device
self.policy_type = policy_specs.policy_type
self.lerobot_features = policy_specs.lerobot_features

# Load policy
self._unload_policy(reason="replacing_existing_policy")

Copilot AI Mar 10, 2026


In SendPolicyInstructions, self.device/self.policy_type/self.lerobot_features are set before calling _unload_policy(), but _unload_policy() clears these attributes when a policy is already loaded. This will make get_policy_class(self.policy_type) receive None (and self.policy.to(self.device) use None) on subsequent client connections / reconfiguration. Set new values after unloading, or keep them in locals and only assign to self.* once the previous policy has been unloaded.

Suggested change

-self.device = policy_specs.device
-self.policy_type = policy_specs.policy_type
-self.lerobot_features = policy_specs.lerobot_features
-
-# Load policy
-self._unload_policy(reason="replacing_existing_policy")
+# Load policy
+self._unload_policy(reason="replacing_existing_policy")
+
+# Set new configuration after unloading any existing policy to avoid it being cleared.
+self.device = policy_specs.device
+self.policy_type = policy_specs.policy_type
+self.lerobot_features = policy_specs.lerobot_features

Comment on lines +535 to +538
except Exception as e:
    logger.error(f"[GET_ACTIONS] Fatal error: {e}")
    traceback.print_exc()
    sys.exit(1)

Copilot AI Mar 10, 2026


sys.exit(1) inside a worker thread only terminates that thread (raises SystemExit there) and typically does not stop the whole process. If this is intended to be a fatal error, signal the main loop via self.shutdown_event/ProcessSignalHandler, and let the main thread exit (or use os._exit(1) as a last resort).
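The shutdown-event pattern the review suggests can be sketched as follows. The names (worker, shutdown_event, exit_code) are illustrative, not the PR's actual symbols:

```python
# Sketch: instead of sys.exit(1) inside a worker thread (which only raises
# SystemExit in that thread and leaves the process running), the worker
# records an exit code and signals the main loop, which exits deterministically.
import threading

shutdown_event = threading.Event()
exit_code = 0


def worker():
    global exit_code
    try:
        raise RuntimeError("fatal error in actor thread")
    except Exception:
        exit_code = 1
        shutdown_event.set()  # signal the main thread instead of sys.exit(1)


t = threading.Thread(target=worker, daemon=True)
t.start()

# Main loop: block until some thread requests shutdown (or a timeout elapses),
# then run cleanup and exit from the main thread with the recorded code.
shutdown_event.wait(timeout=5.0)
t.join()
print("main thread exiting with code", exit_code)
```

The main thread would then call sys.exit(exit_code) after cleanup, so teardown of gRPC channels and robot connections happens in one place.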

Comment on lines +567 to +570
except Exception as e:
    logger.error(f"[ACTOR] Fatal error: {e}")
    traceback.print_exc()
    sys.exit(1)

Copilot AI Mar 10, 2026


Same issue here: sys.exit(1) in a non-main thread won't necessarily stop the program, so fatal errors in the actor thread may leave the process running in a bad state. Prefer setting a shutdown flag / raising to the main thread so cleanup and termination happen deterministically.

Comment on lines +789 to +791
inference_delay: int | None = None,
prev_chunk_left_over: Tensor | None = None,
execution_horizon: int | None = None,

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PI05Policy.predict_action_chunk converts inference_delay / execution_horizon to torch.Tensor for torch.compile stability, but PI05Pytorch.sample_actions still types these parameters as int | None. Update the annotations (e.g., int | Tensor | None) to reflect actual supported inputs and avoid type-checking / reader confusion.

Suggested change

-inference_delay: int | None = None,
+inference_delay: int | Tensor | None = None,
 prev_chunk_left_over: Tensor | None = None,
-execution_horizon: int | None = None,
+execution_horizon: int | Tensor | None = None,
