feat(datasets): add dataset diagnostics CLIs (calibration drift + episode quality)#3762
Open
maeste wants to merge 2 commits into
…er drift

Adds a post-hoc analysis tool that compares action (leader position) with observation.state (follower position) on stable frames of a recorded teleoperation dataset. A systematic non-zero mean delta on a joint reveals a calibration offset between the two arms, which is invisible during training (the loss stays low) but breaks deployment.

- src/lerobot/scripts/lerobot_check_calibration.py: core analysis + CLI
- pyproject.toml: lerobot-check-calibration entry point
- tests/scripts/test_check_calibration.py: unit tests for delta extraction, verdict classification, episode grouping and the end-to-end report
- docs/source/using_dataset_tools.mdx: user-facing documentation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adds a read-only analysis tool that computes deterministic per-episode quality metrics from recorded actions (duration, median/p95 jerk, peak velocity, static fraction, end-pose consistency) and flags statistical outliers via the IQR rule. Complements lerobot-dataset-viz for datasets too large to review episode by episode, and feeds candidate episodes to lerobot-edit-dataset --operation.type=delete_episodes.

- src/lerobot/scripts/lerobot_dataset_quality.py: core metrics + CLI
- pyproject.toml: lerobot-dataset-quality entry point
- tests/scripts/test_dataset_quality.py: unit tests for metrics, outlier detection, episode grouping and the end-to-end report
- docs/source/using_dataset_tools.mdx: user-facing documentation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jun 10, 2026
Closed
Summary / Motivation
This PR adds two complementary, read-only dataset diagnostic CLIs that audit recorded teleoperation datasets before GPU hours are spent training on them:
- `lerobot-check-calibration`: detects leader/follower calibration drift. In teleoperation, `action` is the leader position and `observation.state` is the follower position; a calibration mismatch between the arms bakes a systematic offset into every frame. This is invisible during training (the policy learns the biased mapping, loss looks healthy) but breaks deployment, since inference feeds follower observations to a policy trained on leader-shifted targets. The tool compares the two signals on "stable" frames (robot not moving, so the follower controller has converged) and reports per-motor mean/std with a verdict. Real-world motivation: on an SO-101 setup, a ~17° offset on one joint (≈6 cm Cartesian error at the gripper) went completely unnoticed through recording and training; this tool finds it in seconds.
- `lerobot-dataset-quality`: flags outlier episodes. Once a dataset grows past a few dozen episodes, reviewing each one in `lerobot-dataset-viz` is impractical. The tool computes deterministic per-episode metrics from the actions (duration, median/p95 jerk, peak velocity, static fraction, end-pose consistency), flags statistical outliers via the IQR rule, and prints a ranked worst-episodes list with a ready-to-edit `lerobot-edit-dataset --operation.type delete_episodes` command, turning "watch 200 episodes" into "watch these 8".

Both tools load any dataset via the standard `LeRobotDataset` API, read only the action/state columns (no image/video decoding, runs in seconds), require no new dependencies, and never modify the dataset.

Submitted as one PR since they share the same shape (read-only diagnostics following the `lerobot-dataset-viz` CLI conventions) and documentation page; supersedes #3759 and #3761.

Related issues
What changed
- `src/lerobot/scripts/lerobot_check_calibration.py`: calibration drift analysis. Core logic in pure, importable functions (`compute_episode_deltas`, `summarize_calibration`, `check_calibration`) with a thin argparse `main()`. Flags: `--vel-threshold` (stability cutoff), `--ok-threshold` / `--warn-threshold` (verdict boundaries, configurable since action units are dataset-dependent), `--arm-length-cm` (optional Cartesian impact estimate), `--output-format table|json`.
- `src/lerobot/scripts/lerobot_dataset_quality.py`: per-episode quality metrics + IQR outlier flagging (`compute_episode_metrics`, `detect_outliers`, `evaluate_dataset_quality`). Flags: `--k-iqr` (outlier strictness), `--top-bad`, `--output-format table|json`.
- `pyproject.toml`: two new entry points, `lerobot-check-calibration` and `lerobot-dataset-quality`.
- `tests/scripts/test_check_calibration.py` + `tests/scripts/test_dataset_quality.py`: 22 unit tests on synthetic trajectories with known properties (known offsets, spikes, holds, divergent end poses), covering verdicts, outlier detection, episode grouping/sorting, and end-to-end reports including error paths.
- `docs/source/using_dataset_tools.mdx`: user-facing documentation sections for both tools.

No breaking changes; purely additive and read-only.
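For readers who want a feel for the stable-frame comparison before opening the diff, here is a minimal NumPy sketch of the idea. It is not the code in this PR: the function names (`stable_frame_deltas`, `classify`, `cartesian_error_cm`), the speed definition, the `"warn"` and `"bad"` labels, the arc-length formula, and the 20 cm lever arm are all assumptions made for illustration. Only the `ok` verdict and the roles of the thresholds come from the description above.

```python
import numpy as np


def stable_frame_deltas(action, state, vel_threshold):
    """Return action - state on frames where the leader barely moved.

    Hypothetical helper: 'speed' here is the largest per-joint change
    between consecutive frames, which is an assumption of this sketch.
    """
    action = np.asarray(action, dtype=float)
    state = np.asarray(state, dtype=float)
    speed = np.abs(np.diff(action, axis=0)).max(axis=1)
    # The first frame has no predecessor, so it is never counted as stable.
    stable = np.concatenate([[False], speed < vel_threshold])
    return (action - state)[stable]


def classify(mean_delta, ok_threshold, warn_threshold):
    """Map a per-motor mean delta to a verdict (labels other than 'ok' are assumed)."""
    magnitude = abs(mean_delta)
    if magnitude < ok_threshold:
        return "ok"
    if magnitude < warn_threshold:
        return "warn"
    return "bad"


def cartesian_error_cm(offset_deg, arm_length_cm):
    """Arc-length estimate of the end-effector error; assumes units are degrees."""
    return arm_length_cm * np.radians(offset_deg)


# Synthetic episode: two joints ramp for 20 frames, then hold for 30 frames.
joint = np.concatenate([np.linspace(0.0, 30.0, 20), np.full(30, 30.0)])
action = np.stack([joint, joint], axis=1)
# The follower of the second joint reads 5 units low on every frame.
state = action - np.array([0.0, 5.0])

deltas = stable_frame_deltas(action, state, vel_threshold=0.5)
mean_delta = deltas.mean(axis=0)
std_delta = deltas.std(axis=0)
for motor, (mean, std) in enumerate(zip(mean_delta, std_delta)):
    print(motor, mean, std, classify(mean, ok_threshold=1.0, warn_threshold=3.0))
```

With an assumed 20 cm lever arm, `cartesian_error_cm(17.0, 20.0)` gives roughly 5.9 cm, which is consistent with the ≈6 cm figure quoted in the summary.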
How was this tested (or how to run locally)
- `uv run pytest tests/scripts/test_check_calibration.py tests/scripts/test_dataset_quality.py -v` (22 passed)
- `pre-commit run` passes on all touched files (ruff, mypy, bandit, typos, prettier).
- […] `shoulder_lift`) and reports `ok` on a freshly recalibrated one; `lerobot-dataset-quality --repo-id lerobot/pusht` (any dataset with an `action` feature), `lerobot-check-calibration --repo-id <any teleop dataset with action + observation.state>`.

Checklist (required before merge)
- […] (`pre-commit run -a`)
- […] (`pytest`)

Reviewer notes
- […] `action - state` is dominated by controller lag, so only converged frames are informative (`compute_episode_deltas`).
- `median_jerk` vs `p95_jerk` is deliberate: the median is robust to isolated corrective spikes, the p95 catches exactly those spikes.

🤖 Generated with Claude Code
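To make the median/p95 jerk note and the IQR rule concrete, the following is a rough NumPy sketch under stated assumptions. It does not reproduce `compute_episode_metrics` or `detect_outliers` from this PR: `jerk_stats` and `iqr_outliers` are hypothetical names, jerk is approximated here by a third finite difference of the actions, and `k=1.5` is the conventional IQR multiplier rather than a confirmed default of `--k-iqr`.

```python
import numpy as np


def jerk_stats(actions, fps):
    """Return (median, p95) of the jerk magnitude for one episode.

    Assumption of this sketch: jerk is the third finite difference of the
    action trajectory, scaled by fps**3.
    """
    actions = np.asarray(actions, dtype=float)
    jerk = np.diff(actions, n=3, axis=0) * fps**3
    magnitude = np.linalg.norm(jerk, axis=1)
    return float(np.median(magnitude)), float(np.percentile(magnitude, 95))


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# A smooth one-joint ramp with five isolated corrective spikes.
actions = np.arange(200.0).reshape(-1, 1)
actions[[20, 60, 100, 140, 180], 0] += 1.0
median_jerk, p95_jerk = jerk_stats(actions, fps=30)
print(median_jerk, p95_jerk)

# Per-episode durations in seconds; the last episode is much longer.
durations = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 25.0]
flags = iqr_outliers(durations)
print(flags)
```

In this synthetic episode the median stays at zero because most frames are smooth, while the p95 is driven entirely by the spikes. Note that a percentile only reacts once the spikes touch more than 5% of the frames, which is why the example uses five of them.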