feat(datasets): add dataset diagnostics CLIs (calibration drift + episode quality) #3762

Open
maeste wants to merge 2 commits into huggingface:main from maeste:feature/dataset-diagnostics

Conversation

@maeste maeste commented Jun 10, 2026

Summary / Motivation

This PR adds two complementary, read-only dataset diagnostic CLIs that audit recorded teleoperation datasets before GPU hours are spent training on them:

  1. lerobot-check-calibration — detects leader/follower calibration drift. In teleoperation, action is the leader position and observation.state is the follower position; a calibration mismatch between the arms bakes a systematic offset into every frame. This is invisible during training (the policy learns the biased mapping, loss looks healthy) but breaks deployment, since inference feeds follower observations to a policy trained on leader-shifted targets. The tool compares the two signals on "stable" frames (robot not moving, so the follower controller has converged) and reports per-motor mean/std with a verdict. Real-world motivation: on an SO-101 setup, a ~17° offset on one joint (≈6 cm Cartesian error at the gripper) went completely unnoticed through recording and training — this finds it in seconds.

  2. lerobot-dataset-quality — flags outlier episodes. Once a dataset grows past a few dozen episodes, reviewing each one in lerobot-dataset-viz is impractical. The tool computes deterministic per-episode metrics from the actions (duration, median/p95 jerk, peak velocity, static fraction, end-pose consistency), flags statistical outliers via the IQR rule, and prints a ranked worst-episodes list with a ready-to-edit lerobot-edit-dataset --operation.type delete_episodes command — turning "watch 200 episodes" into "watch these 8".
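
As a rough illustration of the stable-frame comparison described in item 1, here is a minimal sketch in plain Python. It is not the PR's `compute_episode_deltas`; the function names `stable_frame_deltas` and `summarize`, the per-frame velocity definition, and the choice to require both leader and follower to be still are assumptions made for this example.

```python
from statistics import mean, pstdev


def stable_frame_deltas(actions, states, vel_threshold):
    """Collect per-motor (action - state) on frames where neither arm is moving.

    actions, states: lists of frames, each frame a list of per-motor positions.
    Assumption: a frame t is "stable" when the largest per-motor position change
    since frame t-1, in both signals, is below vel_threshold (units per frame).
    """
    n_motors = len(states[0])
    deltas = [[] for _ in range(n_motors)]
    for t in range(1, len(states)):
        step = max(
            max(abs(states[t][m] - states[t - 1][m]) for m in range(n_motors)),
            max(abs(actions[t][m] - actions[t - 1][m]) for m in range(n_motors)),
        )
        if step < vel_threshold:
            for m in range(n_motors):
                deltas[m].append(actions[t][m] - states[t][m])
    return deltas


def summarize(deltas):
    """Per-motor (mean, std) of the stable-frame deltas; (None, None) if no stable frames."""
    return [(mean(d), pstdev(d)) if d else (None, None) for d in deltas]


# Synthetic episode: hold, move, hold. Motor 0 carries a constant +5 leader offset;
# during the move the leader runs ahead of the follower (controller lag).
states = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [10.0, 10.0],
          [20.0, 20.0], [20.0, 20.0], [20.0, 20.0]]
actions = [[5.0, 0.0], [5.0, 0.0], [5.0, 0.0], [20.0, 15.0],
           [25.0, 20.0], [25.0, 20.0], [25.0, 20.0]]

summary = summarize(stable_frame_deltas(actions, states, vel_threshold=0.5))
print(summary)
```

On this synthetic data the moving frames are excluded, so motor 0 shows a mean delta of 5 with zero spread while motor 1 shows none, which is the signature of a systematic offset rather than lag.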

Both tools load any dataset via the standard LeRobotDataset API, read only the action/state columns (no image/video decoding, runs in seconds), require no new dependencies, and never modify the dataset.

Submitted as one PR since they share the same shape (read-only diagnostics following the lerobot-dataset-viz CLI conventions) and documentation page; supersedes #3759 and #3761.

Related issues

What changed

  • src/lerobot/scripts/lerobot_check_calibration.py: calibration drift analysis. Core logic in pure, importable functions (compute_episode_deltas, summarize_calibration, check_calibration) with a thin argparse main(). Flags: --vel-threshold (stability cutoff), --ok-threshold/--warn-threshold (verdict boundaries — configurable since action units are dataset-dependent), --arm-length-cm (optional Cartesian impact estimate), --output-format table|json.
  • src/lerobot/scripts/lerobot_dataset_quality.py: per-episode quality metrics + IQR outlier flagging (compute_episode_metrics, detect_outliers, evaluate_dataset_quality). Flags: --k-iqr (outlier strictness), --top-bad, --output-format table|json.
  • pyproject.toml: two new entry points, lerobot-check-calibration and lerobot-dataset-quality.
  • tests/scripts/test_check_calibration.py + tests/scripts/test_dataset_quality.py: 22 unit tests on synthetic trajectories with known properties (known offsets, spikes, holds, divergent end poses), covering verdicts, outlier detection, episode grouping/sorting, and end-to-end reports including error paths.
  • docs/source/using_dataset_tools.mdx: user-facing documentation sections for both tools.
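
For intuition on the optional Cartesian impact estimate, the sketch below converts a joint offset into an approximate displacement using the arc-length relation s = L·θ. The PR does not state which formula `--arm-length-cm` uses, so both the formula and the 20 cm lever arm are assumptions chosen for illustration, and `cartesian_error_cm` is a hypothetical helper.

```python
import math


def cartesian_error_cm(offset_deg, arm_length_cm):
    """Arc length swept by a point arm_length_cm from the joint axis: s = L * theta."""
    return arm_length_cm * math.radians(offset_deg)


# Assumed 20 cm lever arm between the offset joint and the gripper.
err = cartesian_error_cm(17.0, 20.0)
print(f"{err:.1f} cm")  # prints "5.9 cm"
```

Under that assumed lever arm, a 17° offset gives about 5.9 cm, in line with the ≈6 cm figure quoted in the summary.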

No breaking changes; purely additive and read-only.

How was this tested (or how to run locally)

  • Tests added: uv run pytest tests/scripts/test_check_calibration.py tests/scripts/test_dataset_quality.py -v (22 passed)
  • pre-commit run passes on all touched files (ruff, mypy, bandit, typos, prettier).
  • Manually verified on real SO-101 teleop datasets:
    • calibration check correctly flags a known miscalibrated dataset (17° offset on shoulder_lift) and reports ok on a freshly recalibrated one;
    • quality flags matched episodes already identified as bad by manual review (over-long struggled attempts, wrong end poses).
  • Quick reviewer repro: lerobot-dataset-quality --repo-id lerobot/pusht (any dataset with an action feature), lerobot-check-calibration --repo-id <any teleop dataset with action + observation.state>.

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green
  • Community Review: I have reviewed another contributor's open PR and linked it here: # (pending)

Reviewer notes

  • The stability filter is the heart of the calibration method: on moving frames action - state is dominated by controller lag, so only converged frames are informative (compute_episode_deltas).
  • The IQR rule was chosen over z-scores because episode metrics are typically skewed (a few very long episodes); it also flags nothing on a perfectly uniform dataset. median_jerk vs p95_jerk is deliberate: the median is robust to isolated corrective spikes, the p95 catches exactly those spikes.
  • The two commits are kept separate (one per tool) for easier review; happy to squash, rename the commands, or split back into two PRs if maintainers prefer.
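
To make the IQR rule from the notes above concrete, here is a minimal stdlib sketch. It is not the PR's `detect_outliers`; the name `iqr_outliers`, the default `k=1.5` (the conventional Tukey fence), and the "inclusive" quartile method are assumptions for this example.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]


# A perfectly uniform metric has IQR = 0 and every value sits on the fence: nothing is flagged.
uniform = [10.0] * 8
print(iqr_outliers(uniform))  # prints []

# A skewed metric (one very long episode): only the long episode falls outside the fences.
durations = [10.0, 11.0, 12.0, 11.0, 10.0, 12.0, 11.0, 40.0]
print(iqr_outliers(durations))  # prints [7]
```

Because the fences are built from quartiles, the single long episode does not inflate the spread estimate the way it would inflate a standard deviation in a z-score test.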

🤖 Generated with Claude Code

maeste and others added 2 commits June 10, 2026 15:13
…er drift

Adds a post-hoc analysis tool that compares action (leader position) with
observation.state (follower position) on stable frames of a recorded
teleoperation dataset. A systematic non-zero mean delta on a joint reveals
a calibration offset between the two arms, which is invisible during
training (the loss stays low) but breaks deployment.

- src/lerobot/scripts/lerobot_check_calibration.py: core analysis + CLI
- pyproject.toml: lerobot-check-calibration entry point
- tests/scripts/test_check_calibration.py: unit tests for delta extraction,
  verdict classification, episode grouping and the end-to-end report
- docs/source/using_dataset_tools.mdx: user-facing documentation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Adds a read-only analysis tool that computes deterministic per-episode
quality metrics from recorded actions (duration, median/p95 jerk, peak
velocity, static fraction, end-pose consistency) and flags statistical
outliers via the IQR rule. Complements lerobot-dataset-viz for datasets
too large to review episode by episode, and feeds candidate episodes to
lerobot-edit-dataset --operation.type=delete_episodes.

- src/lerobot/scripts/lerobot_dataset_quality.py: core metrics + CLI
- pyproject.toml: lerobot-dataset-quality entry point
- tests/scripts/test_dataset_quality.py: unit tests for metrics, outlier
  detection, episode grouping and the end-to-end report
- docs/source/using_dataset_tools.mdx: user-facing documentation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions (bot) added the documentation and tests labels Jun 10, 2026