feat(datasets): add dataset diagnostics CLIs (calibration drift + episode quality) by maeste · Pull Request #3762 · huggingface/lerobot

maeste · 2026-06-10T13:16:20Z

Summary / Motivation

This PR adds two complementary, read-only dataset diagnostic CLIs that audit recorded teleoperation datasets before GPU hours are spent training on them:

- lerobot-check-calibration: detects leader/follower calibration drift. In teleoperation, action is the leader position and observation.state is the follower position; a calibration mismatch between the arms bakes a systematic offset into every frame. This is invisible during training (the policy learns the biased mapping, loss looks healthy) but breaks deployment, since inference feeds follower observations to a policy trained on leader-shifted targets. The tool compares the two signals on "stable" frames (robot not moving, so the follower controller has converged) and reports per-motor mean/std with a verdict. Real-world motivation: on an SO-101 setup, a ~17° offset on one joint (≈6 cm Cartesian error at the gripper) went completely unnoticed through recording and training; this tool finds it in seconds.
- lerobot-dataset-quality: flags outlier episodes. Once a dataset grows past a few dozen episodes, reviewing each one in lerobot-dataset-viz is impractical. The tool computes deterministic per-episode metrics from the actions (duration, median/p95 jerk, peak velocity, static fraction, end-pose consistency), flags statistical outliers via the IQR rule, and prints a ranked worst-episodes list with a ready-to-edit lerobot-edit-dataset --operation.type delete_episodes command, turning "watch 200 episodes" into "watch these 8".
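To make the stable-frame comparison concrete, here is a minimal, standard-library-only sketch of the idea for a single motor. It is not the code in this PR: the helper names (stable_deltas, summarize), the stability test (per-frame change of both signals at or below a threshold), the verdict labels other than "ok", and all threshold values are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def stable_deltas(actions, states, vel_threshold):
    """Collect action - state on frames where neither arm is moving.

    actions / states: one motor's leader and follower positions, one value
    per frame. The stability test used here (per-frame change of BOTH
    signals at or below vel_threshold) is an assumption for illustration.
    """
    deltas = []
    for t in range(1, len(actions)):
        leader_step = abs(actions[t] - actions[t - 1])
        follower_step = abs(states[t] - states[t - 1])
        if leader_step <= vel_threshold and follower_step <= vel_threshold:
            deltas.append(actions[t] - states[t])
    return deltas


def summarize(deltas, ok_threshold, warn_threshold):
    """Mean/std of the stable-frame deltas plus a three-level verdict.

    The labels "warn" and "bad" are hypothetical names.
    """
    if not deltas:
        return {"n": 0, "mean": None, "std": None, "verdict": "no_stable_frames"}
    m = mean(deltas)
    if abs(m) <= ok_threshold:
        verdict = "ok"
    elif abs(m) <= warn_threshold:
        verdict = "warn"
    else:
        verdict = "bad"
    return {"n": len(deltas), "mean": m, "std": pstdev(deltas), "verdict": verdict}


# Synthetic trace in degrees: the leader holds, moves, then holds again.
# The follower lags during the move and settles 17 degrees below the leader.
actions = [10.0, 10.0, 10.0, 10.0, 20.0, 30.0, 30.0, 30.0, 30.0]
states = [-7.0, -7.0, -7.0, -7.0, -2.0, 8.0, 13.0, 13.0, 13.0]

deltas = stable_deltas(actions, states, vel_threshold=0.5)
summary = summarize(deltas, ok_threshold=2.0, warn_threshold=5.0)
print(summary)
```

In this synthetic trace the frames where the follower is still catching up are excluded, so only the constant offset remains in the retained deltas; lag during motion does not contaminate the mean.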

Both tools load any dataset via the standard LeRobotDataset API, read only the action/state columns (no image/video decoding, runs in seconds), require no new dependencies, and never modify the dataset.

Submitted as one PR since they share the same shape (read-only diagnostics following the lerobot-dataset-viz CLI conventions) and documentation page; supersedes #3759 and #3761.

Related issues

What changed

- src/lerobot/scripts/lerobot_check_calibration.py: calibration drift analysis. Core logic in pure, importable functions (compute_episode_deltas, summarize_calibration, check_calibration) with a thin argparse main(). Flags: --vel-threshold (stability cutoff), --ok-threshold/--warn-threshold (verdict boundaries, configurable since action units are dataset-dependent), --arm-length-cm (optional Cartesian impact estimate), --output-format table|json.
- src/lerobot/scripts/lerobot_dataset_quality.py: per-episode quality metrics + IQR outlier flagging (compute_episode_metrics, detect_outliers, evaluate_dataset_quality). Flags: --k-iqr (outlier strictness), --top-bad, --output-format table|json.
- pyproject.toml: two new entry points, lerobot-check-calibration and lerobot-dataset-quality.
- tests/scripts/test_check_calibration.py + tests/scripts/test_dataset_quality.py: 22 unit tests on synthetic trajectories with known properties (known offsets, spikes, holds, divergent end poses), covering verdicts, outlier detection, episode grouping/sorting, and end-to-end reports including error paths.
- docs/source/using_dataset_tools.mdx: user-facing documentation sections for both tools.
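For intuition on what an --arm-length-cm style estimate could report, the sketch below converts a joint offset into a Cartesian displacement using plain chord geometry. The function name, the chord formula, and the 20 cm lever arm are assumptions for illustration; the PR does not state which formula the tool uses or the arm length behind the ≈6 cm figure.

```python
import math


def cartesian_error_cm(offset_deg, arm_length_cm):
    """Straight-line displacement of a point arm_length_cm from a joint
    when that joint is rotated by offset_deg (chord of the swept arc)."""
    half_angle = math.radians(abs(offset_deg)) / 2.0
    return 2.0 * arm_length_cm * math.sin(half_angle)


# Hypothetical 20 cm lever arm with the 17 degree offset from the summary.
error_cm = cartesian_error_cm(17.0, 20.0)
print(f"{error_cm:.2f} cm")  # roughly 5.9 cm
```

Under that assumed lever arm the result is close to the ≈6 cm quoted in the summary. For small angles the chord and the arc length (arm length times the angle in radians) are nearly identical, so either would serve for a rough impact estimate.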

No breaking changes; purely additive and read-only.

How was this tested (or how to run locally)

- Tests added: uv run pytest tests/scripts/test_check_calibration.py tests/scripts/test_dataset_quality.py -v (22 passed)
- pre-commit run passes on all touched files (ruff, mypy, bandit, typos, prettier).
- Manually verified on real SO-101 teleop datasets:
  - calibration check correctly flags a known miscalibrated dataset (17° offset on shoulder_lift) and reports ok on a freshly recalibrated one;
  - quality flags matched episodes already identified as bad by manual review (over-long struggled attempts, wrong end poses).
- Quick reviewer repro: lerobot-dataset-quality --repo-id lerobot/pusht (any dataset with an action feature), lerobot-check-calibration --repo-id <any teleop dataset with action + observation.state>.

Checklist (required before merge)

Linting/formatting run (pre-commit run -a)
All tests pass locally (pytest)
Documentation updated
CI is green
Community Review: I have reviewed another contributor's open PR and linked it here: # (pending)

Reviewer notes

- The stability filter is the heart of the calibration method: on moving frames action - state is dominated by controller lag, so only converged frames are informative (compute_episode_deltas).
- The IQR rule was chosen over z-scores because episode metrics are typically skewed (a few very long episodes); it also flags nothing on a perfectly uniform dataset. median_jerk vs p95_jerk is deliberate: the median is robust to isolated corrective spikes, the p95 catches exactly those spikes.
- The two commits are kept separate (one per tool) for easier review; happy to squash, rename the commands, or split back into two PRs if maintainers prefer.
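The two design choices in the notes above (IQR fences instead of z-scores, median vs p95 jerk) can be illustrated with a small standard-library sketch. Everything here is hypothetical: the helper names, the third-finite-difference jerk, the linear-interpolation percentile, and the k_iqr default of 1.5 (the conventional Tukey value) are assumptions, not the PR's implementation.

```python
from statistics import median


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    xs = sorted(values)
    if len(xs) == 1:
        return xs[0]
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def jerk_magnitudes(positions, dt):
    """Absolute third finite difference of a 1-D position trace over dt**3."""
    return [
        abs(positions[i + 3] - 3 * positions[i + 2] + 3 * positions[i + 1] - positions[i])
        / dt**3
        for i in range(len(positions) - 3)
    ]


def iqr_outliers(metric_by_episode, k_iqr=1.5):
    """Episode ids whose metric lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = list(metric_by_episode.values())
    q1, q3 = percentile(values, 25), percentile(values, 75)
    spread = q3 - q1
    low, high = q1 - k_iqr * spread, q3 + k_iqr * spread
    return sorted(ep for ep, v in metric_by_episode.items() if v < low or v > high)


# Skewed durations in seconds: one over-long episode among typical ones.
durations = {0: 10.0, 1: 11.0, 2: 10.5, 3: 9.5, 4: 10.0, 5: 30.0}
flagged = iqr_outliers(durations)

# A perfectly uniform dataset has zero IQR and nothing outside the fences.
uniform_flagged = iqr_outliers({ep: 10.0 for ep in range(6)})

# Constant-velocity ramp with a single corrective spike at frame 20.
positions = [float(i) for i in range(40)]
positions[20] += 1.0
jerks = jerk_magnitudes(positions, dt=1.0)
median_jerk = median(jerks)
p95_jerk = percentile(jerks, 95)

print(flagged, uniform_flagged, median_jerk, p95_jerk)
```

On the spiked ramp only four of the 37 jerk samples are non-zero, so the median stays at zero while the p95 lands inside the spike; reporting both lets an isolated correction be distinguished from a trajectory that is jerky throughout.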

🤖 Generated with Claude Code

…er drift

Adds a post-hoc analysis tool that compares action (leader position) with observation.state (follower position) on stable frames of a recorded teleoperation dataset. A systematic non-zero mean delta on a joint reveals a calibration offset between the two arms, which is invisible during training (the loss stays low) but breaks deployment.

- src/lerobot/scripts/lerobot_check_calibration.py: core analysis + CLI
- pyproject.toml: lerobot-check-calibration entry point
- tests/scripts/test_check_calibration.py: unit tests for delta extraction, verdict classification, episode grouping and the end-to-end report
- docs/source/using_dataset_tools.mdx: user-facing documentation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Adds a read-only analysis tool that computes deterministic per-episode quality metrics from recorded actions (duration, median/p95 jerk, peak velocity, static fraction, end-pose consistency) and flags statistical outliers via the IQR rule. Complements lerobot-dataset-viz for datasets too large to review episode by episode, and feeds candidate episodes to lerobot-edit-dataset --operation.type=delete_episodes.

- src/lerobot/scripts/lerobot_dataset_quality.py: core metrics + CLI
- pyproject.toml: lerobot-dataset-quality entry point
- tests/scripts/test_dataset_quality.py: unit tests for metrics, outlier detection, episode grouping and the end-to-end report
- docs/source/using_dataset_tools.mdx: user-facing documentation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maeste and others added 2 commits June 10, 2026 15:13

This was referenced Jun 10, 2026

feat(datasets): add lerobot-check-calibration to detect leader/follower calibration drift #3759

Closed

feat(datasets): add lerobot-dataset-quality to flag outlier episodes #3761

Closed

github-actions bot added the documentation and tests labels Jun 10, 2026

maeste wants to merge 2 commits into huggingface:main from maeste:feature/dataset-diagnostics
