feat(datasets): add lerobot-dataset-quality to flag outlier episodes by maeste · Pull Request #3761 · huggingface/lerobot

maeste · 2026-06-10T13:01:30Z

Summary / Motivation

Once a teleoperation dataset grows past a few dozen episodes, reviewing each one in lerobot-dataset-viz is impractical, so bad demonstrations (long, struggling attempts, sharp corrections, hesitations, wrong end poses) silently stay in the training set. This PR adds lerobot-dataset-quality, a read-only CLI that computes deterministic per-episode metrics from the recorded actions and flags statistical outliers, turning "watch 200 episodes" into "watch these 8". The flagged list feeds directly into the existing lerobot-dataset-viz → lerobot-edit-dataset --operation.type delete_episodes workflow.

Related issues

Closes: [feat] lerobot-dataset-quality: per-episode quality metrics and outlier flagging for recorded datasets #3760
Related: [feat] lerobot-check-calibration: detect leader/follower calibration drift in recorded datasets #3758 (companion diagnostic for leader/follower calibration drift)

What changed

src/lerobot/scripts/lerobot_dataset_quality.py: new analysis tool. Core logic is in pure, importable functions (compute_episode_metrics, detect_outliers, evaluate_dataset_quality) with a thin argparse main() following the lerobot-dataset-viz CLI conventions (--repo-id, --root). Only the action column is read — no image/video decoding, so it runs in seconds.
Metrics per episode: n_frames/duration_s, median_jerk and p95_jerk (overall smoothness vs. isolated spikes), max_velocity, static_fraction (hesitations), and final-pose distance from the across-episode mean. Outliers are flagged with the IQR rule (--k-iqr, default 1.5).
pyproject.toml: new lerobot-dataset-quality entry point.
tests/scripts/test_dataset_quality.py: 11 unit tests on synthetic trajectories with known properties (smooth vs. spiked vs. holding), outlier detection (uniform set → no flags; long episode → duration_high; divergent end pose → final_state_high), episode grouping/sorting, and the end-to-end report including error paths.
docs/source/using_dataset_tools.mdx: user-facing documentation section.
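
To make the metric list above concrete, here is a minimal sketch of how such per-episode metrics could be computed from an action array. It is not the code in lerobot_dataset_quality.py: the function name `sketch_episode_metrics`, the `static_eps` threshold, and the finite-difference definitions of velocity and jerk are all assumptions made for illustration.

```python
import numpy as np


def sketch_episode_metrics(actions, fps, static_eps=1e-3):
    """Hypothetical stand-in for the PR's per-episode metrics (definitions assumed)."""
    actions = np.asarray(actions, dtype=float)  # shape (n_frames, action_dim)
    n_frames = actions.shape[0]
    dt = 1.0 / fps

    # Assumption: velocity and jerk are first and third finite differences.
    velocity = np.diff(actions, n=1, axis=0) / dt
    jerk = np.diff(actions, n=3, axis=0) / dt**3
    speed = np.linalg.norm(velocity, axis=1)
    jerk_mag = np.linalg.norm(jerk, axis=1)

    return {
        "n_frames": n_frames,
        "duration_s": n_frames / fps,
        # Median is robust to a few spikes; p95 is sensitive to them.
        "median_jerk": float(np.median(jerk_mag)) if jerk_mag.size else 0.0,
        "p95_jerk": float(np.percentile(jerk_mag, 95)) if jerk_mag.size else 0.0,
        "max_velocity": float(speed.max()) if speed.size else 0.0,
        # Fraction of frame-to-frame steps with (almost) no motion.
        "static_fraction": float(np.mean(speed < static_eps)) if speed.size else 0.0,
    }


fps = 30
ramp = np.linspace(0.0, 1.0, 100)
smooth = np.stack([ramp, ramp], axis=1)  # constant-velocity trajectory

spiked = smooth.copy()  # same trajectory with two isolated corrections
spiked[30, 0] += 0.2
spiked[70, 0] += 0.2

# Moves for the first half, then holds the last reached pose.
holding = np.concatenate([smooth[:50], np.repeat(smooth[49:50], 50, axis=0)])

m_smooth = sketch_episode_metrics(smooth, fps)
m_spiked = sketch_episode_metrics(spiked, fps)
m_holding = sketch_episode_metrics(holding, fps)

print(m_smooth["p95_jerk"], m_spiked["median_jerk"], m_spiked["p95_jerk"])
print(m_holding["static_fraction"])
```

On these synthetic trajectories the spiked episode keeps a near-zero median_jerk while its p95_jerk jumps, and the holding episode reports a static_fraction of roughly one half, which mirrors the smooth vs. spiked vs. holding cases the unit tests describe.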

No breaking changes; purely additive and read-only.

How was this tested (or how to run locally)

Tests added: tests/scripts/test_dataset_quality.py — uv run pytest tests/scripts/test_dataset_quality.py -v (11 passed)
pre-commit run passes on all touched files (ruff, mypy, bandit, typos, prettier).
Manually verified on real SO-101 teleop datasets: flagged episodes matched the ones already identified as bad by manual review in lerobot-dataset-viz (over-long struggled attempts and episodes ending in the wrong pose).
Quick reviewer repro: lerobot-dataset-quality --repo-id lerobot/pusht (any dataset with an action feature works); add --output-format json for raw numbers.

Checklist (required before merge)

Linting/formatting run (pre-commit run -a)
All tests pass locally (pytest)
Documentation updated
CI is green
Community Review: I have reviewed another contributor's open PR and linked it here: # (pending)

Reviewer notes

The IQR rule was chosen over z-scores because episode metrics are typically skewed (e.g. a few very long episodes); it also flags nothing on a perfectly uniform dataset.
median_jerk vs p95_jerk is deliberate: the median is robust to isolated corrective spikes, while the p95 exists to catch exactly those spikes.
The tool prints a suggested lerobot-edit-dataset delete command but never modifies anything itself; the docs stress reviewing flagged episodes visually before deleting.
Happy to rename the command, tune default thresholds, or fold this elsewhere if maintainers prefer.
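
For reviewers who want to reason about the IQR rule and the final-pose metric mentioned in these notes, the sketch below shows one way they could work. The helper names `sketch_iqr_flags` and `sketch_final_pose_distance`, the strict-inequality fences, and the Euclidean distance are assumptions for illustration, not the PR's detect_outliers implementation.

```python
import numpy as np


def sketch_iqr_flags(values, k_iqr=1.5):
    """Hypothetical IQR outlier rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low_fence = q1 - k_iqr * iqr
    high_fence = q3 + k_iqr * iqr
    # Strict inequalities: with IQR == 0, identical values are never flagged.
    return {
        "high": np.flatnonzero(values > high_fence).tolist(),
        "low": np.flatnonzero(values < low_fence).tolist(),
    }


def sketch_final_pose_distance(final_poses):
    """Hypothetical: Euclidean distance of each final action from the across-episode mean."""
    final_poses = np.asarray(final_poses, dtype=float)  # shape (n_episodes, action_dim)
    mean_pose = final_poses.mean(axis=0)
    return np.linalg.norm(final_poses - mean_pose, axis=1)


# Perfectly uniform durations: nothing is flagged.
flags_uniform = sketch_iqr_flags([10.0] * 20)

# One over-long episode (index 9) among otherwise similar durations.
flags_duration = sketch_iqr_flags([10, 11, 9, 10, 12, 10, 11, 9, 10, 40])

# Ten episodes ending at the same pose, except episode 4.
final_poses = np.tile([1.0, 0.5], (10, 1))
final_poses[4] = [3.0, 2.5]
flags_pose = sketch_iqr_flags(sketch_final_pose_distance(final_poses))

print(flags_uniform, flags_duration, flags_pose)
```

Because the fences are derived from quartiles rather than the mean and standard deviation, a single very long episode does not widen the fences enough to hide itself, which is the skew argument made in the first note.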

🤖 Generated with Claude Code

Adds a read-only analysis tool that computes deterministic per-episode quality metrics from recorded actions (duration, median/p95 jerk, peak velocity, static fraction, end-pose consistency) and flags statistical outliers via the IQR rule. Complements lerobot-dataset-viz for datasets too large to review episode by episode, and feeds candidate episodes to lerobot-edit-dataset --operation.type=delete_episodes.

- src/lerobot/scripts/lerobot_dataset_quality.py: core metrics + CLI
- pyproject.toml: lerobot-dataset-quality entry point
- tests/scripts/test_dataset_quality.py: unit tests for metrics, outlier detection, episode grouping and the end-to-end report
- docs/source/using_dataset_tools.mdx: user-facing documentation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maeste · 2026-06-10T13:16:31Z

Superseded by #3762, which combines this tool and #3759 into a single dataset-diagnostics PR.

github-actions bot added the documentation and tests labels Jun 10, 2026

This was referenced Jun 10, 2026

feat(datasets): add dataset diagnostics CLIs (calibration drift + episode quality) #3762 (Open)

feat(datasets): add lerobot-check-calibration to detect leader/follower calibration drift #3759 (Closed)

maeste closed this Jun 10, 2026

maeste deleted the feature/dataset-quality branch June 10, 2026 13:16

maeste wants to merge 1 commit into huggingface:main from maeste:feature/dataset-quality
