feat(datasets): add lerobot-dataset-quality to flag outlier episodes #3761

Closed
maeste wants to merge 1 commit into huggingface:main from maeste:feature/dataset-quality

maeste commented Jun 10, 2026

Summary / Motivation

Once a teleoperation dataset grows past a few dozen episodes, reviewing each one in lerobot-dataset-viz is impractical, so bad demonstrations (long, struggling attempts, sharp corrections, hesitations, wrong end poses) silently stay in the training set. This PR adds lerobot-dataset-quality, a read-only CLI that computes deterministic per-episode metrics from the recorded actions and flags statistical outliers, turning "watch 200 episodes" into "watch these 8". The flagged list feeds directly into the existing lerobot-dataset-viz → lerobot-edit-dataset --operation.type delete_episodes workflow.

Related issues

What changed

  • src/lerobot/scripts/lerobot_dataset_quality.py: new analysis tool. Core logic is in pure, importable functions (compute_episode_metrics, detect_outliers, evaluate_dataset_quality) with a thin argparse main() following the lerobot-dataset-viz CLI conventions (--repo-id, --root). Only the action column is read — no image/video decoding, so it runs in seconds.
  • Metrics per episode: n_frames/duration_s, median_jerk and p95_jerk (smoothness vs. isolated spikes), max_velocity, static_fraction (hesitations), and final-pose distance from the across-episode mean. Outliers flagged with the IQR rule (--k-iqr, default 1.5).
  • pyproject.toml: new lerobot-dataset-quality entry point.
  • tests/scripts/test_dataset_quality.py: 11 unit tests on synthetic trajectories with known properties (smooth vs. spiked vs. holding), outlier detection (uniform set → no flags; long episode → duration_high; divergent end pose → final_state_high), episode grouping/sorting, and the end-to-end report including error paths.
  • docs/source/using_dataset_tools.mdx: user-facing documentation section.
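
As a rough illustration of the action-based metrics listed above (not the code in this PR), the sketch below derives velocity, acceleration and jerk from an action sequence by finite differences and summarizes them per episode. The name `sketch_episode_metrics`, the `static_eps` threshold, the use of Euclidean norms, and the reading of static_fraction as the share of near-zero-speed steps are assumptions made for illustration; the PR's compute_episode_metrics may define these differently. The final-pose distance is omitted because it needs all episodes at once.

```python
import math
from statistics import median


def finite_diff(rows, dt):
    """First-order finite difference of consecutive equal-length vectors."""
    return [[(b - a) / dt for a, b in zip(prev, cur)] for prev, cur in zip(rows, rows[1:])]


def norms(rows):
    """Euclidean norm of each vector."""
    return [math.sqrt(sum(x * x for x in row)) for row in rows]


def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def sketch_episode_metrics(actions, fps, static_eps=1e-3):
    """Hypothetical per-episode metrics; needs at least 4 frames of actions.

    actions: list of per-frame action vectors (lists of floats).
    static_eps: assumed speed threshold below which a step counts as static.
    """
    dt = 1.0 / fps
    vel = finite_diff(actions, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    speed = norms(vel)
    jerk_mag = norms(jerk)
    return {
        "n_frames": len(actions),
        "duration_s": len(actions) / fps,
        "median_jerk": median(jerk_mag),        # typical smoothness
        "p95_jerk": percentile(jerk_mag, 95),   # sensitive to isolated spikes
        "max_velocity": max(speed),
        "static_fraction": sum(s < static_eps for s in speed) / len(speed),
    }
```

Under these assumptions, a constant-velocity ramp with one perturbed frame keeps a near-zero median_jerk while its p95_jerk rises sharply, and an episode that stops moving halfway shows up through static_fraction.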

No breaking changes; purely additive and read-only.

How was this tested (or how to run locally)

  • Tests added: tests/scripts/test_dataset_quality.py; run with uv run pytest tests/scripts/test_dataset_quality.py -v (11 passed)
  • pre-commit run passes on all touched files (ruff, mypy, bandit, typos, prettier).
  • Manually verified on real SO-101 teleop datasets: flagged episodes matched the ones already identified as bad by manual review in lerobot-dataset-viz (over-long struggled attempts and episodes ending in the wrong pose).
  • Quick reviewer repro: lerobot-dataset-quality --repo-id lerobot/pusht (any dataset with an action feature works) or add --output-format json for raw numbers.

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green
  • Community Review: I have reviewed another contributor's open PR and linked it here: # (pending)

Reviewer notes

  • The IQR rule was chosen over z-scores because episode metrics are typically skewed (e.g. a few very long episodes); it also flags nothing on a perfectly uniform dataset.
  • median_jerk vs p95_jerk is deliberate: the median is robust to isolated corrective spikes, the p95 is there to catch exactly those spikes.
  • The tool prints a suggested lerobot-edit-dataset delete command but never modifies anything itself; the docs stress reviewing flagged episodes visually before deleting.
  • Happy to rename the command, tune default thresholds, or fold this elsewhere if maintainers prefer.
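
To make the IQR choice concrete, here is a minimal sketch of the rule with the default k = 1.5, next to a plain z-score check for comparison. The function names, the quartile method (`statistics.quantiles` with `method="inclusive"`), the strict inequalities, and the z-score threshold of 3 are assumptions for illustration; the PR's detect_outliers may compute quartiles and fences differently.

```python
from statistics import mean, pstdev, quantiles


def iqr_outliers(values, k=1.5):
    """Return (low, high): indices below Q1 - k*IQR and above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - k * iqr, q3 + k * iqr
    low = [i for i, v in enumerate(values) if v < lo_fence]
    high = [i for i, v in enumerate(values) if v > hi_fence]
    return low, high


def zscore_outliers(values, threshold=3.0):
    """Indices whose |z-score| exceeds the threshold (assumed comparison baseline)."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return []  # all values identical: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sd > threshold]


# Hypothetical episode durations in seconds; the last one is a long attempt.
durations = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 30.0]
low, high = iqr_outliers(durations)
print("IQR low:", low, "high:", high)
print("z-score:", zscore_outliers(durations))
```

With only eight episodes, the long one inflates the mean and standard deviation enough that its own z-score cannot reach 3, while the quartile-based fences are barely affected by it. On a list of identical values the IQR is zero, both fences equal that value, and the strict comparisons flag nothing.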

🤖 Generated with Claude Code

Adds a read-only analysis tool that computes deterministic per-episode
quality metrics from recorded actions (duration, median/p95 jerk, peak
velocity, static fraction, end-pose consistency) and flags statistical
outliers via the IQR rule. Complements lerobot-dataset-viz for datasets
too large to review episode by episode, and feeds candidate episodes to
lerobot-edit-dataset --operation.type=delete_episodes.

- src/lerobot/scripts/lerobot_dataset_quality.py: core metrics + CLI
- pyproject.toml: lerobot-dataset-quality entry point
- tests/scripts/test_dataset_quality.py: unit tests for metrics, outlier
  detection, episode grouping and the end-to-end report
- docs/source/using_dataset_tools.mdx: user-facing documentation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
github-actions bot added the documentation and tests labels Jun 10, 2026
maeste commented Jun 10, 2026

Superseded by #3762, which combines this tool and #3759 into a single dataset-diagnostics PR.

maeste closed this Jun 10, 2026
maeste deleted the feature/dataset-quality branch June 10, 2026 13:16
Labels

documentation, tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[feat] lerobot-dataset-quality: per-episode quality metrics and outlier flagging for recorded datasets

1 participant