Dataset quality diagnostic tool -- catches silent training failures #3280

@jashshah999

I built a standalone tool that runs diagnostic checks on LeRobot v3 datasets. It catches issues that are currently only discovered when training fails or produces poor results.

What it does:

pip install lerobot-doctor
lerobot-doctor lerobot/columbia_cairlab_pusht_real --max-episodes 10
[WARN] Data Distribution
  - observation.state[2]: zero variance (constant value 0.0000)
  - observation.state[3]: zero variance (constant value 0.0000)
  - observation.state[4]: zero variance (constant value 0.0000)
  - observation.state[5]: zero variance (constant value 0.0000)
  - observation.state[6]: zero variance (constant value 0.0000)
  - observation.state[7]: zero variance (constant value 0.0000)
  - action[2]: zero variance (constant value 0.0000)
  - action[3]: zero variance (constant value 0.0000)
  - action[4]: zero variance (constant value 0.0000)
  - action[5]: zero variance (constant value 0.0000)
  - action[6]: zero variance (constant value 0.0000)

[WARN] Training Readiness
  - stats.json: 5 dimension(s) have zero std -- normalization will produce NaN/Inf

This dataset is on the Hub right now. Anyone who trains on it with standard mean/std normalization divides by a zero std in those dimensions, gets NaN loss, and has no indication of the cause.
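To illustrate the failure mode, here is a minimal sketch of a zero-variance check. It is not lerobot-doctor's actual implementation; the function name `zero_variance_dims` and the `eps` threshold are assumptions made for this example. The idea is that any dimension whose std is zero makes `(x - mean) / std` evaluate to 0/0, which array libraries such as NumPy or PyTorch turn into NaN.

```python
import statistics


def zero_variance_dims(frames, eps=1e-8):
    """Return indices of dimensions whose population std is <= eps.

    `frames` is a list of equal-length vectors, one per timestep.
    Hypothetical helper for illustration only.
    """
    if not frames:
        return []
    n_dims = len(frames[0])
    flagged = []
    for d in range(n_dims):
        # Collect the values of dimension d across all timesteps.
        column = [frame[d] for frame in frames]
        if statistics.pstdev(column) <= eps:
            flagged.append(d)
    return flagged


# Toy 4-dimensional state: dimensions 2 and 3 never change.
states = [
    [0.10, 0.50, 0.0, 0.0],
    [0.20, 0.40, 0.0, 0.0],
    [0.35, 0.45, 0.0, 0.0],
]
print(zero_variance_dims(states))  # [2, 3]
```

A check like this could run before training, so flagged dimensions can be dropped or excluded from normalization rather than surfacing later as NaN loss.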

10 checks: metadata, temporal consistency, action quality, video integrity, data distribution, episode health, feature consistency, training readiness, anomaly detection, portability.

Tested on 12 Hugging Face datasets with 0 crashes. Some findings:

  • columbia_cairlab_pusht_real: 11 zero-variance dims, normalization will produce NaN
  • droid_100: gripper clipping at 1.0, 27 consecutive frozen actions
  • unitreeh1_fold_clothes: distribution shift between recording sessions
  • xarm_lift_medium: all episodes too short for ACT/Diffusion chunk sizes
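The kinds of findings listed above can be approximated with a few small checks. The sketch below is an assumption about how such checks might look, not the tool's code. All three helper names are hypothetical, and the tolerance and `chunk_size=100` values are illustrative only.

```python
def longest_frozen_run(actions, tol=0.0):
    """Length of the longest run of consecutive (near-)identical action vectors."""
    if not actions:
        return 0
    longest = current = 1
    for prev, curr in zip(actions, actions[1:]):
        if all(abs(a - b) <= tol for a, b in zip(prev, curr)):
            current += 1
            longest = max(longest, current)
        else:
            current = 1
    return longest


def fraction_at_value(values, limit, tol=1e-6):
    """Fraction of samples sitting at `limit`, a rough signal of clipping."""
    if not values:
        return 0.0
    return sum(abs(v - limit) <= tol for v in values) / len(values)


def episodes_shorter_than_chunk(episode_lengths, chunk_size):
    """Indices of episodes with fewer frames than the policy's action chunk."""
    return [i for i, n in enumerate(episode_lengths) if n < chunk_size]


# Toy data, not taken from any of the datasets named above.
actions = [[0.0, 1.0], [0.1, 1.0], [0.1, 1.0], [0.1, 1.0], [0.2, 1.0]]
gripper = [0.2, 1.0, 1.0, 1.0]

print(longest_frozen_run(actions))                                # 3
print(fraction_at_value(gripper, limit=1.0))                      # 0.75
print(episodes_shorter_than_chunk([25, 120, 40], chunk_size=100)) # [0, 2]
```

Whether a frozen run or a clipped fraction counts as a problem depends on the task, so thresholds for warning would need to be chosen per dataset.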

Works on local datasets and HF repo IDs. No dependency on lerobot.

Repo: https://github.com/jashshah999/lerobot-doctor
PyPI: pip install lerobot-doctor

Would love feedback on what checks would be most useful. Open to contributions -- adding support for other dataset formats (RLDS, Open X-Embodiment, etc.) is on the roadmap.

Labels: CI, dataset, dependencies, training
