THUZeng/embodied-eval-kit

embodied-eval-kit

Stop shipping robots on vibes. Ship with evidence.

embodied-eval-kit is an open-source evaluation gate for embodied AI / robotics:
validate logs → compare runs → gate decision (GO/NO_GO) → audit-ready delivery bundle.

⭐ If this project helps your team move faster, please star it — that helps us keep improving it.

Why people star this

  • Actually usable in delivery: built for acceptance sign-off, not just notebook demos.
  • Controller-agnostic: evaluate exported logs; no need to deploy code into robot controllers.
  • Safety-first decisions: explicit blockers, risk grading, fix-plan/verification loop.
  • Enterprise handoff ready: manifest hashes, changelog, risk register, bundle, offline portal.
  • Growth + pre-sales ready: client reports, case-study pack, and tender/procurement response pack.

TL;DR

pip install -e .
python -m embodied_eval.cli pipeline --baseline examples/customer.csv --candidate examples/ros_export.json --adapter auto --task_mapping integration/mappings/task_id_mapping_example.json --failure_mapping integration/mappings/failure_type_mapping_example.json --config configs/packs/factory_picking.json --out_dir deliverables/test_pipeline --manifest
python -m embodied_eval.cli bundle --run_dir deliverables/test_pipeline --out_dir bundles/client_v1 --with_portal --with_tender --zip

Manipulation v1.1 (Tender/Procurement Response Pack)

v1.1 adds CN-first procurement/tender outputs for pre-sales and enterprise delivery:

  • tender command to generate a CN technical response pack (技术响应包): functional response document (功能响应书), acceptance-clause cross-reference (验收条款对照), milestones (里程碑), risk & compliance (风险合规), and support plan (支持方案)
  • deterministic evidence-linking from manifest.json hashes when available
  • customizable acceptance clauses via tender/acceptance_items_default.json or --items <json/yaml>
  • bundle --with_tender integration for one-step client package assembly

Manipulation v1.0 (Stable Contracts + Plugins + OSS Release)

v0.4+ productizes the toolkit for real customer delivery:

  • manipulation-first 10-task suite
  • rich failure taxonomy + safety deep-dive
  • configurable GO / CONDITIONAL_GO / NO_GO decision engine
  • baseline-vs-candidate regression highlighting with explicit REGRESSION flags
  • 1-page executive output for stakeholder review
  • red-team checklist coverage runner
  • service tier docs (Lite / Pro / Enterprise)
  • scenario config packs (factory / hospital / lab demo)
  • export bundle command for handoff delivery (folder + optional zip)
  • growth kit command to turn deliverables into sales collateral
  • one-command demo bundle generation for client showcases
  • enterprise audit/evidence bundle with reproducibility manifest
  • changelog/risk/fix-verify workflow for procurement-style acceptance
  • regression-suite checks for must-run release criteria
  • offline static acceptance portal generator (portal) with local search

Integration (v0.6): Connect without touching the robot controller

v0.6 provides ingestion and validation workflows so teams can evaluate exported logs quickly:

  • Minimum Logging Contract (MLC): integration/minimum_logging_contract.md
  • Adapter framework (jsonl, csv, ros_export)
  • validate command for compatibility checks + data-quality scoring
  • pipeline command for convert -> validate -> gate/export delivery

No need to deploy code on robot controllers; run evaluation on exported logs.
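To make the adapter idea concrete, here is a minimal sketch of what a jsonl adapter could look like. The raw field names (`task`, `duration`, `failure`) and mapping entries are illustrative assumptions, not the package's actual Minimum Logging Contract; the real adapters live in the framework listed above.

```python
import json

# Hypothetical mappings; in practice these come from the --task_mapping
# and --failure_mapping JSON files.
TASK_MAPPING = {"pick_cube": "T01_pick_place"}
FAILURE_MAPPING = {"dropped": "object_drop"}

def convert_jsonl(path):
    """Convert raw jsonl records into canonical episode dicts (sketch)."""
    episodes = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines in exported logs
                continue
            raw = json.loads(line)
            episodes.append({
                "task_id": TASK_MAPPING.get(raw["task"], raw["task"]),
                "success": bool(raw.get("success", False)),
                "duration_sec": float(raw.get("duration", 0.0)),
                "failure_type": FAILURE_MAPPING.get(raw.get("failure")),
            })
    return episodes
```

Unmapped task IDs pass through unchanged, which is what lets validate flag them as compatibility issues rather than crashing the pipeline.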

Streaming (optional): evaluate without deploying code into the robot controller

v0.6 adds optional event streaming while keeping offline logs as the default:

  • Fine-grained event schema + recorder (embodied_eval/events.py, embodied_eval/recorder.py)
  • Stdlib HTTP ingestion server (serve)
  • Live CLI monitor (watch) with soft safety alerts
  • ROS2 / Isaac Lab exporter templates (no runtime ROS dependency)

Quickstart:

python -m embodied_eval.cli serve --port 8787 --out_dir streaming_runs/test_run
python -m embodied_eval.cli watch --source streaming_runs/test_run/raw_events.jsonl --config configs/gate_default.json
python -m embodied_eval.cli evaluate --input streaming_runs/test_run/episodes.json --out reports/stream_report.md --json_out reports/stream_report.json
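The recorder side of this flow can be sketched as a small append-only jsonl writer. The event fields below are illustrative; the actual event schema is defined in embodied_eval/events.py.

```python
import json
import time
from pathlib import Path

class EventRecorder:
    """Append fine-grained events to raw_events.jsonl (sketch; field
    names here are assumptions, not the package's real event schema)."""

    def __init__(self, out_dir):
        self.path = Path(out_dir) / "raw_events.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def emit(self, event_type, **payload):
        record = {"ts": time.time(), "type": event_type, **payload}
        # One JSON object per line keeps the file tail-able by `watch`.
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")
        return record
```

Because each event is a self-contained line, a monitor can tail the file and raise soft safety alerts without any coupling to the controller process.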

Growth Kit (v0.7): turn runs into client-facing collateral

v0.7 adds template-based growth outputs without any external APIs:

  • case-study pack generator (growth) from an existing run folder
  • public summaries (CN/EN), poster-style markdown, Xianyu listing, Upwork proposal
  • pricing sheet + lead intake + scope outline templates
  • one-command demo mode (demo) that creates deliverables + growth artifacts + zip

Enterprise Delivery (v1.0): procurement/audit-ready flow

v1.0 adds audit-traceable outputs for Chinese enterprise delivery:

  • --manifest for export / pipeline / demo to produce manifest.json + manifest.md
  • changelog to generate bilingual change summaries (improvements/regressions/root-cause hints)
  • risk to seed structured risk register from gate + integration evidence
  • fixplan to generate a fix plan (整改清单) and re-test checklist (复测核对表), and close items across iterations
  • regressionsuite to enforce must-run tasks/safety scenarios before release
  • bundle to assemble final client package (deliverables + evidence + changelog + risk + fix/verify)
  • portal to generate offline static website from run/bundle (zero external deps)
  • plugins to discover built-in adapters/metrics/packs/templates
  • examples registry CLI for reproducible golden demos
  • upgrade to migrate old artifacts to v1 contract schema

Quickstart

cd embodied-eval-kit
python -m venv .venv
source .venv/bin/activate
pip install -e .
python -m embodied_eval.cli evaluate --input examples/sample_log.json --config configs/gate_default.json --out reports/sample_report.md --json_out reports/sample_report.json --executive_out reports/executive_eval.md
python -m embodied_eval.cli compare --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/gate_default.json --out reports/compare.md --json_out reports/compare.json --executive_out reports/executive_compare.md
python -m embodied_eval.cli gate --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/gate_default.json --out reports/gate.md --json_out reports/gate.json --executive_out reports/executive.md
python -m embodied_eval.cli export --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/packs/hospital.json --out_dir deliverables/test_run --zip
python -m embodied_eval.cli redteam --input examples/sample_log.json --out reports/redteam.md --json_out reports/redteam.json
python -m embodied_eval.cli validate --input examples/customer.csv --adapter csv --out reports/validate.md --json_out reports/validate.json
python -m embodied_eval.cli pipeline --baseline examples/customer.csv --candidate examples/ros_export.json --adapter auto --task_mapping integration/mappings/task_id_mapping_example.json --failure_mapping integration/mappings/failure_type_mapping_example.json --config configs/packs/factory_picking.json --out_dir deliverables/test_pipeline --zip
python -m embodied_eval.cli serve --port 8787 --out_dir streaming_runs/test_run --duration_sec 2
python -m embodied_eval.cli growth --run_dir deliverables/test_pipeline --out_dir growth_out/case_001 --project "Embodied Eval Gate" --tone casual
python -m embodied_eval.cli demo --out_dir demo_out --config configs/packs/lab_demo.json --zip
python -m embodied_eval.cli pipeline --baseline examples/customer.csv --candidate examples/ros_export.json --adapter auto --task_mapping integration/mappings/task_id_mapping_example.json --failure_mapping integration/mappings/failure_type_mapping_example.json --config configs/packs/factory_picking.json --out_dir deliverables/test_pipeline --manifest
python -m embodied_eval.cli changelog --run_dir deliverables/test_pipeline --out reports/changelog_cn.md
python -m embodied_eval.cli risk --run_dir deliverables/test_pipeline --out_dir reports/
python -m embodied_eval.cli fixplan --run_dir deliverables/test_pipeline --out_dir reports/
python -m embodied_eval.cli bundle --run_dir deliverables/test_pipeline --out_dir bundles/client_v1 --zip
python -m embodied_eval.cli regressionsuite --suite configs/regression_suite_default.json --input examples/sample_log.json --out reports/regression_suite.md --json_out reports/regression_suite.json
python -m embodied_eval.cli portal --source_dir deliverables/test_pipeline --out_dir portal_out/site_001 --title "Embodied Eval Portal" --zip
python -m embodied_eval.cli tender --source_dir deliverables/test_pipeline --out_dir tender_out/pkg_001 --project "Embodied Eval Gate" --client "DemoClient" --scenario "Factory Picking" --zip
python -m embodied_eval.cli plugins --list
python -m embodied_eval.cli examples --list
python -m embodied_eval.cli examples --run demo_basic --out_dir demo_runs/demo_basic --zip
python -m embodied_eval.cli upgrade --input deliverables/test_pipeline/gate.json --kind gate --out reports/gate_v1.json
pytest -q

The backward-compatible command still works:

python -m embodied_eval.cli evaluate --input examples/sample_log.json --out reports/sample_report.md

For non-canonical logs, evaluate / compare / gate support --allow_noncanonical:

python -m embodied_eval.cli evaluate \
  --input examples/customer_jsonl.log \
  --allow_noncanonical \
  --adapter jsonl \
  --task_mapping integration/mappings/task_id_mapping_example.json \
  --failure_mapping integration/mappings/failure_type_mapping_example.json \
  --out reports/customer_eval.md \
  --json_out reports/customer_eval.json

What gets evaluated

Overall and per-task KPIs:

  • success_rate
  • median_time_sec
  • drop_rate
  • collision_rate
  • retry_rate
  • safety_event_rate
  • emergency_stop_rate
  • near_miss_rate
  • out_of_bounds_rate
  • human_proximity_rate
  • quality_score (0-100)
  • safety_grade (A-F)

Also included:

  • failure breakdown by type and stage
  • redteam tag coverage summary
  • acceptance decision with rationale / blockers / required fixes
  • client-facing acceptance report rendering (CN/EN templates)
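As a rough illustration of how such KPIs aggregate from episode records, here is a toy computation. The formulas and field names are assumptions for illustration; the toolkit's own definitions live in embodied_eval/metrics.py.

```python
def compute_kpis(episodes):
    """Aggregate per-episode records into overall KPIs (illustrative)."""
    n = len(episodes)
    if n == 0:
        return {}
    times = sorted(e["duration_sec"] for e in episodes)
    mid = n // 2
    median_time = times[mid] if n % 2 else (times[mid - 1] + times[mid]) / 2
    return {
        "success_rate": sum(e["success"] for e in episodes) / n,
        "median_time_sec": median_time,
        "drop_rate": sum(e.get("object_drop", False) for e in episodes) / n,
        "collision_rate": sum(
            "collision" in e.get("safety_events", []) for e in episodes
        ) / n,
    }
```

Per-task KPIs are the same computation applied after grouping episodes by task_id.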

Repo layout

embodied_eval/
  __init__.py
  schema.py
  metrics.py
  evaluate.py
  cli.py
benchmarks/
configs/
integration/
integration/streaming/
product/
data_schema/
reports/
examples/
tests/

Scenario config packs

  • configs/packs/factory_picking.json — throughput + drop/collision emphasis
  • configs/packs/hospital.json — stricter safety thresholds
  • configs/packs/lab_demo.json — lenient demo thresholds, success-oriented

Run with a specific pack:

python -m embodied_eval.cli gate --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/packs/factory_picking.json --out reports/factory_gate.md --json_out reports/factory_gate.json
python -m embodied_eval.cli gate --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/packs/hospital.json --out reports/hospital_gate.md --json_out reports/hospital_gate.json
python -m embodied_eval.cli gate --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/packs/lab_demo.json --out reports/lab_gate.md --json_out reports/lab_gate.json

Export delivery bundle

Use export to generate handoff artifacts for procurement/acceptance review:

python -m embodied_eval.cli export \
  --baseline examples/baseline_log.json \
  --candidate examples/candidate_log.json \
  --config configs/packs/hospital.json \
  --out_dir deliverables/run_20260215 \
  --zip

Generated artifacts include:

  • gate.md
  • executive.md
  • client_acceptance_cn.md
  • client_acceptance_en.md
  • gate.json
  • compare.json
  • config_used.json

Use --manifest to add reproducibility evidence:

python -m embodied_eval.cli export \
  --baseline examples/baseline_log.json \
  --candidate examples/candidate_log.json \
  --config configs/packs/hospital.json \
  --out_dir deliverables/run_20260215 \
  --manifest

This writes:

  • manifest.json (machine-readable hashes/snapshots)
  • manifest.md (human-readable audit table)
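The core of such a manifest is a content hash per artifact, so an auditor can verify that delivered files match the evaluated run. A minimal sketch of the idea (the real manifest.json layout may differ):

```python
import hashlib
import json
from pathlib import Path

def build_manifest(run_dir):
    """Hash every file under run_dir into a manifest dict (sketch)."""
    entries = {}
    for p in sorted(Path(run_dir).rglob("*")):  # sorted for determinism
        if p.is_file():
            digest = hashlib.sha256(p.read_bytes()).hexdigest()
            entries[str(p.relative_to(run_dir))] = digest
    return {"schema": "manifest.v1", "files": entries}
```

Re-running the hash over a received bundle and diffing against manifest.json is what makes the evidence chain checkable offline.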

v1.0 Stable Contracts

Stable schemas are defined in embodied_eval/contracts/:

  • episode.v1
  • gate.v1
  • compare.v1
  • manifest.v1
  • risk_register.v1
  • fix_plan.v1
  • portal_index.v1

Validate programmatically with:

  • embodied_eval.contracts.validate.validate_episode_file
  • embodied_eval.contracts.validate.validate_artifact_file

Plugins

Built-in plugin registry is available via:

python -m embodied_eval.cli plugins --list

Kinds:

  • adapters
  • metrics
  • packs
  • templates

Examples

Golden examples are listed in examples/registry.json.

python -m embodied_eval.cli examples --list
python -m embodied_eval.cli examples --run demo_basic --out_dir demo_runs/demo_basic --zip

Upgrade old artifacts

Use the best-effort upgrader for old JSON outputs (v0.3–v0.9):

python -m embodied_eval.cli upgrade --input old_gate.json --kind gate --out gate_v1.json

Tender Pack (v1.1)

Generate procurement-ready CN docs directly from run_dir or bundle:

python -m embodied_eval.cli tender \
  --source_dir deliverables/test_pipeline \
  --out_dir tender_out/pkg_001 \
  --project "Embodied Eval Gate" \
  --client "DemoClient" \
  --scenario "Factory Picking" \
  --items tender/acceptance_items_default.json \
  --zip

Integrate into final bundle:

python -m embodied_eval.cli bundle \
  --run_dir deliverables/test_pipeline \
  --out_dir bundles/client_v1_1 \
  --with_portal \
  --with_tender \
  --tender_project "Embodied Eval Gate" \
  --tender_client "某某单位" \
  --tender_scenario "工厂抓取" \
  --zip

Recommended enterprise flow

validate -> pipeline --manifest -> changelog/risk/fixplan -> bundle --with_portal -> growth

Runnable script:

bash examples/enterprise_flow.sh

Offline Acceptance Portal

Generate a fully offline static site (double-click index.html):

python -m embodied_eval.cli portal \
  --source_dir deliverables/test_pipeline \
  --out_dir portal_out/site_001 \
  --title "Embodied Eval Acceptance Portal" \
  --include_growth \
  --zip

Integrate portal into final bundle:

python -m embodied_eval.cli bundle \
  --run_dir deliverables/test_pipeline \
  --out_dir bundles/client_v1 \
  --with_portal \
  --include_growth \
  --zip

Example logs and runnable sample commands are documented in examples/README.md.

Service tiers

See product packaging docs in:

  • product/tiers.md
  • product/deliverables_checklist.md
  • product/sales_one_pager.md

Why this matters

Teams shipping manipulation policies need a reproducible acceptance gate before deployment.
embodied-eval-kit makes delivery readiness auditable by standardizing trajectory logs, KPI scoring, failure analysis, and regression checks between candidate and baseline runs.

Acceptance Decision Outputs

The gate decision is emitted to JSON by evaluate, compare, and gate:

  • decision: GO / CONDITIONAL_GO / NO_GO
  • risk_level: LOW / MEDIUM / HIGH
  • rationale
  • blockers
  • required_fixes (top 5)
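A toy version of the decision rule, to show how these fields relate. This is an illustrative sketch only: the real engine is configurable via the gate config JSON, and the thresholds here are assumed floors on higher-is-better KPIs.

```python
def gate_decision(kpis, blockers, thresholds):
    """Map KPIs and explicit blockers to GO / CONDITIONAL_GO / NO_GO (toy)."""
    if blockers:
        # Any explicit safety blocker forces a hard stop.
        return {"decision": "NO_GO", "risk_level": "HIGH", "blockers": blockers}
    misses = [k for k, floor in thresholds.items() if kpis.get(k, 0.0) < floor]
    if misses:
        # Below-threshold KPIs ship conditionally, capped at the top 5 fixes.
        return {"decision": "CONDITIONAL_GO", "risk_level": "MEDIUM",
                "required_fixes": misses[:5]}
    return {"decision": "GO", "risk_level": "LOW"}
```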

Extend failure taxonomy

  1. Add enum value in embodied_eval/schema.py (FailureType).
  2. Update benchmark docs in benchmarks/task_list.md.
  3. Ensure producer logs emit the new failure_type.
  4. Re-run evaluate / compare and inspect breakdown sections.
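Step 1 above amounts to adding one enum member. The class below is an illustrative stand-in, not the actual contents of embodied_eval/schema.py; the member names and the new `CABLE_SNAG` value are assumptions.

```python
from enum import Enum

class FailureType(str, Enum):
    """Illustrative failure taxonomy (actual members live in schema.py)."""
    GRASP_SLIP = "grasp_slip"
    OBJECT_DROP = "object_drop"
    COLLISION = "collision"
    # Step 1: add the new value here; its string must match what the
    # producer writes into each episode's failure_type field (step 3).
    CABLE_SNAG = "cable_snag"
```

Using a str-mixin Enum keeps parsing one-way: `FailureType("cable_snag")` round-trips the log string directly.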

Add a new task

  1. Add the task to benchmarks/task_list.md.
  2. Add task ID to TaskName/MANIPULATION_TASK_IDS in embodied_eval/schema.py.
  3. Add target time in TASK_TARGET_TIME_SEC in embodied_eval/metrics.py.
  4. Include episodes in your log JSON.
  5. Optionally override task-level thresholds and quality penalty weights in config JSON.
  6. Re-run evaluate / compare / gate / export.
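Steps 2 and 3 can be pictured as two registrations that must stay in sync. The task IDs and times below are hypothetical examples, not the repo's actual tables.

```python
# Step 2 (schema.py): register the new task ID.
MANIPULATION_TASK_IDS = [
    "T01_pick_place",
    "T11_peg_insertion",   # hypothetical new task
]

# Step 3 (metrics.py): give it a target time so median_time_sec
# can be scored against an expectation.
TASK_TARGET_TIME_SEC = {
    "T01_pick_place": 12.0,
    "T11_peg_insertion": 20.0,
}
```

A quick invariant worth checking in tests: every registered task ID has a target time.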

Plug in real robot logs later

  1. Export your robot episodes to the JSON schema in data_schema/trajectory_schema.md.
  2. Ensure each episode includes manipulation metadata:
    • object_drop, num_retries, grasp_attempts
    • expanded safety_events
    • failure_type (+ optional failure_stage, failure_notes)
    • optional redteam_tags
  3. Run local acceptance commands:
python -m embodied_eval.cli evaluate --input /path/to/real_log.json --config configs/gate_default.json --out reports/real_robot_report.md --json_out reports/real_robot_report.json --executive_out reports/real_exec.md
python -m embodied_eval.cli compare --baseline /path/to/baseline.json --candidate /path/to/candidate.json --config configs/gate_default.json --out reports/real_compare.md --json_out reports/real_compare.json --executive_out reports/real_compare_exec.md
python -m embodied_eval.cli gate --baseline /path/to/baseline.json --candidate /path/to/candidate.json --config configs/gate_default.json --out reports/real_gate.md --json_out reports/real_gate.json --executive_out reports/real_gate_exec.md
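For step 2, a single episode record with the listed manipulation metadata might look like the following. Field names and values are illustrative; the authoritative schema is data_schema/trajectory_schema.md.

```python
# Hypothetical episode record exported from a real robot run.
episode = {
    "task_id": "T01_pick_place",
    "success": False,
    "duration_sec": 14.8,
    "object_drop": True,
    "num_retries": 2,
    "grasp_attempts": 3,
    "safety_events": ["near_miss"],
    "failure_type": "object_drop",        # required for failure breakdowns
    "failure_stage": "transport",         # optional
    "failure_notes": "object slipped during lift",  # optional
    "redteam_tags": ["occluded_object"],  # optional
}
```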

Notes

  • No external API calls are used.
  • Python 3.11+, minimal dependencies (pydantic, pytest).

