Stop shipping robots on vibes. Ship with evidence.
`embodied-eval-kit` is an open-source evaluation gate for embodied AI / robotics:
validate logs → compare runs → gate decision (GO/NO_GO) → audit-ready delivery bundle.
⭐ If this project helps your team move faster, please star it — that helps us keep improving it.
- Actually usable in delivery: built for acceptance sign-off, not just notebook demos.
- Controller-agnostic: evaluate exported logs; no need to deploy code into robot controllers.
- Safety-first decisions: explicit blockers, risk grading, fix-plan/verification loop.
- Enterprise handoff ready: manifest hashes, changelog, risk register, bundle, offline portal.
- Growth + pre-sales ready: client reports, case-study pack, and tender/procurement response pack.
```bash
pip install -e .
python -m embodied_eval.cli pipeline --baseline examples/customer.csv --candidate examples/ros_export.json --adapter auto --task_mapping integration/mappings/task_id_mapping_example.json --failure_mapping integration/mappings/failure_type_mapping_example.json --config configs/packs/factory_picking.json --out_dir deliverables/test_pipeline --manifest
python -m embodied_eval.cli bundle --run_dir deliverables/test_pipeline --out_dir bundles/client_v1 --with_portal --with_tender --zip
```

v1.1 adds CN-first procurement/tender outputs for pre-sales and enterprise delivery:
- `tender` command to generate a CN technical response package (functional response document, acceptance clause cross-reference, milestones, risk & compliance, support plan)
- deterministic evidence-linking from `manifest.json` hashes when available
- customizable acceptance clauses via `tender/acceptance_items_default.json` or `--items <json/yaml>`
- `bundle --with_tender` integration for one-step client package assembly
v0.4+ productizes the toolkit for real customer delivery:
- manipulation-first 10-task suite
- rich failure taxonomy + safety deep-dive
- configurable GO / CONDITIONAL_GO / NO_GO decision engine
- baseline-vs-candidate regression highlighting with explicit `REGRESSION` flags
- 1-page executive output for stakeholder review
- red-team checklist coverage runner
- service tier docs (Lite / Pro / Enterprise)
- scenario config packs (factory / hospital / lab demo)
- export bundle command for handoff delivery (folder + optional zip)
- growth kit command to turn deliverables into sales collateral
- one-command demo bundle generation for client showcases
- enterprise audit/evidence bundle with reproducibility manifest
- changelog/risk/fix-verify workflow for procurement-style acceptance
- regression-suite checks for must-run release criteria
- offline static acceptance portal generator (`portal`) with local search
v0.6 keeps ingestion and validation workflows so teams can evaluate exported logs quickly:
- Minimum Logging Contract (MLC): `integration/minimum_logging_contract.md`
- Adapter framework (`jsonl`, `csv`, `ros_export`)
- `validate` command for compatibility checks + data-quality scoring
- `pipeline` command for convert -> validate -> gate/export delivery
No need to deploy code on robot controllers; run evaluation on exported logs.
v0.6 adds optional event streaming while keeping offline logs as the default:
- Fine-grained event schema + recorder (`embodied_eval/events.py`, `embodied_eval/recorder.py`)
- Stdlib HTTP ingestion server (`serve`)
- Live CLI monitor (`watch`) with soft safety alerts
- ROS2 / Isaac Lab exporter templates (no runtime ROS dependency)
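A raw streaming run stores one JSON object per line in `raw_events.jsonl`. The authoritative schema lives in `embodied_eval/events.py`; the field names below (`ts`, `event_type`, `task_id`, `data`) are illustrative assumptions, not the package's actual contract:

```python
import json
import time

# Hypothetical event records; real field names are defined in embodied_eval/events.py.
events = [
    {"ts": time.time(), "event_type": "episode_start", "task_id": "pick_place_01", "data": {}},
    {"ts": time.time(), "event_type": "safety_alert", "task_id": "pick_place_01",
     "data": {"kind": "near_miss", "distance_m": 0.04}},
]

# Append-only JSONL makes the file safe to tail while the run is still going.
with open("raw_events.jsonl", "w", encoding="utf-8") as f:
    for ev in events:
        f.write(json.dumps(ev) + "\n")

# A monitor (like the `watch` command) can scan line by line for soft safety alerts:
with open("raw_events.jsonl", encoding="utf-8") as f:
    alerts = [json.loads(line) for line in f if json.loads(line)["event_type"] == "safety_alert"]
print(len(alerts))
```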
Quickstart:
```bash
python -m embodied_eval.cli serve --port 8787 --out_dir streaming_runs/test_run
python -m embodied_eval.cli watch --source streaming_runs/test_run/raw_events.jsonl --config configs/gate_default.json
python -m embodied_eval.cli evaluate --input streaming_runs/test_run/episodes.json --out reports/stream_report.md --json_out reports/stream_report.json
```

v0.7 adds template-based growth outputs without any external APIs:
- case-study pack generator (`growth`) from an existing run folder
- public summaries (CN/EN), poster-style markdown, Xianyu listing, Upwork proposal
- pricing sheet + lead intake + scope outline templates
- one-command demo mode (`demo`) that creates deliverables + growth artifacts + zip
v1.0 adds audit-traceable outputs for Chinese enterprise delivery:
- `--manifest` for `export`/`pipeline`/`demo` to produce `manifest.json` + `manifest.md`
- `changelog` to generate bilingual change summaries (improvements/regressions/root-cause hints)
- `risk` to seed a structured risk register from gate + integration evidence
- `fixplan` to generate a remediation checklist + re-test verification sheet, and close items across iterations
- `regressionsuite` to enforce must-run tasks/safety scenarios before release
- `bundle` to assemble the final client package (deliverables + evidence + changelog + risk + fix/verify)
- `portal` to generate an offline static website from a run/bundle (zero external deps)
- `plugins` to discover built-in adapters/metrics/packs/templates
- `examples` registry CLI for reproducible golden demos
- `upgrade` to migrate old artifacts to the v1 contract schema
```bash
cd embodied-eval-kit
python -m venv .venv
source .venv/bin/activate
pip install -e .
python -m embodied_eval.cli evaluate --input examples/sample_log.json --config configs/gate_default.json --out reports/sample_report.md --json_out reports/sample_report.json --executive_out reports/executive_eval.md
python -m embodied_eval.cli compare --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/gate_default.json --out reports/compare.md --json_out reports/compare.json --executive_out reports/executive_compare.md
python -m embodied_eval.cli gate --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/gate_default.json --out reports/gate.md --json_out reports/gate.json --executive_out reports/executive.md
python -m embodied_eval.cli export --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/packs/hospital.json --out_dir deliverables/test_run --zip
python -m embodied_eval.cli redteam --input examples/sample_log.json --out reports/redteam.md --json_out reports/redteam.json
python -m embodied_eval.cli validate --input examples/customer.csv --adapter csv --out reports/validate.md --json_out reports/validate.json
python -m embodied_eval.cli pipeline --baseline examples/customer.csv --candidate examples/ros_export.json --adapter auto --task_mapping integration/mappings/task_id_mapping_example.json --failure_mapping integration/mappings/failure_type_mapping_example.json --config configs/packs/factory_picking.json --out_dir deliverables/test_pipeline --zip
python -m embodied_eval.cli serve --port 8787 --out_dir streaming_runs/test_run --duration_sec 2
python -m embodied_eval.cli growth --run_dir deliverables/test_pipeline --out_dir growth_out/case_001 --project "Embodied Eval Gate" --tone casual
python -m embodied_eval.cli demo --out_dir demo_out --config configs/packs/lab_demo.json --zip
python -m embodied_eval.cli pipeline --baseline examples/customer.csv --candidate examples/ros_export.json --adapter auto --task_mapping integration/mappings/task_id_mapping_example.json --failure_mapping integration/mappings/failure_type_mapping_example.json --config configs/packs/factory_picking.json --out_dir deliverables/test_pipeline --manifest
python -m embodied_eval.cli changelog --run_dir deliverables/test_pipeline --out reports/changelog_cn.md
python -m embodied_eval.cli risk --run_dir deliverables/test_pipeline --out_dir reports/
python -m embodied_eval.cli fixplan --run_dir deliverables/test_pipeline --out_dir reports/
python -m embodied_eval.cli bundle --run_dir deliverables/test_pipeline --out_dir bundles/client_v1 --zip
python -m embodied_eval.cli regressionsuite --suite configs/regression_suite_default.json --input examples/sample_log.json --out reports/regression_suite.md --json_out reports/regression_suite.json
python -m embodied_eval.cli portal --source_dir deliverables/test_pipeline --out_dir portal_out/site_001 --title "Embodied Eval Portal" --zip
python -m embodied_eval.cli tender --source_dir deliverables/test_pipeline --out_dir tender_out/pkg_001 --project "Embodied Eval Gate" --client "DemoClient" --scenario "Factory Picking" --zip
python -m embodied_eval.cli plugins --list
python -m embodied_eval.cli examples --list
python -m embodied_eval.cli examples --run demo_basic --out_dir demo_runs/demo_basic --zip
python -m embodied_eval.cli upgrade --input deliverables/test_pipeline/gate.json --kind gate --out reports/gate_v1.json
pytest -q
```

The backward-compatible command still works:

```bash
python -m embodied_eval.cli evaluate --input examples/sample_log.json --out reports/sample_report.md
```

For non-canonical logs, `evaluate` / `compare` / `gate` support `--allow_noncanonical`:
```bash
python -m embodied_eval.cli evaluate \
  --input examples/customer_jsonl.log \
  --allow_noncanonical \
  --adapter jsonl \
  --task_mapping integration/mappings/task_id_mapping_example.json \
  --failure_mapping integration/mappings/failure_type_mapping_example.json \
  --out reports/customer_eval.md \
  --json_out reports/customer_eval.json
```

Overall and per-task KPIs:
- `success_rate`
- `median_time_sec`
- `drop_rate`
- `collision_rate`
- `retry_rate`
- `safety_event_rate`
- `emergency_stop_rate`
- `near_miss_rate`
- `out_of_bounds_rate`
- `human_proximity_rate`
- `quality_score` (0-100)
- `safety_grade` (A-F)
Also included:
- failure breakdown by type and stage
- redteam tag coverage summary
- acceptance decision with rationale / blockers / required fixes
- client-facing acceptance report rendering (CN/EN templates)
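As a rough illustration of how the headline rates are derived from episode records (the toolkit's own `embodied_eval/metrics.py` is authoritative; `duration_sec` and `success` are assumed field names for this sketch):

```python
from statistics import median

# Minimal illustrative episodes; the real schema is data_schema/trajectory_schema.md.
episodes = [
    {"task_id": "pick_place_01", "success": True,  "duration_sec": 12.0, "object_drop": False},
    {"task_id": "pick_place_01", "success": False, "duration_sec": 20.0, "object_drop": True},
    {"task_id": "pick_place_01", "success": True,  "duration_sec": 14.0, "object_drop": False},
]

# Rates are simple fractions over episodes; times use the median to resist outliers.
success_rate = sum(e["success"] for e in episodes) / len(episodes)
drop_rate = sum(e["object_drop"] for e in episodes) / len(episodes)
median_time_sec = median(e["duration_sec"] for e in episodes)

print(round(success_rate, 3), round(drop_rate, 3), median_time_sec)
```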
```
embodied_eval/
    __init__.py
    schema.py
    metrics.py
    evaluate.py
    cli.py
benchmarks/
configs/
integration/
integration/streaming/
product/
data_schema/
reports/
examples/
tests/
```
- `configs/packs/factory_picking.json` — throughput + drop/collision emphasis
- `configs/packs/hospital.json` — stricter safety thresholds
- `configs/packs/lab_demo.json` — lenient demo thresholds, success-oriented
Run with a specific pack:
```bash
python -m embodied_eval.cli gate --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/packs/factory_picking.json --out reports/factory_gate.md --json_out reports/factory_gate.json
python -m embodied_eval.cli gate --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/packs/hospital.json --out reports/hospital_gate.md --json_out reports/hospital_gate.json
python -m embodied_eval.cli gate --baseline examples/baseline_log.json --candidate examples/candidate_log.json --config configs/packs/lab_demo.json --out reports/lab_gate.md --json_out reports/lab_gate.json
```

Use `export` to generate handoff artifacts for procurement/acceptance review:
```bash
python -m embodied_eval.cli export \
  --baseline examples/baseline_log.json \
  --candidate examples/candidate_log.json \
  --config configs/packs/hospital.json \
  --out_dir deliverables/run_20260215 \
  --zip
```

Generated artifacts include:

- `gate.md`
- `executive.md`
- `client_acceptance_cn.md`
- `client_acceptance_en.md`
- `gate.json`
- `compare.json`
- `config_used.json`
Use `--manifest` to add reproducibility evidence:
```bash
python -m embodied_eval.cli export \
  --baseline examples/baseline_log.json \
  --candidate examples/candidate_log.json \
  --config configs/packs/hospital.json \
  --out_dir deliverables/run_20260215 \
  --manifest
```

This writes:

- `manifest.json` (machine-readable hashes/snapshots)
- `manifest.md` (human-readable audit table)
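The manifest's evidentiary value comes from content hashes of the delivered artifacts. A minimal sketch of the idea (the actual `manifest.json` layout is defined by the `manifest.v1` contract, not this snippet; the run directory and file here are fabricated):

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large logs are not loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Hash every artifact in a run directory into a manifest-style mapping.
run_dir = Path("deliverables/run_demo")
run_dir.mkdir(parents=True, exist_ok=True)
(run_dir / "gate.json").write_text('{"decision": "GO"}', encoding="utf-8")

manifest = {
    str(p.relative_to(run_dir)): sha256_of(p)
    for p in run_dir.rglob("*") if p.is_file()
}
print(json.dumps(manifest, indent=2))
```

An auditor can re-hash the delivered files and compare against this mapping to detect any post-hoc edits.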
Stable schemas are defined in `embodied_eval/contracts/`:

- `episode.v1`
- `gate.v1`
- `compare.v1`
- `manifest.v1`
- `risk_register.v1`
- `fix_plan.v1`
- `portal_index.v1`
Validate programmatically with:
- `embodied_eval.contracts.validate.validate_episode_file`
- `embodied_eval.contracts.validate.validate_artifact_file`
Built-in plugin registry is available via:
```bash
python -m embodied_eval.cli plugins --list
```

Kinds:
- adapters
- metrics
- packs
- templates
Golden examples are listed in `examples/registry.json`.
```bash
python -m embodied_eval.cli examples --list
python -m embodied_eval.cli examples --run demo_basic --out_dir demo_runs/demo_basic --zip
```

Use the best-effort upgrader for old JSON outputs (v0.3~v0.9):

```bash
python -m embodied_eval.cli upgrade --input old_gate.json --kind gate --out gate_v1.json
```

Generate procurement-ready CN docs directly from a run_dir or bundle:
```bash
python -m embodied_eval.cli tender \
  --source_dir deliverables/test_pipeline \
  --out_dir tender_out/pkg_001 \
  --project "Embodied Eval Gate" \
  --client "DemoClient" \
  --scenario "Factory Picking" \
  --items tender/acceptance_items_default.json \
  --zip
```

Integrate into the final bundle:
```bash
python -m embodied_eval.cli bundle \
  --run_dir deliverables/test_pipeline \
  --out_dir bundles/client_v1_1 \
  --with_portal \
  --with_tender \
  --tender_project "Embodied Eval Gate" \
  --tender_client "某某单位" \
  --tender_scenario "工厂抓取" \
  --zip
```

Recommended enterprise workflow: `validate` -> `pipeline --manifest` -> `changelog`/`risk`/`fixplan` -> `bundle --with_portal` -> `growth`
Runnable script:
```bash
bash examples/enterprise_flow.sh
```

Generate a fully offline static site (double-click `index.html`):
```bash
python -m embodied_eval.cli portal \
  --source_dir deliverables/test_pipeline \
  --out_dir portal_out/site_001 \
  --title "Embodied Eval Acceptance Portal" \
  --include_growth \
  --zip
```

Integrate the portal into the final bundle:
```bash
python -m embodied_eval.cli bundle \
  --run_dir deliverables/test_pipeline \
  --out_dir bundles/client_v1 \
  --with_portal \
  --include_growth \
  --zip
```

Example logs and runnable sample commands are documented in `examples/README.md`.
See product packaging docs in:
- `product/tiers.md`
- `product/deliverables_checklist.md`
- `product/sales_one_pager.md`
Teams shipping manipulation policies need a reproducible acceptance gate before deployment.
embodied-eval-kit makes delivery readiness auditable by standardizing trajectory logs, KPI scoring, failure analysis, and regression checks between candidate and baseline runs.
The gate decision is emitted to JSON for `evaluate`, `compare`, and `gate`:

- `decision`: `GO` / `CONDITIONAL_GO` / `NO_GO`
- `risk_level`: `LOW` / `MEDIUM` / `HIGH`
- `rationale`
- `blockers`
- `required_fixes` (top 5)
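Because the decision fields are stable JSON, a CI step can gate releases directly on them. A sketch assuming those field names (the payload below is fabricated for illustration; a real run writes it to e.g. `reports/gate.json`):

```python
import sys

# Fabricated gate payload; in CI you would json.load() the real reports/gate.json.
gate = {
    "decision": "CONDITIONAL_GO",
    "risk_level": "MEDIUM",
    "rationale": "Success rate regressed on pick-and-place tasks.",
    "blockers": [],
    "required_fixes": ["Re-tune grasp retry policy"],
}

# Hard-fail the pipeline on NO_GO; surface the fix list on CONDITIONAL_GO.
if gate["decision"] == "NO_GO":
    print("Release blocked:", "; ".join(gate["blockers"]))
    sys.exit(1)
elif gate["decision"] == "CONDITIONAL_GO":
    print("Conditional GO. Required fixes:")
    for fix in gate["required_fixes"]:
        print(" -", fix)
else:
    print("GO:", gate["rationale"])
```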
- Add the enum value in `embodied_eval/schema.py` (`FailureType`).
- Update benchmark docs in `benchmarks/task_list.md`.
- Ensure producer logs emit the new `failure_type`.
- Re-run `evaluate`/`compare` and inspect the breakdown sections.
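The first step amounts to extending a Python enum. A hypothetical sketch of what the addition might look like (the existing members shown here are invented; see `embodied_eval/schema.py` for the real `FailureType` list):

```python
from enum import Enum

class FailureType(str, Enum):
    # Existing members are illustrative placeholders, not the package's real values.
    GRASP_SLIP = "grasp_slip"
    PERCEPTION_MISS = "perception_miss"
    # New failure type added for your deployment:
    CABLE_SNAG = "cable_snag"

# str-mixin enums round-trip cleanly through JSON logs: lookup by value works.
assert FailureType("cable_snag") is FailureType.CABLE_SNAG
print(FailureType.CABLE_SNAG.value)
```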
- Add the task to `benchmarks/task_list.md`.
- Add the task ID to `TaskName`/`MANIPULATION_TASK_IDS` in `embodied_eval/schema.py`.
- Add a target time in `TASK_TARGET_TIME_SEC` in `embodied_eval/metrics.py`.
- Include episodes in your log JSON.
- Optionally override task-level thresholds and quality penalty weights in the config JSON.
- Re-run `evaluate`/`compare`/`gate`/`export`.
- Export your robot episodes to the JSON schema in `data_schema/trajectory_schema.md`.
- Ensure each episode includes manipulation metadata:
  - `object_drop`, `num_retries`, `grasp_attempts`
  - expanded `safety_events`
  - `failure_type` (+ optional `failure_stage`, `failure_notes`)
  - optional `redteam_tags`
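A minimal episode record carrying the metadata above might look like this (field names beyond those listed, e.g. `success` and `duration_sec`, and the top-level `episodes` wrapper are assumptions for illustration; `data_schema/trajectory_schema.md` is authoritative):

```python
import json

# One hypothetical episode with the manipulation metadata the evaluator expects.
episode = {
    "task_id": "pick_place_01",
    "success": False,
    "duration_sec": 21.4,
    "object_drop": True,
    "num_retries": 2,
    "grasp_attempts": 3,
    "safety_events": [{"type": "near_miss", "t_sec": 8.2}],
    "failure_type": "grasp_slip",
    "failure_stage": "transport",
    "failure_notes": "object slipped after regrasp",
    "redteam_tags": ["cluttered_scene"],
}

# Write a log file in the shape the CLI commands below can consume.
with open("real_log_example.json", "w", encoding="utf-8") as f:
    json.dump({"episodes": [episode]}, f, indent=2)
```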
- Run local acceptance commands:

```bash
python -m embodied_eval.cli evaluate --input /path/to/real_log.json --config configs/gate_default.json --out reports/real_robot_report.md --json_out reports/real_robot_report.json --executive_out reports/real_exec.md
python -m embodied_eval.cli compare --baseline /path/to/baseline.json --candidate /path/to/candidate.json --config configs/gate_default.json --out reports/real_compare.md --json_out reports/real_compare.json --executive_out reports/real_compare_exec.md
python -m embodied_eval.cli gate --baseline /path/to/baseline.json --candidate /path/to/candidate.json --config configs/gate_default.json --out reports/real_gate.md --json_out reports/real_gate.json --executive_out reports/real_gate_exec.md
```

- No external API calls are used.
- Python 3.11+, minimal dependencies (`pydantic`, `pytest`).