All notable user-visible changes to WorldForge are recorded here.
This project follows the spirit of Keep a Changelog. Versioning is currently pre-1.0, so minor releases may still include breaking changes when the public API needs to tighten.
- The wheel now builds from the sdist again. A redundant
`[tool.hatch.build.targets.wheel.force-include]` for `providers/runtime_manifests` duplicated `runtime_manifests/__init__.py` (already included by `packages`), which made `hatchling`'s wheel-from-sdist build fail. `only-packages = true` already ships the JSON manifests, so the force-include was removed; `scripts/test_package.sh` passes again.
- Strategic pivot. WorldForge is now positioned as a harness framework for building
world-model-based workflows for physical AI — the application builder's counterpart to
model-training stacks like Stable World Model. The project is narrowed to one backbone loop:
planning and scoring action candidates with an action-conditioned predictive world model, in
latent space. README, CLAUDE.md, AGENTS.md, and core docs are reframed around that loop, and the
new
`specs/latent-planning-core/` triad records the narrowed scope and the staged removal plan.
- Evaluation suites no longer use the symbolic
`World` runtime. The built-in `physics` and `planning` suites now drive the provider capability surface directly: `physics` calls `forge.predict` over a plain world-state dict (prediction-determinism and action-response checks), and `planning` solves each scenario with `LatentMPCController` over `forge.score_actions` as a cost oracle. The `planning` suite now requires the `score` capability instead of `predict` (satisfied by the default `mock` provider); all suite ids and scenario names are unchanged. Breaking: `EvaluationContext.world` is removed; custom evaluators must read the forge capability surface (for example `context.forge.predict(...)`) instead of a `World`. The internal `world` parameter is dropped from the suite runner methods; `run_report`/`run_report_artifacts` still accept an ignored `world` keyword for backwards compatibility with `World.evaluate`.
- Bundled demos no longer use the symbolic
`World` runtime. The packaged demos (`worldforge-demo-leworldmodel`, `worldforge-demo-lerobot`, `worldforge-demo-so101-replay-trace`, `worldforge-demo-rerun`, plus the policy+score candidate lab, DimOS Go2 replay arena, embodied policy replay comparison, and shared demo helpers) now drive the provider capability surface directly (`forge.predict` / `forge.score_actions` / `forge.select_actions` / `LatentMPCController`) instead of `create_world` / `World.plan` / `World.execute_plan` / `World.add_object`. The four console-script entry points and `main()` signatures are unchanged. Demo output shapes are now capability-centric: they report the selected action, candidate scores, `best_index`, provider, and metadata, and they no longer carry world-persistence fields (`saved_world_id`, `saved_worlds`, `state_dir` persistence, exported-world JSON) or a `--state-dir` flag. The shared `make_blue_cube()` helper returns a standalone `SceneObject` with a stable id.
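
To illustrate the kind of check the `physics` suite now runs over a plain world-state dict, here is a minimal sketch of a prediction-determinism check and an action-response check. `toy_predict`, the state layout, and the action fields are hypothetical stand-ins for illustration only; they are not the WorldForge provider API or the suite's actual implementation.

```python
import copy


def toy_predict(world_state: dict, action: dict) -> dict:
    """Hypothetical stand-in for a provider predict call: shift one object by a delta."""
    next_state = copy.deepcopy(world_state)  # never mutate the caller's state dict
    obj = next_state["objects"][action["object_id"]]
    obj["position"] = [p + d for p, d in zip(obj["position"], action["delta"])]
    next_state["step"] = world_state["step"] + 1
    return next_state


def check_prediction_determinism(predict, state: dict, action: dict) -> bool:
    """The same state and action must produce the same prediction twice."""
    return predict(state, action) == predict(state, action)


def check_action_response(predict, state: dict, action: dict, noop: dict) -> bool:
    """A non-trivial action must lead to a different state than a no-op."""
    return predict(state, action) != predict(state, noop)


state = {"step": 0, "objects": {"cube": {"position": [0.0, 0.0, 0.0]}}}
move = {"object_id": "cube", "delta": [0.1, 0.0, 0.0]}
noop = {"object_id": "cube", "delta": [0.0, 0.0, 0.0]}

results = {
    "determinism": check_prediction_determinism(toy_predict, state, move),
    "action_response": check_action_response(toy_predict, state, move, noop),
}
print(results)
```

Because the checks only need a callable and plain dicts, the same shape works for any predictor a host wires in, with no persisted world object involved.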
- Removed the symbolic
`World` runtime, its local JSON persistence, the `worldforge world` CLI command, and the related public surface (`World`, `Plan`, `Prediction`, `PlanExecution`, `Comparison`, `StructuredGoal`, `HistoryEntry`, and `DoctorReport.world_count`). Planning now runs over plain world-state dicts: `forge.predict` rolls a `world_state` dict forward and `LatentMPCController` plans by scoring action candidates over the `score`/`predict` surface. The `worldforge predict` CLI seeds a state dict, runs `predict`, and prints the result without persisting anything. Durable world-state persistence is host-owned.
- Removed the world-authoring surface that sits outside the backbone loop: the JSON scenario DSL
and scenario matrices/galleries (`worldforge.scenarios` plus the `worldforge scenario` command), world diff/patch (`worldforge.world_diff` plus `worldforge world diff`), world migration previews (`worldforge.world_migration_preview` plus `worldforge world migration-preview`), and local-state preflight (`worldforge.persistence_preflight` plus `worldforge world preflight`), along with their public exports, documentation pages, and CLI subcommands.
- The built-in
`mock` provider now implements the `score` capability as a deterministic cost oracle (`costs`, `lower_is_better`, goal-distance when `info["goal"]["target"]` is present). This makes the latent MPC backbone loop and score-only workflows runnable checkout-safe on the default provider, and adds `sample_contract_score_info`/`sample_contract_score_action_candidates` to the `worldforge.testing` contract helpers. Score-only capability negotiation is now satisfiable by `mock` without an optional runtime.
- Added
`examples/latent_mpc_planning.py`, a runnable, checkout-safe latent MPC planning loop that drives `LatentMPCController` over an in-example score (cost) oracle: it samples action candidates, scores them as costs, keeps elites, refits, and executes the lowest-cost action under a receding horizon, with no credentials, GPU, or robot required.
- Added
`worldforge-demo-so101-replay-trace`, a checkout-safe SO-101 manipulation replay demo that scores deterministic 6D joint-action candidates, selects the lowest-cost pick-and-place action, mock-executes the selected object placement, and emits a reusable robot decision trace with candidate scores, selected action, measured replay outcome, and counterfactual rejected actions. The demo is shaped after public `lerobot/svla_so101_pickplace` metadata but does not install LeRobot, torch, or DimOS, and does not connect to robot hardware.
- Added
`examples/go2-controlbench-decisiontrace/run.py`, a checkout-safe Go2 Air ControlBench DecisionTrace replay that consumes a compact public `espejelomar/go2-air-controlbench-v1` positive-regret trace, reranks candidate Unitree sport-mode commands through WorldForge's `score` capability, and reports measured native-odom regret against the best counterfactual command without importing DimOS, Unitree SDKs, or Hugging Face datasets, and without connecting to hardware.
- Added a first-slice latent-MPC controller for score-provider planning. The new
`worldforge.control` module exposes `LatentMPCController`, `PlannerConfig`, `ScoreCandidateEncoder`, `ScoreCandidateBatch`, and `ActionPlanCandidateEncoder`, and `World.plan(planner="latent-mpc", score_provider=..., score_info=..., goal_info=..., planner_config=...)` now runs a pure-Python Gaussian CEM solve over `score_actions(...)`. The planner requires an explicit score provider, records `control_mode="mpc"`, `optimizer="cem"`, candidate counts, and per-iteration score/cost diagnostics in plan metadata, and leaves tensor encoding, environment stepping, robot execution, optional ML runtimes, and policy warm-start outside the base package.
- Added a non-interactive TensorBoard launcher CLI:
`worldforge-open-tensorboard --logdir <path> [--probe] [--no-browser] [--keep-running] [--ready-timeout 60] [--poll-interval 0.5]`. It wraps the same launch / poll / probe flow the TUI uses behind a single shell command; `--probe` fetches the index page and asserts it contains the `TensorBoard` marker so wiring changes can be validated end-to-end from a script. Exit codes are `0` on success, `1` on ready timeout / probe failure, `2` on bad input. The new `worldforge.harness.tensorboard_launcher` module owns the shared helpers (`viewer_command`, `port_open`, `wait_until_ready`, `probe_html`, `launch`); the TUI now imports them so both surfaces stay in sync. Issue #310.
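
The sample, score, keep-elites, refit loop described in the latent MPC entries above can be sketched in pure Python. This is an illustrative Gaussian CEM over a toy 2D goal-distance cost oracle under a receding horizon. `score_plans`, `cem_plan`, the action clipping, and every parameter value are assumptions made for this sketch; they do not reflect the `LatentMPCController` or `PlannerConfig` API.

```python
import math
import random


def score_plans(position, plans, target):
    """Toy cost oracle (lower is better): summed distance to target along each rollout."""
    costs = []
    for plan in plans:
        x, y = position
        total = 0.0
        for dx, dy in plan:
            x, y = x + dx, y + dy
            total += math.dist((x, y), target)
        costs.append(total)
    return costs


def cem_plan(position, target, rng, horizon=3, samples=64, elites=8,
             iterations=4, max_step=0.5):
    """Gaussian CEM: sample plans, score them as costs, keep elites, refit mean/std."""
    mean = [[0.0, 0.0] for _ in range(horizon)]
    std = [[0.5, 0.5] for _ in range(horizon)]
    best_plan, best_cost = None, math.inf
    for _ in range(iterations):
        plans = []
        for _ in range(samples):
            plan = []
            for t in range(horizon):
                # Sample each action dimension and clip it to the allowed step size.
                action = tuple(
                    max(-max_step, min(max_step, rng.gauss(mean[t][d], std[t][d])))
                    for d in range(2)
                )
                plan.append(action)
            plans.append(plan)
        costs = score_plans(position, plans, target)
        ranked = sorted(range(samples), key=lambda i: costs[i])
        if costs[ranked[0]] < best_cost:
            best_plan, best_cost = plans[ranked[0]], costs[ranked[0]]
        elite_plans = [plans[i] for i in ranked[:elites]]
        for t in range(horizon):
            for d in range(2):
                values = [p[t][d] for p in elite_plans]
                mu = sum(values) / len(values)
                var = sum((v - mu) ** 2 for v in values) / len(values)
                mean[t][d], std[t][d] = mu, max(math.sqrt(var), 1e-3)
    return best_plan, best_cost


rng = random.Random(0)
position, target = (0.0, 0.0), (1.0, 1.0)
start_distance = math.dist(position, target)
for _ in range(5):
    # Receding horizon: execute only the first action of the best plan, then replan.
    plan, cost = cem_plan(position, target, rng)
    dx, dy = plan[0]
    position = (position[0] + dx, position[1] + dy)
final_distance = math.dist(position, target)
print(f"distance to target: {start_distance:.3f} -> {final_distance:.3f}")
```

The running cost (summed over the rollout rather than taken only at the final step) is a design choice of this sketch: it rewards plans whose first action already makes progress, which matters when only that first action is executed.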
- Removed the generative provider surface from the current public contract: generate, transfer, and free-form reason capabilities, their public result models, capability protocols, provider registration helpers, CLI benchmark operations, fixture corpus entries, and provider conformance checks are no longer part of WorldForge.
- Removed the Cosmos media and Runway provider integrations, smoke scripts,
runtime manifests, payload fixtures, and generated provider catalog entries.
Cosmos-Policy remains as a host-owned robotics
`policy` adapter.
- Removed the old world-creation/provider/evaluation/benchmark Harness TUI screens and launch command. The optional Textual surface is now scoped to the robotics showcase report UI; planning, diagnostics, benchmarks, run history, and adapter workbench flows remain CLI/library workflows.
- Provider routing failed-attempt records now redact exception messages before
serialization.
`RoutingAttempt` validates optional reason/error text and `route_capability(...)` stores a sanitized `str(exc)` so bearer tokens, API key assignments, and signed artifact URL query strings do not leak through routing result artifacts even when an adapter raises a raw exception.
- `RoutingResult` now rejects incoherent public artifacts: succeeded results must include a matching final succeeded attempt, chosen provider, and value; failed results cannot carry stale values or succeeded attempts; and attempt capabilities must match the result capability.
- `WorkflowTrace` now rejects a manually supplied top-level status that contradicts the derived step status, so attachable trace artifacts cannot report `success` while containing failed, running, pending, or skipped-only step state.
- Raw
`run_manifest.json` validation now enforces the same status enum as `LiveSmokeRunManifest` construction, rejecting externally supplied manifests whose status is not `passed`, `failed`, or `skipped`.
- Live-smoke evidence registry validation now rejects stale mixed state:
passed/failed rows must link sanitized evidence and cannot carry a
`skip_reason`, while skipped/not-run rows must carry a `skip_reason` and cannot link an `artifact_path`.
- The quality dashboard now treats row-level core-performance budget failures
as failed even when a stale top-level
`passed: true` flag is present, and marks malformed core-performance artifacts as warnings instead of reporting that all recorded rows passed.
- The quality dashboard now sanitizes raw-detail object keys as well as values before writing JSON or Markdown, preserving colliding redacted keys with deterministic suffixes so hostile artifact metadata cannot leak host-local paths, signed URLs, or secret-shaped key names.
- Dependency-audit evidence now sanitizes raw-detail object keys as well as values before writing safe-to-attach JSON or Markdown, preserving colliding redacted keys with deterministic suffixes so direct API callers cannot leak host-local paths, signed URLs, or secret-shaped key names.
- Release notes drafts now sanitize changelog entries, closed issue metadata, release-evidence text, and maintainer caveats with the shared observable-text redactor, so token assignments, bearer headers, signed URLs, and host-local paths do not leak into draft Markdown.
- `scripts/smoke_gr00t_policy.py --start-server` now redacts forwarded secret-shaped server arguments and host-local paths from its startup command line before writing stderr, while still launching the raw host-owned command.
- Release notes drafts now derive
`needs-validation-review` from failed `validation_gates` rows as well as `validation_summary.failed`, so stale release-evidence summaries cannot mark a draft ready while an individual gate records failure.
- Release evidence reports now redact host-local paths for externally supplied
artifacts and manifests. Repo-relative artifacts remain linkable, while
outside-checkout artifacts are recorded as
`<host-local-path>/<name>` with their hashes and sizes instead of leaking `/Users`, `/private`, or temporary workspace paths into attachable JSON or Markdown.
- Release evidence validation gate records now sanitize command output tails, skipped-gate reasons, and known limitations before writing JSON or Markdown, so failed gate logs cannot leak bearer tokens, signed URLs, secret-shaped assignments, or host-local paths into attachable release artifacts.
- Live-smoke
`run_manifest.json` artifact references no longer preserve host-local absolute paths. Manifest builders now serialize local artifacts under the run directory as relative paths, reject absolute paths outside that directory, and continue stripping query strings from remote artifact URLs.
- `WorldForge.embed()` now rejects empty or whitespace-only text before provider dispatch, matching the capability fixture contract. Score and policy+score planning now choose a direction-aware success heuristic: lower-is-better scores keep the inverse cost heuristic, bounded higher-is-better utility scores use the best utility value, and unbounded utility scores fall back to a neutral probability rather than pretending a raw utility is a calibrated probability.
- `ActionScoreResult` and `worldforge.testing.assert_score_conformance(...)` now reject score results whose `best_index` contradicts `lower_is_better`, so planners and adapter contract tests cannot accept the wrong candidate for the declared score direction.
- Capability-specific provider conformance helpers now normalize
`WorldForgeError` validation failures from invalid public result construction and configured-provider `ProviderError`s into explicit `AssertionError` contract failures for predict, embed, score, and policy checks. The helpers also revalidate returned mutable result objects for finite numeric fields and JSON-native score/policy payloads before accepting provider output.
- The
`t` shortcut in `RoboticsShowcaseApp` no longer fails silently because TensorBoard cannot import `pkg_resources`. The launched `uvx` command now pins `--with "setuptools<81"` so `pkg_resources` is available (setuptools 81+ removed it; TensorBoard still imports it at startup). Captured `tensorboard.stderr.log` files for previous-version runs showed the `ModuleNotFoundError`; the new command resolves a working environment. Issue #310.
- The
`t` shortcut in `RoboticsShowcaseApp` no longer opens the browser before TensorBoard has bound the port (causing a blank page on first-run `uvx` resolves). The fixed `set_timer(2.5, ...)` is replaced with a Textual background worker that polls `localhost:6006` every ~0.5 s for up to ~60 s and only calls `webbrowser.open` once the port responds. The action also stops swallowing the TensorBoard subprocess's output: stdout/stderr are now captured to `tensorboard.stdout.log` and `tensorboard.stderr.log` inside the run's log directory. Timeouts now emit an error notification pointing at the stderr log path. Issue #308.
- The
`t` shortcut in `RoboticsShowcaseApp` now actually shows TensorBoard. The launched command pins `--port 6006`, and the action additionally schedules a `webbrowser.open("http://localhost:6006/")` via a Textual timer (~2.5 s after launch) so the run becomes visible without manual browser navigation, for parity with the desktop Rerun viewer. The URL is also surfaced in `RoboticsTensorBoardPane` and in the action notification so headless / remote users can copy-paste it; a fallback warning is emitted when `webbrowser.open` returns false. Issue #306.
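
The poll-then-open behavior described for the `t` shortcut can be sketched with the standard library alone. The helper names echo `port_open` and `wait_until_ready` from the launcher entry, but the signatures and defaults here are assumptions for illustration, not the `worldforge.harness.tensorboard_launcher` API. The demo polls a throwaway local listener rather than a real TensorBoard process.

```python
import socket
import time


def port_open(host: str, port: int, timeout: float = 0.25) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(host: str, port: int, ready_timeout: float = 60.0,
                     poll_interval: float = 0.5) -> bool:
    """Poll until the port accepts connections or the timeout elapses."""
    deadline = time.monotonic() + ready_timeout
    while True:
        if port_open(host, port):
            return True
        if time.monotonic() + poll_interval > deadline:
            return False
        time.sleep(poll_interval)


# Stand-in for the viewer process: a listener on an OS-assigned free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

ready = wait_until_ready("127.0.0.1", demo_port, ready_timeout=2.0, poll_interval=0.05)
listener.close()
print("ready" if ready else "timed out; check the stderr log")
# A caller would open the browser only when `ready` is True, for example with
# webbrowser.open(f"http://localhost:{demo_port}/").
```

Gating the browser on a successful connection, instead of a fixed delay, is what avoids the blank first page when the viewer takes longer than expected to bind its port.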
- Added a
`t` keybinding to `RoboticsShowcaseApp` that launches the run's TensorBoard log directory via `uvx --from "tensorboard>=2.16,<3" tensorboard --logdir <path>` in a detached subprocess, mirroring the existing `o` shortcut for Rerun. A new `RoboticsTensorBoardPane` surfaces the resolved log directory, run name, `events_written` status, the viewer command, and the shortcut hint. The pane and binding gracefully degrade when the run summary has no `"tensorboard"` block. Issue #304.
- Added an optional TensorBoard bridge for inspecting the LeWorldModel
checkpoint used during local inference in the robotics showcase. The new
`worldforge.tensorboard` module exposes `TensorBoardLogConfig`, `TensorBoardSession`, `TensorBoardCheckpointInspector`, and a `create_tensorboard_inspector` helper. It writes sanitized provenance text, per-candidate cost scalars and a histogram, latency metrics, and a per-provider-event text feed to a local `tfevents` directory. `scripts/robotics-showcase` enables the writer by default (unless `--no-tensorboard`, `--health-only`, or `--json-only` is passed) and gains `--tensorboard`, `--tensorboard-logdir`, `--tensorboard-run-name`, `--tensorboard-flush-secs`, and `--no-tensorboard` flags. Install with `uv add "worldforge-ai[tensorboard]"`. Base WorldForge still depends only on `httpx`. See `docs/src/tensorboard.md` for the tag layout and programmatic API. Issue #302.
- Added a public-API snapshot test.
`tests/fixtures/public_api/exports.json` records the current export set for `worldforge`, `worldforge.testing`, `worldforge.observability`, `worldforge.providers`, and `worldforge.capabilities`; `tests/test_public_api_snapshot.py` fails loudly on any add/rename/remove with a clear diff and the regenerate command. Intentional public-API changes regenerate the snapshot via `uv run python scripts/update_public_api_snapshot.py` or by setting `WORLDFORGE_UPDATE_PUBLIC_API_SNAPSHOT=1`. `docs/src/api-stability.md` now points at the snapshot as the authoritative Stable surface.
- Added scenario inheritance via the new optional
`extends` field (schema version 2). A child scenario can name a single relative parent path; resolution merges top-level keys with the child winning, detects cycles, rejects absolute paths, bounds chain depth, and is validated end-to-end through new fixtures under `examples/scenarios/inheritance/` plus tests in `tests/test_scenario_inheritance.py`. Schema version 1 scenarios continue to validate without changes.
- Added a typed retention policy and
`worldforge runs prune` subcommand. The new `worldforge.runs_prune` module ships `RunsRetentionPolicy`, `PruneCandidate`, `PruneReport`, `plan_prune`, `apply_prune`, and `parse_runs_retention`. Default behavior is dry-run; `--apply` actually removes selected directories. `--max-age-days`, `--keep-latest`, and repeatable `--family <kind>` control the policy; a 24-hour safety window blocks deletion of fresh runs unless `--max-age-days=0` is passed. The delete path checks `target.is_relative_to(<workspace>/runs)` on the fully-resolved paths so symlinked or crafted run paths cannot escape the workspace. `keep_latest` is scoped to the family filter so a non-matching newer run cannot consume a keep slot when `--family` is set. CLI flags passed in either argparse form (`--max-age-days 7` or `--max-age-days=7`) override the profile. Config profiles can carry a `runs_retention` block consumed via `--retention-profile <path>` with explicit CLI flags still overriding. Invalid retention profile shapes raise typed `WorldForgeError` rather than leaking `ValueError`/`AttributeError`, and `shutil.rmtree` failures during `--apply` are wrapped as `WorldForgeError` for a stable CLI envelope. Documentation lives in `docs/src/run-index.md` under the retention section.
- Added an adoption case-study gallery, reusable case-study template, Adoption Story issue template, and smoke tests for future submitted adoption stories.
- Added a runnable capability protocol mini-demo with docs and tests for in-process predictor, policy, and cost registration.
- Added a checkout-safe external provider package demo workflow.
`scripts/demo_showcases.py run external-provider-package` generates a temp provider package, proves `worldforge.providers` entry-point discovery, disabled discovery, duplicate-name handling, and missing optional dependency skip reporting, and preserves a safe discovery report without publishing or mutating tracked source.
- Expanded the custom evaluation suite example into a checkout-safe walkthrough. The demo now runs a
deterministic custom suite against
`mock`, preserves JSON/Markdown/HTML/failure-gallery artifacts with provenance, and includes one controlled failed case for report review without claiming model quality.
- Added a checkout-safe policy+score candidate lab. The demo builds deterministic bounded move candidates, preserves raw policy actions, scores candidate plans, records the selected action and workflow trace, and captures invalid candidate bounds plus missing-translator failures without requiring a robot, simulator, or checkpoint.
- Added a checkout-safe fixture drift review walkthrough. The demo builds a temp fixture snapshot manifest, shows missing fixture, digest drift, schema-version drift, unsafe path, and intended-update review outputs, and preserves the approved update path without touching tracked fixtures.
- Added a checkout-safe capability negotiation preflight demo. The workflow preserves negotiation JSON/Markdown for ready, missing-config, missing-dependency, unsupported, and not-registered cases across predict, embed, score, policy+score, and evaluation workflow shapes without installing dependencies or executing fallback workflows.
- Added a checkout-safe embodied policy replay comparison. It compares LeRobot, GR00T, and Cosmos-Policy policy contracts side by side, preserves provider-specific raw action metadata, records missing-translator blockers, and links each provider to its prepared-host live smoke.
- Added a scenario gallery under
`examples/scenarios/` covering successful world setup, intentionally failed expectations, invalid action triage, evaluation-oriented setup, and report/export output through the existing `worldforge scenario` CLI.
- Added a checkout-safe release readiness drill script that renders clean-pass and controlled failure release-evidence artifacts, reports host-owned optional-runtime skips, and records the first failed gate without publishing, tagging, signing, or creating a release.
- Added a non-developer evidence review demo that builds a static HTML/JSON/Markdown review package from evaluation, benchmark, world-diff, and issue-bundle artifacts while escaping display text and marking unsafe host-local references as local-only.
- Added a provider failure mode gallery demo and docs page covering fixture-backed parser errors, provider errors, retry exhaustion, missing config, optional-runtime setup, scaffold boundaries, unsupported behavior, and safe artifact handling with first triage commands.
- Added typed provider lifecycle diagnostics with
`ProviderLifecycleResult`, `ProviderLifecycleStatus`, optional provider-owned `preflight`, `warmup`, and `teardown` hooks, safe no-op/skipped defaults, protocol-wrapper support, and doctor/provider-info JSON output for lifecycle readiness and skip reasons.
- Added
`worldforge runs compare --mode regression` for baseline-vs-candidate review across preserved benchmark, evaluation, and demo-showcase runs. Regression reports include metric deltas, budget status changes, new and removed failures, safe artifact drift, provenance differences, and unsafe artifact exclusion counts in JSON, Markdown, CSV, and HTML.
- Added scenario parameter matrices for bounded checkout-safe sweeps. Scenario files can now
declare JSON-native
`matrix.parameters`, use whole-value placeholders for provider names, object positions, action targets, and expected artifact values, validate every expanded case before execution, and return aggregate pass/fail counts plus failed-case details from `worldforge scenario run`.
- Added evaluation dataset manifest contracts.
`worldforge eval --dataset-manifest <path>` now cites schema-versioned manifest references in provenance, with validation for local fixture paths, remote references, host-owned asset records, checksums, license/provenance/privacy/safety fields, and evidence-bundle copying of safe source-controlled manifest files without embedding datasets.
- Added
`worldforge provider contract` for external adapter authors. It runs metadata and capability-aware provider contract checks for registered providers or direct `module:factory` paths, emits safe-to-attach JSON/Markdown evidence with skipped host-owned checks and validation commands, and keeps live provider calls behind an explicit `--live` flag.
- Added runtime asset manifests for prepared-host optional runtimes. LeWorldModel and
LeRobot/LeWorldModel smoke outputs now include safe
`runtime_assets` references in run manifests, while full local-only paths, cache roots, and checkpoint bytes stay out of attachable evidence.
- Added non-secret JSON/TOML configuration profiles for repeatable eval and benchmark CLI defaults.
Profiles reject secret-looking keys and unsafe paths, and preserved run manifests now include
safe
`config_profile` provenance with the profile digest instead of profile contents.
- Added a safe report-renderer registry for comparison and evidence bundle artifacts. Built-in JSON/Markdown/CSV/HTML renderers keep their output unchanged, while external code can register validated safe-to-attach or local-only renderers without file-based plugin loading.
- Added read-only world migration previews through
`worldforge world migration-preview`. The report covers persisted worlds and exported world JSON, schema versions, required canonicalization changes, invalid fields, unsafe IDs, bounding-box corrections, safe-to-attach status, and whether an explicit migration can be applied safely without silently rewriting local state.
- Added schema-versioned workflow trace artifacts for composed operations. Plans now include sanitized trace metadata, evaluation reports export trace JSON/Markdown and render trace tables in HTML, provider events can be converted into trace steps, and the optional Rerun artifact logger can record workflow traces without changing provider capability semantics.
- Added
`docs/src/artifact-schemas.md`, an ownership and migration map for public and semi-public JSON artifact families. The page records each schema's owner, version field, validation surface, docs/CLI entry point, and migration rules, and the docs test suite now guards schema-version exports and required artifact families against missing ownership notes.
- Added
`scripts/check_optional_import_boundaries.py`, a checkout-safe audit that statically checks optional runtime imports and verifies base package, CLI, Rerun, provider, and non-TUI harness imports do not load Textual, Rerun, torch, stable-worldmodel, LeRobot, GR00T, or Cosmos-Policy runtime packages.
- Added
`scripts/check_docs_snippets.py`, a marker-based docs snippet gate for selected Python and JSON examples across the Python API, scenarios, provider routing, external provider, benchmarking, artifact, and report docs. The gate executes Python snippets in a temp workspace, parses JSON snippets, applies scenario and benchmark schema checks, and requires explicit host-owned, credentialed, or illustrative skip markers.
- Added
`worldforge.testing` deterministic controls for artifact and report tests: `DeterministicClock`, `DeterministicIdFactory`, `deterministic_run_workspace`, `stable_snapshot`, `stable_path`, and `stable_json_dumps`. Evidence bundle generation, issue bundle tests, release evidence tests, preserved benchmark snapshots, scenario result snapshots, and live-smoke run manifest tests can now pin clocks, IDs, temp paths, volatile fields, and sorted JSON without weakening real runtime timing.
- Added a generated Provider Configuration Index that derives each catalog provider's required and optional inputs, optional packages, credential gates, prepared-host assets, default request timeouts, first diagnostic command, smoke command, and evidence level from provider metadata and runtime manifests. The provider-docs generator now checks that index alongside the provider catalog tables.
- Added user-facing error-message regression coverage for CLI world/scenario failures, unsupported capability names, provider budget failures, and secret/path redaction. CLI errors now include a command owner context plus a first triage step while redacting signed URLs, secret assignments, and host-local paths.
- Added contributor task starter packs for provider, docs-only, demo, artifact/report, evaluation/benchmark, and CLI/operator work, with issue-template links and docs tests guarding required sections, validation commands, evidence artifacts, docs/changelog expectations, and review checklists.
- Added
`scripts/generate_release_notes.py`, a maintainer-editable release notes draft generator that assembles `CHANGELOG.md`, optional closed GitHub issue metadata, release evidence JSON, validation summaries, docs/public-surface links, caveats, compatibility notes, and host-owned optional runtime evidence without publishing a GitHub release or changing tag/signing workflows.
- Added
`scripts/generate_dependency_audit_evidence.py`, a checkout-safe dependency-audit evidence wrapper that runs the documented locked `uv export` plus `pip-audit` flow through a temporary requirements file and writes JSON/Markdown summaries with tool versions, dependency-set digest, vulnerability summaries, explicit ignore rationales, sanitized command output, and first triage steps.
- Added
`scripts/generate_quality_dashboard.py`, a local quality dashboard generator that reads release evidence, dependency-audit evidence, and core-performance output, then writes JSON and Markdown summaries with normalized pass/fail/warning/skip/not-run statuses, command lines, timestamps, raw failure details, skipped host-owned checks, and the first failed gate.
- Added a public custom evaluation-suite authoring API:
`EvaluationSuite.custom(...)`, process-local `EvaluationSuite.register(...)`/`from_registered(...)`, callable `EvaluationScenario.from_callable(...)`, `EvaluationContext`, and `EvaluationScenarioOutcome`. Custom reports reuse the existing provenance, failure-gallery, artifact, and claim-boundary machinery while rejecting non-JSON metric payloads.
- Added provider-agnostic action candidate helpers for score and policy+score workflows:
`cartesian_offset_candidates(...)`, `object_near_candidates(...)`, `swap_action_candidates(...)`, `bounded_move_grid_candidates(...)`, `normalize_action_candidates(...)`, and `action_candidates_to_score_payload(...)`.
- Added fixture snapshot governance for source-controlled JSON fixtures. The new
`worldforge.testing.fixture_snapshots` helpers and `scripts/manage_fixture_snapshots.py` validate `tests/fixtures/fixture-snapshots.json` against capability fixtures, provider payload fixtures, benchmark inputs, scenario files, and runtime asset manifests, with review output that separates accidental drift from entries marked `intended-update`.
- Added a checkout-safe GR00T PolicyClient replay flow in the robotics showcase flows. The flow replays a
sanitized saved policy response through
`GrootPolicyClientProvider`, validates `eef_9d`, `gripper_position`, and `joint_position` tensor shapes, translates the trajectory into WorldForge actions, and preserves a replay artifact without requiring CUDA, checkpoints, raw observations, private endpoints, or GPU logs.
- Added
`docs/src/roadmap-expansion-2.md`, a second 30-issue roadmap expansion across production-grade quality/DevX/docs, demos and end-to-end showcases, and new features. The batch focuses on artifact schema governance, executable docs snippets, optional dependency import boundaries, provider configuration indexing, external-provider demos, capability preflight demos, scenario matrices, runtime asset manifests, report renderer extension points, and composed workflow traces.
- Added static HTML report export for evaluation reports, benchmark reports,
preserved-run comparisons, and issue-ready bundles.
`worldforge eval`, `worldforge benchmark`, `worldforge runs compare`, and `worldforge runs bundle` accept `--format html`; `worldforge runs bundle` always also writes `summary.html` and `issue.html` to the bundle directory. The HTML output is self-contained: inline CSS only, no JavaScript, no external assets, no anchor tags. All user-supplied text is escaped via `html.escape`. New public surface: `worldforge.html_report.render_evaluation_html`, `render_benchmark_html`, `render_comparison_html`, `render_evidence_bundle_html`, `render_issue_bundle_html`, `HTML_REPORT_SCHEMA_VERSION`. Documentation lives at `docs/src/html-reports.md`, including when to prefer HTML versus JSON/Markdown.
- Added JSON-native world state diff and patch artifacts.
`worldforge world diff <source> <target>` walks two persisted worlds (default) or two exported JSON files (with `--source-path --target-path`) and emits a schema-versioned diff covering top-level fields (`name`, `provider`, `description`, `step`, `metadata`), scene-object additions/removals/updates with before/after payloads, and a history summary. The companion `WorldPatch.from_diff(diff)` and `apply_patch(state, patch)` helpers apply changes to a base snapshot, validating each operation through `SceneObject`, `Position`, and `BBox` so traversal-shaped IDs, incoherent bounding boxes, malformed pose payloads, or removing missing objects raise `WorldStateError` instead of silently corrupting state. New public surface: `worldforge.world_diff.diff_worlds`, `diff_worlds_from_paths`, `apply_patch`, `WorldDiff`, `WorldPatch`, `ObjectChange`, `WorldFieldChange`, `WORLD_DIFF_SCHEMA_VERSION`. Documentation lives at `docs/src/world-diff.md`.
- Added a JSON-native scenario definition format and a runner. The new
worldforge.scenarios module ships Scenario, ScenarioObjectSpec, ScenarioAction, ScenarioExpectedArtifact, and ScenarioResult. A scenario captures a checkout-safe recipe: provider, initial scene objects, an ordered sequence of typed actions (move_to, spawn_object, predict), and expected artifacts (object_count, step, object_position). The recipe runs end-to-end through World.predict without arbitrary Python execution. worldforge scenario validate <path> and worldforge scenario run <path> validate and execute scenario files; the run exits non-zero when any expectation fails. New public surface: load_scenario, parse_scenario, run_scenario, Scenario, ScenarioAction, ScenarioObjectSpec, ScenarioExpectedArtifact, ScenarioExpectationCheck, ScenarioResult, SCENARIO_SCHEMA_VERSION, SCENARIO_ACTION_KINDS. Sample scenarios live under examples/scenarios/. Documentation lives at docs/src/scenarios.md.
- Added a local run artifact index.
worldforge runs index --workspace-dir <dir> walks <dir>/runs/ read-only and emits a sanitized summary of every preserved run workspace, with optional filters for provider (substring), capability, status, date range, and safe-artifact type. Output is JSON, Markdown, or CSV. Stale or malformed run directories surface as typed issue rows (manifest-missing, manifest-unreadable, manifest-invalid-json, manifest-not-object) instead of crashing the walk. New public surface: worldforge.harness.run_index.build_run_index, RunIndex, RunIndexIssue, RUN_INDEX_SCHEMA_VERSION. Documentation lives at docs/src/run-index.md, including retention/cleanup interaction guidance.
- Added typed provider routing and fallback policies. The new
worldforge.provider_routing module ships ProviderRoutingPolicy, RoutingAttempt, RoutingResult, and route_capability(policy, forge, *, invoke). Routing tries a preferred provider followed by an ordered fallback list, validates capability compatibility before invoking each provider, and records every attempt (succeeded, failed, skipped-not-registered, or skipped-incompatible) in the returned result. Failures are captured with the exception class name and str(exc) and never silently masked; the underlying observable-capability ProviderEvent stream is preserved unchanged. New public surface: ProviderRoutingPolicy, RoutingAttempt, RoutingResult, ROUTING_ATTEMPT_STATUSES, route_capability. Documentation lives at docs/src/provider-routing.md, including guidance on when fallback is and is not appropriate.
- Added Python entry-point discovery for external provider packages. Third-party adapters can
register through the
worldforge.providers entry-point group; WorldForge auto-registers the resulting providers when their configured() check passes and records typed skip reasons (missing dependency, duplicate name, non-callable factory, factory raised) on the new WorldForge.entry_point_discovery() report. A constructor flag and the WORLDFORGE_DISABLE_ENTRY_POINTS environment variable both turn discovery off. New public surface: discover_entry_point_providers, EntryPointDiscoveryReport, EntryPointSkip, ENTRY_POINT_GROUP, and ENTRY_POINT_DISABLE_ENV_VAR. Documentation lives at docs/src/external-providers.md.
- Added capability negotiation reports through the new
worldforge negotiate CLI subcommand and worldforge.capability_negotiation Python surface. Reports state, before a workflow runs, whether the registered and known providers can satisfy a capability set such as predict-only, embed-only, score-only, policy-plus-score, or one of the evaluation suites' required-capability shapes. For each capability slot the report lists every candidate provider's registration, configuration, health, capability compatibility, readiness state (ready, missing-config, missing-dependency, unsupported, not-registered), and a typed reason; blocked workflows surface focused recommended actions. Output is JSON + Markdown; the CLI exits non-zero when at least one workflow is blocked, which makes it suitable as a CI guard.
- Upgraded the release evidence generator into a release-readiness command that writes Markdown and
JSON summaries, can execute checkout-safe gates with
--run-gates, records skipped and failed gate triage steps, and marks optional live-provider evidence as host-owned unless a prepared-host run manifest is linked.
- Added a public API stability and deprecation policy covering stable, provisional, experimental, and internal surfaces, with migration expectations for provider capabilities and artifact schemas.
- Added a troubleshooting matrix for public error families, provider contract failures, benchmark budget exits, and docs-build warnings with owner, command, artifact, and first-triage guidance.
- Added a documented-command drift checker for README, CLI docs, examples, operations, playbooks, and AGENTS command surfaces, and wired it into the release-readiness gate.
- Added a checkout-safe core performance budget checker for world persistence, benchmark fixture loading, provider diagnostics, evidence-bundle creation, and report rendering, and wired it into release-readiness documentation.
- Added a contributor bootstrap doctor for Python, uv, docs tooling, GitHub CLI auth, source-tree shape, and optional runtime skip reasons.
- Added supply-chain and artifact integrity documentation covering current package/evidence gates, hashes, unsafe artifact exclusions, and future SBOM/provenance/attestation boundaries.
- Added a wrapper portability checker for shell wrappers and optional-runtime smoke commands, and wired it into CI and release-readiness gates.
- Reworked the public docs information architecture around reader paths for provider authors, operators, evaluators, release maintainers, demos, and roadmap history.
- Added a checkout-safe demo showcase runner with ten issue-backed workflows covering first-run local worlds, diagnostic issue bundles, robotics replay, remote media dry-runs, adapter authoring, batch eval, stdlib service host, Rerun gallery manifests, failure recovery labs, and cookbook validation, plus public docs and recipes for the preserved artifacts and triage steps.
- Added a roadmap expansion plan for 30 structured GitHub issues across production-grade quality/DevX/docs, demos and end-to-end showcases, and new features, explicitly excluding the already assigned Nano World Model work.
- Added contributor triage guidance for roadmap stream, capability, severity, and release-scope labels, and routed provider plus evaluation/benchmark issue templates to provider promotion, evidence, and private-security reporting expectations.
- Added read-only local state preflight through
worldforge world preflight. It checks world state directories, file-safe requested IDs, corrupted world JSON, invalid histories, object bounding boxes, preserved run manifests, stale run workspaces, unsafe artifact paths, and retention pressure while returning safe-to-attach diagnostics and explicit quarantine or dry-run recovery commands.
- Added checkout-safe operator failure drills through
worldforge drills. The drills cover missing credentials, missing optional dependencies, malformed provider output, benchmark budget violations, corrupted local world state, expired artifacts, and unsafe event metadata while preserving run manifests and optional issue bundles under the requested workspace.
- Added reference host deployment recipes for the stdlib service, batch evaluation, and robotics operator hosts. The recipes cover env templates, process/readiness/smoke/logging/evidence commands, expected success signals, first triage and rollback steps, and the checkout-safe, prepared-host, credentialed, GPU-bound, and robotics-lab ownership boundaries.
- Added issue-ready bundles for preserved run workspaces.
worldforge runs bundle <run-id> now exports one run to evidence_manifest.json, summary.md, and issue.md, prints a short issue template, preserves SHA-256 digests and safe_to_attach flags, and marks unsafe or host-local artifacts before attachment.
- Added preserved-run history actions through
worldforge runs index, worldforge runs bundle, and run-comparison reports. Operators can filter run workspaces by provider, capability, status, date, and safe artifact type; rows expose sanitized rerun commands plus issue-bundle and comparison actions, with failed/skipped/cancelled runs surfacing the recovery bundle command first.
- Provider scaffolding now generates a fuller fail-closed contract pack: an explicit
--implementation-status scaffold maturity claim, provider/profile tests for disabled capability calls, placeholder fixtures marked as non-evidence, an incomplete .json.stub runtime manifest, a workbench checklist, and printed validation commands. Existing scaffold files still require --force before overwrite.
- Added an adapter author workbench flow for provider promotion evidence. The non-Textual
workbench now handles catalog providers, scaffold providers, and the direct-construction
jepa-wms candidate; reports include runtime manifest status, fixture coverage, docs/catalog drift, redaction checks, promotion gaps by target status, safe artifact references, and validation commands, and worldforge provider workbench exposes the same logic through the CLI.
- Added cross-provider run comparisons for preserved eval and benchmark workspaces.
worldforge runs compare now exports a shared JSON/Markdown/CSV model with provider rows, capability and operation context, fixture digest, suite version, budget status, event counts, missing evidence, skip reasons, and claim-boundary language while refusing incompatible capability, fixture, budget, operation, or suite-version contexts.
- Added sanitized evaluation failure galleries. Failed evaluation reports now expose
representative fixture-level cases with expected contract notes, observed summaries, metrics
previews, and triage steps;
report.artifacts() also exports failure_gallery.json and failure_gallery.md for issue attachments.
- Added benchmark budget calibration artifacts generated from preserved benchmark JSON reports.
scripts/calibrate_benchmark_budgets.py writes a loadable candidate-budgets.json, a full budget-calibration.json, and a human-review Markdown report with source report digests, machine context, old thresholds, candidate thresholds, observed baselines, and rationale fields without modifying existing release budget files.
- Added a claim-to-evidence map for public README-level claims, capability surfaces, runtime boundaries, preserved artifacts, and explicit non-claims.
- Added a checkout-safe evidence bundle exporter for preserved eval and benchmark runs. The
bundle copies safe reports, run manifests, event logs, preset inputs, and budget fixtures,
records SHA-256 digests and
safe_to_attach flags, excludes unsafe or local-only artifacts, and can be linked from release evidence reports.
- Added five named benchmark presets (mock-smoke, parser-overhead, remote-media-dryrun, prepared-host, and release-evidence) exposed through new worldforge benchmark --list-presets, --show-preset, and --preset flags. Presets bundle a deterministic input fixture, an optional budget file, and a runtime-profile gate so checkout-safe regression checks, remote-media dry-runs, prepared-host evidence runs, and release gating each have a named entry point. Remote-media and prepared-host presets skip with a typed reason when the required provider environment is missing; checkout-safe and release presets fail non-zero on budget violations. Public surface lives at worldforge.benchmark_presets (BenchmarkPreset, list_presets, get_preset, load_preset_inputs, load_preset_budgets).
- Added a packaged capability fixture corpus under
worldforge.testing.fixtures covering the predict, embed, score, and policy capabilities. Each capability ships one valid baseline plus at least two invalid boundary fixtures with distinct error patterns. The new worldforge.testing.load_capability_fixture, iter_capability_fixtures, iter_all_fixtures, list_fixture_names, and CapabilityFixture helpers let conformance tests, evaluation suites, and provider authors reuse canonical inputs instead of inlining payloads. The assert_*_conformance() helpers' keyword arguments match each fixture's payload keys so a fixture can be passed straight through.
- Added a result
provenance envelope (schema_version: 2) to evaluation and benchmark JSON and Markdown reports. The envelope carries WorldForge version, command argv, providers, capabilities, runtime manifest references, input and result digests, budget file summary, emitted ProviderEvent count, suite contract version, claim boundary, and metric semantics so a claim can be reproduced and cited without console logs. CSV output and the existing run_metadata.input_file and run_metadata.budget_file fields are unchanged for backward compatibility.
- Added
cosmos-policy as a host-owned NVIDIA Cosmos-Policy ALOHA /act server adapter for the policy capability, including a runtime manifest, live-smoke CLI, provider docs, configuration summaries, and policy-plus-score planning coverage without adding CUDA, Docker, torch, or Cosmos-Policy dependencies to the base package.
- Added an optional OpenTelemetry provider-event sink that maps sanitized provider events to host-owned tracing spans without adding OpenTelemetry to the base dependency set.
- Added an optional Rerun integration for sanitized
ProviderEvent streams, world snapshots, plans, benchmark reports, and arbitrary JSON artifacts through RerunEventSink, RerunArtifactLogger, RerunSession, and RerunRecordingConfig.
- Added the
rerun optional extra and worldforge-demo-rerun, a checkout-safe showcase that writes a local .rrd recording by default and supports spawned, remote, or in-process gRPC Rerun viewer workflows. The extra accepts the Rerun SDK range needed to coexist with LeRobot runtime environments.
- Added Rerun visual layers for 3D world object boxes, robotics candidate targets, selected replay
paths, score bars, and latency bars.
scripts/robotics-showcase now writes a Rerun .rrd artifact for normal PushT policy+score runs unless --no-rerun is passed.
- Added a roadmap continuation document that defines the next three GitHub issue streams: provider evidence and runtime cohorts, evaluation evidence and claim integrity, and operator workflow plus adapter authoring.
- Added a provider cohort selection record that scores active and deferred provider candidates, selects the next evidence cohort, and keeps provider catalog claims unchanged until runtime evidence exists.
- Added a spatial scene artifact boundary record for future 3D scene providers, including candidate decisions, JSON-native artifact shape, asset redaction rules, host-owned responsibilities, and the fixture contract for follow-up validation.
- Added
validate_scene_artifact for checkout-safe spatial scene artifact validation, plus tiny valid and malformed fixtures covering transforms, units, non-finite values, unsafe references, and oversized metadata.
- Added a live-smoke evidence registry with schema validation, first-class missing-runtime and missing-credential skip statuses, docs for safe provider issue attachments, and release-evidence rendering support.
- Added a JEPA-WMS runtime manifest for prepared-host smoke evidence and a stable
input_digest field for smoke run manifests with synthetic input summaries.
- Added an optional live robotics showcase workflow for pull request and main-branch push
validation.
It runs real LeRobot policy inference plus real LeWorldModel checkpoint scoring in
non-interactive JSON mode, validates the resulting provider events and tensor contract, caches
Hugging Face/LeWorldModel checkpoint assets with
actions/cache, and uploads sanitized run evidence artifacts.
- Runway artifact downloads now validate provider-returned URLs before fetching, block local, private, and link-local destinations unless explicitly opted in, and stream downloaded bytes with a hard size cap instead of buffering unbounded response bodies.
- Preserved LeRobot loader provenance after lazy policy loading so real
from_pretrained runs no longer report as injected_policy in policy result metadata or provider events.
- Documented first triage commands for Cosmos and Runway media artifacts and added focused provider-level regression tests for failed, malformed, unsupported, expired, and retry-exhausted remote media paths.
- Added a MkDocs Material documentation site, strict docs-build validation, and a GitHub Pages
workflow that deploys the site from
main.
- Added
SECURITY.md with the vulnerability-reporting path and supported-version policy.
- Added public governance and contributor surfaces: code of conduct, support policy, maintainer ownership, citation metadata, issue templates, pull request template, and CODEOWNERS.
- Added documented
examples/benchmark-inputs.json and examples/benchmark-budget.json fixtures so README and docs benchmark commands are copy-paste runnable.
- Added explicit claim-boundary and metric-semantics fields to evaluation and benchmark JSON and Markdown reports.
- Added capability-protocol registration for narrow
Cost, Policy, Generator, Predictor, Reasoner, Embedder, and Transferer implementations, including diagnostics, planning, and benchmark routing without requiring a full BaseProvider subclass.
- Added an engineering quality standards page that maps WorldForge's Python packaging, testing, linting, typed-distribution, ML reproducibility, and robotics-runtime boundaries to upstream Python and scientific-computing guidance.
- Exported benchmark budget and fixture-loading helpers from the top-level Python package so provider benchmarking workflows do not need to reach into implementation modules.
- Scene object mutations and prediction actions now validate history payload JSON before committing state changes, so malformed metadata or action payloads cannot leave an in-memory world half mutated.
- Provider events now sanitize observable request targets and obvious secret-bearing message or metadata fields before logs or in-memory sinks can record them. Signed artifact URLs keep scheme/host/path context but drop query strings, fragments, and userinfo.
- Cosmos and Runway now strip whitespace and treat blank environment variables as unset, matching
the behaviour of every other provider. A blank
COSMOS_BASE_URL, NVIDIA_API_KEY, RUNWAYML_BASE_URL, RUNWAYML_API_SECRET, or RUNWAY_API_SECRET no longer masquerades as configured.
- scripts/test_package.sh now installs the built wheel generically instead of assuming the old worldforge-*.whl filename prefix, so the package contract still works after the worldforge-ai rename.
- scripts/test_package.sh now validates wheel and sdist contents before installing the wheel, including capability protocol files, the py.typed marker, console scripts, and source-package docs/tests/scripts.
- Public JSON-carrying models now reject non-JSON-native action parameters, scene metadata, provider-event metadata, score metadata, policy raw actions, and policy metadata at construction time instead of allowing values that fail later during persistence or artifact serialization.
- PredictionPayload, EvaluationResult, provider summaries, and BenchmarkResult now validate their JSON fields, finite metrics, score ranges, counts, and result coherence at construction time so invalid report artifacts fail before rendering.
- World import/load validation now requires the persisted schema version, validates embedded scene
object payloads, rejects
SceneObjectPatch misuse, and treats a single provider string in World.compare(...) as one provider rather than a sequence of characters.
- Capability protocol adapters now wrap unexpected runtime exceptions as
ProviderError after emitting failure events.
- LeWorldModel direct scoring now validates checkpoint-native candidate shape and requires one
returned score per candidate sample before constructing
ActionScoreResult.
- Benchmark budget fixtures now reject unknown top-level or budget-entry keys so typoed release thresholds fail closed.
- The LeWorldModel object-checkpoint builder now supports pinned Hugging Face revisions and loads
downloaded weights with
torch.load(..., weights_only=True) by default. The --allow-unsafe-pickle escape hatch is explicit for trusted legacy weights only.
- scripts/robotics-showcase --health-only no longer auto-builds or downloads a missing LeWorldModel object checkpoint; preflight reports checkpoint absence without mutating the cache.
- Production CLI and LeRobot provider checks now raise explicit WorldForge errors instead of
relying on Python
assert statements that disappear under optimized execution.
- CI workflows now run on Python 3.13 only. The multi-version OS/Python test matrix was removed, and Pages, release, and security jobs were aligned to the same interpreter version.
- Package metadata, docs, optional-runtime wrapper commands, and lint target now declare Python 3.13 only so the published support contract matches CI.
- Package metadata now uses SPDX license metadata and an explicit
uv package marker, while the Hatch wheel target is restricted to runtime package files.
- Pytest now runs with importlib import mode and strict xfail handling, while Ruff enforces sorted exports, explicit mutable class metadata annotations, literal exception-match patterns, and clearer pytest imports/assertions.
- Ruff now also enforces comprehension, simplification, return, performance, pytest-style, and
Ruff-native correctness rules across
src, tests, examples, and scripts.
- Provider contract helpers now use explicit
AssertionError checks instead of Python assert statements, so reusable adapter validation still runs under optimized Python.
- Deduplicated repeated provider scaffolding into shared BaseProvider._emit_operation_event and BaseProvider._health helpers, and moved no_grad_context plus prepare_model into providers/_policy.py. The cosmos, runway, leworldmodel, lerobot, gr00t, and jepa-wms adapters are unchanged externally but significantly shorter internally.
- Consolidated the shared
blue_cube tabletop scenario used by the LeRobot and LeWorldModel demos into worldforge.demos.make_blue_cube / blue_cube_goal / make_candidate_plans.
- TheWorldHarness no longer eagerly imports
worldforge.demos.* at module load; demo flow runners import lazily so the harness cold start does not drag the optional-runtime provider classes into memory.
- The robotics showcase wrapper no longer suppresses LeRobot runtime device fallback warnings, so CUDA-to-MPS or similar execution changes remain visible in the terminal.
- Documentation now routes release validation through explicit
uv/bash gate commands, keeps robotics preflight commands visible from the README/CLI/examples pages, and splits long optional runtime commands into copy-pasteable blocks.
- The robotics showcase deep dive now includes end-to-end flow, model payload, inference responsibility, and sequence diagrams that show how LeRobot policy inference, LeWorldModel cost inference, WorldForge planning, mock replay, and the visual report fit together.
- The README, package metadata, citation metadata, docs site description, and introduction now use a tighter project pitch: testable world-model workflows for physical-AI systems.
- save_world skips a redundant json.dumps/json.loads round trip; the validation call now runs directly against the serialized dict.
- Documentation metadata and README links now point at the published GitHub Pages site.
- Release tags now run the full quality gate before artifacts are built or published: lint, formatting, strict docs, coverage, dependency audit, package contract, and tests.
- The release gate now includes the lockfile check, coverage gate, package contract, build, and dependency audit using the locked dev environment.
- JEPA and Genie scaffold providers now advertise no executable capabilities. Their deterministic
mock-backed surrogate path remains available only for local adapter tests with
WORLDFORGE_ENABLE_SCAFFOLD_SURROGATES=1.
- Mock no longer advertises the provider-level
plan capability; planning remains a WorldForge facade workflow built from provider-specific predict, score, and policy surfaces.
- Release publishing now verifies that the pushed tag matches the package version, uses locked
pip-audit, attaches build provenance attestations, and is configured for PyPI trusted publishing.
- Hardened
ProviderEvent serialization so structured provider logs do not leak bearer tokens, API keys, signed URL query strings, or secret-like metadata values.
- Scaffold provider capability fail-closed behavior prevents deterministic surrogate outputs from being mistaken for real JEPA or Genie provider results in evals, benchmarks, or public reports.
- Added the M3-M5 TheWorldHarness surfaces:
- Added the M3-M5 TheWorldHarness surfaces:
ProvidersScreen with a capability matrix and a cancellable real mock.predict run; EvalScreen and BenchmarkScreen with preserved JSON reports; Run Inspector report previews; Home recent worlds/runs; dynamic Ctrl+P entries for worlds, providers, and saved reports.
- Added the
worldforge-high-contrast theme to the existing theme cycle and documented the three shipped themes.
- Added local harness guard scripts that reject raw hex literals in widget CSS and network-egress
calls under
src/worldforge/harness/.
- Added a Textual screenshot export matrix for the main harness screens at
100x30, 120x40, and 160x50, plus a deterministic README screenshot regeneration script.
- Reskinned TheWorldHarness with registered
worldforge-dark and worldforge-light themes, retiring the hard-coded hex literals in src/worldforge/harness/tui.py in favour of semantic tokens ($accent, $success, $warning, $error, $panel, $boost, $surface, plus the custom $muted variable) so the harness reads as a polished workspace on light terminals.
- Added a header chrome strip with a
worldforge > <flow> breadcrumb and a <provider> . <capability> status pill that update reactively when the selected flow changes.
- Added a hidden
Ctrl+T binding that cycles between the two registered themes without restarting the harness.
- Split TheWorldHarness into a screen stack: a
HomeScreen landing page with three jump cards (n create a world, p run a provider, e run an eval), a RunInspectorScreen that owns the existing flow visualisation, plus modal HelpScreen and PlaceholderScreen overlays. worldforge-harness opens on Home by default and on the Run Inspector when --flow is passed.
- Added the static command palette layer via
App.get_system_commands (Ctrl+P): "Jump: Home", "Jump: Run Inspector", "Open Help", one "Run flow: <title>" entry per registered flow, "Switch theme", and the stock Quit. Dynamic entries now index worlds, providers, and recent runs.
- Added
? to open a modal HelpScreen that lists every binding declared on the screen below it, plus chord bindings g h / g r for jump-to-Home and jump-to-Run-Inspector.
- Updated the
Header breadcrumb to reflect the active screen, deepening to the selected flow on the Run Inspector (worldforge › run-inspector › <flow>).
- Added a standalone Textual report for
scripts/robotics-showcase that renders the real LeRobot-plus-LeWorldModel run as a pipeline trace with metric bars, candidate ranking, provider events, and a fixed tabletop replay.
- Improved the robotics showcase report layout into a vertical, scrollable story with full-width candidate ranking, full-width tabletop replay, staged reveal delays, and an illustrative animated robot-arm replay.
- Added an in-report reading guide for runtime, tensor, and candidate-ranking panes plus a
? tabletop-replay help overlay for the real robotics showcase TUI.
- Added
lewm-real, a short uv run alias for real LeWorldModel checkpoint inference. The command now accepts --checkpoint, prints a staged pipeline log by default, and preserves machine-readable output with --json-only.
- Added
lewm-lerobot-real and worldforge-smoke-lerobot-leworldmodel, a host-owned real robotics smoke/showcase that composes LeRobot policy inference with LeWorldModel checkpoint scoring through WorldForge policy-plus-score planning, including visual logs and JSON output.
- Added
scripts/robotics-showcase and worldforge-robotics-showcase, a one-command PushT real robotics entrypoint that packages the LeRobot observation, LeWorldModel score tensor, translator, and action-candidate bridge defaults for the LeRobot + LeWorldModel showcase.
- Expanded the real robotics showcase output with an ASCII pipeline map, runtime bars, score summary, candidate target table, and tabletop replay diagram while keeping the machine-readable JSON path available.
- Made
scripts/robotics-showcase launch the Textual visual report by default while preserving --no-tui, --json-only, and --health-only for plain terminal, automation, and preflight runs.
- Added
--tui-stage-delay and --no-tui-animation to control the robotics showcase reveal pace and animation.
- Added the
worldforge world CLI command group for local JSON persistence workflows, including create, list, show, history, export, import, and fork commands backed by the existing validated WorldForge persistence API.
- Added persisted-world mutation and prediction commands:
worldforge world objects, add-object, update-object, remove-object, and predict. These commands load local JSON worlds, apply typed scene/action values, and save through WorldForge.save_world(...); world predict --dry-run previews provider output without replacing the saved file.
- Added
WorldForge.delete_world(...) and worldforge world delete so local JSON world removal uses the same validated persistence boundary as save/load/import/fork. TheWorldHarness now calls this public API instead of unlinking world files directly.
- Added persisted history entries for scene object add/update/remove mutations, including typed
Action payloads and snapshots that can be restored or forked. Object position patches now translate bounding boxes with the pose to keep local scene state coherent.
- Added
worldforge examples with Markdown and JSON output so CLI users can discover checkout scripts, packaged demos, and optional smoke commands without scanning repository docs.
- Added the
worldforge-demo-lerobot console command and packaged the LeRobot policy-plus-score planning walkthrough under src/worldforge/demos/lerobot_e2e.py, keeping examples/lerobot_e2e_demo.py as a compatibility wrapper.
- Added
examples/README.md as a short command index for the checkout scripts and packaged demos.
- Added CLI help snapshot tests for the primary
worldforge command surface.
- Added
scripts/generate_provider_docs.py so the provider catalog table can be refreshed and checked from src/worldforge/providers/catalog.py.
- Added
worldforge provider docs so users can discover provider documentation paths from the CLI.
- Added TheWorldHarness as an optional Textual TUI (
worldforge-harness and worldforge harness) for visually running and inspecting packaged E2E demos.
- Added a TheWorldHarness diagnostics flow for provider catalog inspection and mock benchmark comparison across predict, reason, generate, and transfer.
- Added benchmark budget gates for release and claim-oriented checks.
worldforge benchmark can load a JSON budget file, print gate violations, and exit non-zero when success-rate, error-count, retry-count, latency, throughput, or unmatched-budget checks fail.
- Added benchmark input fixtures.
worldforge benchmark --input-file benchmark-inputs.json now loads deterministic JSON inputs for prediction, generation, transfer, embedding, score, and policy runs; transfer clips can point at files relative to the input JSON or inline base64 frames.
- Added
.env.example documenting every provider environment variable recognized by WorldForge (COSMOS_BASE_URL, NVIDIA_API_KEY, RUNWAYML_API_SECRET and the legacy RUNWAY_API_SECRET alias, RUNWAYML_BASE_URL, LEWORLDMODEL_POLICY and the legacy LEWM_POLICY alias, LEWORLDMODEL_CACHE_DIR, LEWORLDMODEL_DEVICE, the full GROOT_POLICY_* and GROOT_EMBODIMENT_TAG set, the full LEROBOT_* set including the legacy LEROBOT_POLICY alias, the JEPA_WMS_* candidate variables, and the scaffold JEPA_MODEL_PATH and GENIE_API_KEY).
- Added
lerobot as a first-class optional policy provider for Hugging Face LeRobot pretrained policies. The adapter lazily imports LeRobot, supports injectable policies and policy loaders for offline testing, validates observation payloads, preserves raw policy tensors, and requires a host-owned action translator before returning executable WorldForge actions.
- Added
leworldmodelas a first-class optional score provider for LeWorldModel JEPA cost models, includingActionScoreResult,WorldForge.score_actions(...), score-output validation, provider profile metadata, and fixture-driven tests. - Added score-based planning, the
policycapability,ActionPolicyResult,WorldForge.select_actions(...), policy-only planning, and policy-plus-score planning. - Added experimental host-owned
gr00tPolicyClient support, ajepa-wmsdirect-construction score-provider candidate scaffold, andscripts/scaffold_provider.pyfor safe provider scaffolding. - Added checkout-safe LeWorldModel and LeRobot demos plus optional LeWorldModel/GR00T/LeRobot smoke entry points for host-owned runtimes.
- Bumped project metadata and public citation references to `0.4.0`.
- Validated provider capability names across public capability checks and CLI provider filters, so typos such as `generation` fail explicitly instead of being treated as unsupported.
- Changed `ProviderCapabilities()` to advertise no operations by default. Providers must opt into every capability explicitly, and unsupported `predict()` calls now fail with `ProviderError` instead of `NotImplementedError`.
- Centralized in-repo provider discovery in `src/worldforge/providers/catalog.py`, including the provider factory list and explicit always-register policy for `mock`. `WorldForge` now uses the catalog instead of relying on constructor ordering in `_known_providers()`.
- Extended the provider catalog with documentation-page and runtime-ownership metadata used by the generated provider docs table.
- Moved the README provider surface table onto the same generated catalog source as the provider docs index.
- Grouped `worldforge examples`, `docs/src/examples.md`, and `examples/README.md` by task so prediction, comparison, score planning, policy planning, and optional smoke paths are easier to scan.
- Reworked the README, introduction, architecture, provider, and operations docs around the capability contract: predictive models, score providers, policy providers, media adapters, host-owned optional runtimes, and explicit persistence/evaluation boundaries.
- Added dedicated provider pages for Cosmos, Runway, and LeWorldModel, and normalized the GR00T, LeRobot, and JEPA-WMS pages around capability surface, runtime ownership, input/output contracts, failure modes, and validation coverage.
- Updated package metadata around WorldForge's physical-AI world-model integration layer, removed the development-status classifier, and pointed documentation metadata at repository docs instead of a standalone project domain.
- Aligned documented Ruff commands with CI, `README.md`, and `AGENTS.md` by keeping `scripts/` in both `ruff check` and `ruff format` targets.
- Rejected non-file-safe world IDs before local persistence reads and writes, preventing traversal through imported or caller-supplied world identifiers.
- Validated persisted world history entries end to end, including non-negative entry steps, historical snapshot states, non-empty summaries, serialized action payloads, and the invariant that history entry steps cannot exceed the current world step.
- Wrote saved worlds through validated same-directory temporary files before atomically replacing the destination JSON file.
- Rejected stringly-typed booleans for scene object graspability, provider capabilities, and the JEPA-WMS `actions_are_normalized` option instead of silently coercing values such as `"false"` to `True`.
- Tracked `.env.example` in the repository by adding an explicit `!.env.example` exception to `.gitignore`; the general `.env.*` glob was silently excluding the onboarding template.
- Rejected non-finite public numeric inputs for positions, rotations, request policies, provider events, video clips, reasoning confidence, embedding vectors, generation FPS, and prediction payload metrics.
- Rejected duplicate scene object IDs when adding objects to a world.
- Rejected persisted/provider-supplied world state whose scene-object map key disagrees with the object's embedded `id`.
- Made the coverage gate invoke pytest with the `harness` extra so optional Textual TUI tests are available during coverage runs while the base package and matrix tests remain free of Textual.
- Validated Runway ratio parsing before constructing returned clip metadata.
- Validated Cosmos health and generation response payloads before decoding returned videos.
- Validated Runway organization, task creation, task polling, task output, artifact content type, expired artifact, and empty artifact responses before returning clips.
- Corrected the LeWorldModel smoke task to require an existing upstream object checkpoint instead of relying on a nonexistent PyPI checkpoint-preparation helper.
- Updated the real LeWorldModel smoke instructions to use the GitHub `stable-worldmodel` source package and `datasets>=2.21`, matching the runtime that can load supported LeWM checkpoints.
- Rejected score-based plans when the score provider returns a different number of scores than executable candidate action plans, preventing provider-native score tensors from drifting away from the actions WorldForge can execute or report.
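
The strict-input fixes in this list (no stringly-typed booleans, no non-finite numbers, score counts that match candidate counts) follow a common validation pattern. The sketch below illustrates that pattern only; the helper names `require_bool`, `require_finite`, and `require_matching_scores` are hypothetical and are not WorldForge APIs.

```python
import math
from collections.abc import Sequence


def require_bool(value: object, name: str) -> bool:
    """Accept only real booleans. Note that bool("false") evaluates to True."""
    if not isinstance(value, bool):
        raise ValueError(f"{name} must be a bool, got {type(value).__name__}")
    return value


def require_finite(value: object, name: str) -> float:
    """Accept ints and floats that are finite; reject NaN, infinities, and bools."""
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{name} must be a number, got {type(value).__name__}")
    try:
        number = float(value)
    except OverflowError as exc:  # very large ints cannot be represented as float
        raise ValueError(f"{name} is too large to represent") from exc
    if not math.isfinite(number):
        raise ValueError(f"{name} must be finite, got {number!r}")
    return number


def require_matching_scores(scores: Sequence[float], candidates: Sequence[object]) -> None:
    """Reject a score list whose length differs from the candidate plan list."""
    if len(scores) != len(candidates):
        raise ValueError(
            f"got {len(scores)} scores for {len(candidates)} candidate plans"
        )
```

Raising on the first bad value, rather than coercing it, keeps a typo such as `"false"` from silently enabling an option.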
- Hardened local JSON persistence against path traversal by validating world IDs before resolving storage paths.
- Raised the development dependency floor to `pytest>=9.0.3` and refreshed `uv.lock` to remove the locked `pytest 9.0.2` vulnerability reported as `CVE-2025-71176`.
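
The persistence hardening noted above combines two standard techniques: validating an identifier before it becomes part of a path, and writing through a same-directory temporary file that atomically replaces the destination. The sketch below shows both under stated assumptions; the `SAFE_WORLD_ID` rule and the `save_world_json` function are hypothetical illustrations, not the project's actual implementation or its real ID rules.

```python
import json
import os
import re
import tempfile
from pathlib import Path

# Hypothetical file-safety rule: letters, digits, underscore, and hyphen only.
SAFE_WORLD_ID = re.compile(r"[A-Za-z0-9_-]+")


def save_world_json(root: Path, world_id: str, payload: dict) -> Path:
    """Validate the ID, write to a temp file in the same directory, then replace."""
    if not SAFE_WORLD_ID.fullmatch(world_id):
        raise ValueError(f"world id is not file-safe: {world_id!r}")
    root.mkdir(parents=True, exist_ok=True)
    destination = root / f"{world_id}.json"
    # The temp file lives in the destination directory so os.replace stays on
    # one filesystem, which is what makes the final rename atomic.
    fd, tmp_name = tempfile.mkstemp(dir=root, prefix=f".{world_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Re-read the temp file to confirm it parses before it becomes visible.
        with open(tmp_name, encoding="utf-8") as handle:
            json.load(handle)
        os.replace(tmp_name, destination)
    except BaseException:
        # Clean up the temp file on any failure, then re-raise.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return destination
```

A reader of the destination file sees either the previous complete JSON or the new complete JSON, never a partial write. This does not make the store safe for concurrent writers.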
- Added `AGENTS.md` with repository identity, architecture, commands, conventions, constraints, and gotchas for contributors.
- Promoted the real LeRobot-plus-LeWorldModel robotics showcase to the top of the README with screenshots, a one-command entrypoint, and a dedicated walkthrough covering the pipeline, runtime boundaries, artifacts, and customization path.
- Added a dedicated CLI reference and reduced duplicate README/provider demo prose so the public front face points to one command map instead of repeating optional-runtime narratives.
- Added user and operator playbooks for checkout validation, provider capability selection, provider diagnostics, adapter promotion, local persistence recovery, remote artifacts, optional runtime smokes, benchmarks, incident triage, and release gates.
- Added this changelog and linked it from the README.
- Documented host-owned persistence, provider limits, and provider workflow failure modes.
- Added a world-model taxonomy document, plus expanded architecture docs with text and Mermaid diagrams for provider injection, predictive planning, score-based planning, observability, and the LeWorldModel-shaped runtime pipeline.
- Added a provider authoring guide that turns the taxonomy into capability, validation, testing, observability, and documentation checklists for new adapters.
- Documented GR00T live-smoke requirements for Isaac-GR00T's CUDA/TensorRT runtime and the remote policy-server path for unsupported hosts.
- Typed planning goals for `object_at`, `object_near`, `spawn_object`, and `swap_objects`.
- Built-in evaluation suites for generation, physics, planning, reasoning, and transfer.
- Provider benchmark harness with latency, retry, throughput, JSON, Markdown, and CSV reporting.
- Provider observability through `ProviderEvent`, JSON logging, in-memory recording, and metrics aggregation sinks.
- HTTP-backed Cosmos and Runway beta adapters with typed request policy and retry behavior.
- Reusable provider contract checks under `worldforge.testing`.
Constraints carried forward from this release; see the docs site for the current status of each:
- JEPA and Genie remain scaffold adapters backed by deterministic mock behavior after credential checks.
- Evaluation scores are deterministic adapter contract signals, not physical fidelity or media quality guarantees.
- World persistence is local JSON and is not safe as a concurrent multi-writer store.