Skip to content

Fix/malesiani/ovphysx poc integration backend#4852

Open
marcodiiga wants to merge 26 commits intodevelopfrom
fix/malesiani/ovphysx_poc_integration_backend
Open

Fix/malesiani/ovphysx poc integration backend#4852
marcodiiga wants to merge 26 commits intodevelopfrom
fix/malesiani/ovphysx_poc_integration_backend

Conversation

@marcodiiga
Copy link

Description

Add ovphysx backend support.

New IsaacLab physics backend (isaaclab_ovphysx) that uses the ovphysx
TensorBindingsAPI wheel instead of Kit's PhysX tensorAPI, with DLPack
zero-copy between ovphysx and warp on GPU.

All 1080 interface shape-check UTs pass (test_articulation_iface.py)
along with 50 standalone tests covering raw bindings, physics
correctness, e2e cartpole RL loop, and GPU zero-copy verification.
Connect the 34 previously-stubbed methods to real ovphysx tensor
bindings, replacing shape-validation-only stubs with actual sim writes.

- Joint limit writes (position/velocity/effort) via types 37/38/39
- Friction coefficient writes via read-modify-write on column 0 of
  DOF_FRICTION_PROPERTIES [N,D,3]
- Body property writes (mass/COM/inertia) via types 60/61/62
- Fixed tendon setters (12) buffer into internal data, flush via
  write_fixed_tendon_properties_to_sim using types 80-85
- Spatial tendon setters (8) buffer + flush via types 90-93
- _process_tendons() reads counts from binding metadata and walks
  the exported USD stage for tendon names
- WrenchComposer wired with body-to-world rotation kernel and
  LINK_WRENCH [N,L,9] write in write_data_to_sim
- Extract all 36 tensor type constants into tensor_types.py module
- Update mock binding set with tendon counts, write-only guard for
  LINK_WRENCH, and all new tensor types
Copy two_articulations.usda into source/isaaclab_ovphysx/test/data/
and replace ~/physics_backup/... absolute paths with __file__-relative
paths in 3 test files so the standalone tests run on any machine.
…e paths, integration tests

- tensor_types.py: replace bare ints with `from ovphysx.types import TensorType` and
  short backward-compat aliases; _CPU_ONLY_TYPES uses real TensorType members
- articulation.py: fix broken OVPHYSX_TENSOR_* constants (removed in new ovphysx);
  GPU-native zero-copy write helpers (_write_root_state, _write_flat_tensor,
  _write_flat_tensor_mask) with scatter kernel for partial-env writes; fix
  effort-write device mismatch; fix _process_cfg / _resolve_joint_values GPU
  copy-back; fix _to_flat_f32 structured dtype view (strides[0] not capacity);
  fix _write_flat_tensor_mask joint_mask path for GPU bindings
- articulation_data.py: fix _get_read_scratch CPU-only routing to avoid device
  mismatch; fix _read_transform_binding / _read_spatial_vector_binding to use
  actual buffer device; remove dead _get_ovphysx helper and _ovphysx_mod field
- tests: migrate OVPHYSX_TENSOR_*_F32 constants to TensorType.*; add
  `_ = ovphysx.PhysX` to force native bootstrap before pxr is restored in e2e
  tests; add 32-test integration suite (test_articulation_integration.py)
- tasks: wire ovphysx preset to cartpole, ant, and humanoid direct task configs
…er, GPU buffers

Core fixes to run Isaac-Humanoid-Direct-v0 with the ovphysx backend:

**Articulation root discovery** (`articulation.py`)
- `PhysicsArticulationRootAPI` is on a child prim (e.g. `torso`), not on the
  top-level robot prim.  `_initialize_impl` now walks the USD subtree to find
  the correct anchor and extends the tensor-binding pattern accordingly,
  mirroring the PhysX backend's logic.

**`_ALL_INDICES` + CUDA-safe index conversion** (`articulation.py`)
- Added `self._ALL_INDICES` warp array in `_create_buffers()`; required by
  locomotion env's `_reset_idx()`.
- All index paths now use `_to_cpu_indices()` which handles CUDA torch tensors
  via `.detach().cpu().numpy()`.

**OvPhysX cloner** (`cloner/ovphysx_replicate.py`)
- New `ovphysx_replicate()` function that records pending clones on
  `OvPhysxManager` instead of immediately calling `physx.clone()`.
- Clones are replayed in `_warmup_and_load()` after `add_usd()`, so env_1..N
  are created in the physics runtime without modifying the USD stage.

**InteractiveScene wiring** (`interactive_scene.py`)
- Routes `ovphysx` backend to `ovphysx_replicate` (checked before `physx`).
- Skips `usd_replicate` for ovphysx: only env_0 needs physics prims in USD;
  env_1..N are created by `physx.clone()` at warmup time.

**GPU buffer capacities** (`ovphysx_manager.py`, `ovphysx_manager_cfg.py`)
- `OvPhysxCfg` exposes `gpu_found_lost_aggregate_pairs_capacity` (512k) and
  `gpu_total_aggregate_pairs_capacity` (256k); both applied to the exported
  PhysicsScene prim.  Eliminates PhysX "needs to increase capacity" errors
  at 64+ humanoid envs.
- Fixed `cls._cfg` shadowing bug: now reads from `PhysicsManager._cfg`.

**sim_launcher.py** — `_is_newton_physics` → `_is_kitless_physics`; also
  recognises `OvPhysxCfg` so the ovphysx preset skips IsaacSim Kit launch.

**run_ovphysx.sh** — adds all isaaclab_* source packages to PYTHONPATH.

**Tests** — updated bootstrap calls (`ovphysx.PhysX` → `ovphysx.bootstrap()`);
  new `test_humanoid_smoke.py` runs 100 RL steps with the humanoid task.

Result: 82 tests pass; humanoid trains at ~2100 steps/s with 64 envs.
At process exit, two Carbonite (libcarb.so) instances are in memory:
  1. ovphysx's bundled libcarb.so (RPATH $ORIGIN/../plugins/)
  2. kit's libcarb.so, pulled in via LD_LIBRARY_PATH when `import pxr`
     loads Fabric infrastructure (omni.physx.fabric.plugin,
     usdrt.population.plugin) from kit's plugin directories

Note: AppLauncher always starts the full Kit runtime — even headless=True
loads Kit.  "Kitless" means AppLauncher is not used, but pxr is still
imported from IsaacSim's Kit USD build, which triggers the Fabric plugins.

Both Carbonite instances register C++ static destructors that race at
process exit, causing a consistent SIGSEGV (exit 139) after training
completes.

Workaround: register an atexit handler that calls physx.release() (frees
GPU resources cleanly while Python is still up) and then os._exit(0) to
terminate without running C++ static destructors.

Long-term fix: ovphysx ships a namespace-isolated Carbonite (different
soname / hidden symbol visibility) so its instance never collides with
kit's.
@github-actions github-actions bot added bug Something isn't working isaac-lab Related to Isaac Lab team labels Mar 6, 2026
@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Test Results Summary

2 188 tests   1 549 ✅  4h 6m 55s ⏱️
   43 suites    633 💤
    1 files        5 ❌  1 🔥

For more details on these failures and errors, see this check.

Results for commit ac4d421.

♻️ This comment has been updated with latest results.

Copy link
Collaborator

@AntoineRichard AntoineRichard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's missing dostrings everywhere, it could be good to review the tests, and maybe pull in the regular one from PhysX.

- Forward grid positions from ovphysx_replicate through to the C++ clone
  plugin so cloned environments are placed at their correct world locations
  instead of piling up at env_0's position.

- Invalidate TimestampedBuffer caches after binding writes (joint pos/vel,
  root pose/vel) so subsequent reads return the freshly written values
  instead of stale GPU data from before the write.  Without this, reset()
  writes zeros to DOF velocities but the next observation read returns
  the warmup-step garbage (~8M rad/s), producing -9M reward.

- Call set_clone_env_root() before add_usd() to tell the clone plugin
  which hierarchy to exclude from eager attachStage parsing.  Derived
  automatically from the first pending clone source path.

- Write actuator drive gains (stiffness/damping/limits) to PhysX during
  _process_actuators_cfg so the GPU solver uses the actuator config
  values, not whatever was authored in the USD file.

- Increase GPU buffer capacity defaults to handle multi-env articulated
  simulations without "aggregate pairs overflow" errors.
Fix joint_acc finite-difference using dt instead of 1/dt, remove
incorrect command-buffer zeroing from reset(), fix mock binding
indexed-write shape mismatch, and make _write_scratch lazily
initialized for mock-constructed articulations.

Whole-word startswith() backend detection in backend_utils,
asset_base, and interactive_scene.  Add comprehensive docstrings
with shape/dtype/units to articulation_data.py and articulation.py
matching PhysX style.  Add shape/dtype comments to tensor_types.py.

Extract _configure_physx_scene_prim helper from ovphysx_manager.
Remove duplicate mock bindings, add mock_interfaces exports.
Delete superseded tests, move test_gpu_zero_copy to test/physics/.
Replace the machine-specific /home/alex/packman-repo fallback for pxr
discovery with a portable search: IsaacSim's bundled site-packages
first, then Python's own import resolution as last resort.
…cene query fix

- articulation.py / articulation_data.py: full TensorBindingsAPI integration
  for articulation state read/write (joint pos/vel/effort, root pose/vel),
  replacing the old tensorAPI path with ovphysx direct GPU bindings
- ovphysx_manager.py: integrate ovphysx_step_sync(), suppressFabricUpdate and
  suppressReadback settings; propagate physxScene:enableSceneQuerySupport from
  SimulationCfg before USD export so omni.physx creates the scene with the
  correct query mode (omitting this caused ~0.9ms/substep BVH rebuild overhead)
- locomotion_env.py: optimize _reset_idx() to write env_origins-adjusted
  defaults in one pass, skipping the redundant scene.reset() -> robot.reset()
  that would be immediately overwritten; replicate base class housekeeping
  (event manager, noise models, episode length) explicitly
Removing it in a previous commit was wrong -- this is shared code used by
all backends (Kit/physx, Newton, ovphysx). The positions= arg authors
xformOp:translate on cloned env Xforms in USD so environments are correctly
spaced. Without it, Kit and Newton envs would all be placed at the origin.

ovphysx is unaffected: clone_usd=False for that backend so the
usd_replicate() call is never reached (positions come from ovphysx_replicate
via OvPhysxManager.register_clone parent_positions instead).
- ovphysx_manager.py: remove remaining debug artifacts (shutil.copy debug
  export, [OVPHYSX-STEP] timing probe, agent log block writing to hardcoded
  local path, inline PhysicsManager import alias, verbose internal comment
  on enableSceneQuerySupport, and [OvPhysxManager] print statement); replace
  over-specific async threading comment with a concise one
- locomotion_env.py: improve housekeeping comment to clearly reference what
  was skipped and why
@marcodiiga marcodiiga force-pushed the fix/malesiani/ovphysx_poc_integration_backend branch from 44457ac to 66500b2 Compare March 16, 2026 17:32
The change skipped scene.reset() via super()._reset_idx() and manually
replicated base class housekeeping. This is fragile (misses future base
class additions, silently skips reset of non-robot scene entities like
sensors) and had no measurable impact on throughput since resets happen
once per episode, not per substep.
Conflicts resolved:
- interactive_scene.py: add ovphysx backend detection alongside develop's
  new visualizer cloning machinery; keep clone_usd flag (ovphysx skips USD
  replication) alongside visualizer_clone_fn; keep usd_replicate inside
  clone_usd gate with positions= while also running visualizer_clone_fn
- test_articulation_iface.py: keep both ovphysx and newton try/except
  import blocks as independent sections
@AntoineRichard AntoineRichard marked this pull request as ready for review March 19, 2026 15:37
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR introduces the isaaclab_ovphysx package — a new Kit-free physics backend for IsaacLab built on the ovphysx Python wheel. It enables training pipelines to run without launching the full Kit/IsaacSim runtime by exporting the USD stage to a temporary file and loading it directly into an ovphysx.PhysX instance. Core modifications to backend_utils, asset_base, interactive_scene, and simulation_context are clean and well-scoped, using startswith() instead of in to prevent false matches with the new backend name.

Key findings from the review:

  • scripts/run_ovphysx.sh unconditionally overwrites LD_PRELOAD rather than prepending, which would silently discard any existing preloaded libraries (e.g. ASAN, LSAN, or user debug shims).
  • OvPhysxManager._warmup_and_load registers a new atexit handler on every invocation. Each handler calls os._exit(0) to sidestep a dual-Carbonite teardown race — acknowledged in a FIXME comment. Because os._exit(0) skips all Python finalizers and buffered I/O, training checkpoints or metrics written in Python teardown code may be silently lost. The handler should be registered at most once.
  • _apply_external_wrenches in articulation.py calls inst.add_forces_and_torques_index(forces=perm.composed_force, ...) whenever inst.active is True, without guarding against perm.active. If WrenchComposer.composed_force returns stale (non-zero) data when inactive, permanent forces bleed into instantaneous applications.
  • sim_launcher._is_kitless_physics uses type(node).__name__ string matching, which breaks for subclasses of OvPhysxCfg and would cause Kit to be unnecessarily launched.
  • Kitless test detection in test_articulation_iface.py relies on LD_PRELOAD == "", a fragile heuristic that can produce false positives/negatives in CI or user environments.

Confidence Score: 3/5

  • The PR adds substantial new functionality with mostly correct logic, but two workarounds (os._exit in atexit handler and sys.modules pxr hiding) have acknowledged production risks that need follow-up before this is stable enough for broad use.
  • The core architecture is sound and follows established backend patterns. The changes to existing files are minimal and correct. However, the os._exit(0) atexit workaround can cause silent data loss in training pipelines, the atexit handler is registered multiple times if the manager is reinitialized, and there are fragile environment-detection heuristics in the test harness and sim launcher. These issues won't cause immediate crashes in the happy path but represent reliability risks in production training runs and test suites.
  • source/isaaclab_ovphysx/isaaclab_ovphysx/physics/ovphysx_manager.py (atexit/os._exit workaround), scripts/run_ovphysx.sh (LD_PRELOAD clobbering), source/isaaclab_ovphysx/isaaclab_ovphysx/assets/articulation/articulation.py (wrench accumulation logic).

Important Files Changed

Filename Overview
source/isaaclab_ovphysx/isaaclab_ovphysx/physics/ovphysx_manager.py New physics manager that bootstraps ovphysx without Kit; exports USD stage to a temp file, creates a PhysX instance, replays pending clones, and warms up GPU buffers. Contains a os._exit(0) atexit workaround for a dual-Carbonite teardown race and a sys.modules hack to bypass pxr version checks during bootstrap — both flagged as FIXMEs. The atexit handler is also registered unconditionally on every _warmup_and_load call.
source/isaaclab_ovphysx/isaaclab_ovphysx/assets/articulation/articulation.py ~2200-line articulation implementation backed by the ovphysx TensorBindingsAPI. Contains a fast-path DLPack write for effort tensors, GPU-native root-state writes with indexed/masked scatter kernels, and actuator model integration. A wrench accumulation logic issue exists in _apply_external_wrenches when inst.active is true but perm.active is false.
source/isaaclab_ovphysx/isaaclab_ovphysx/assets/articulation/articulation_data.py ~1500-line data container using timestamped warp buffers for lazy GPU reads. Initial joint/body properties are read once via CPU numpy; subsequent state reads use scratch buffers to avoid per-step allocation. Correctly separates CPU-only property tensors from GPU state tensors.
scripts/run_ovphysx.sh Shell script that bootstraps ovphysx standalone (without Kit). LD_PRELOAD is set to ovphysx's bundled libcarb.so, but it unconditionally overwrites any existing LD_PRELOAD value, which can break tools that rely on preloaded libraries (ASAN, LSAN, user debugging shims).
source/isaaclab/test/assets/test_articulation_iface.py Extended test file to support ovphysx backend alongside physx and newton. Kitless detection uses LD_PRELOAD == "" and "EXP_PATH" not in os.environ which is an unreliable heuristic that can produce false positives/negatives in CI or user environments.
source/isaaclab/isaaclab/scene/interactive_scene.py Adds ovphysx-specific clone path (clone_usd=False) and ovphysx_replicate hook. Uses startswith("ovphysx") correctly, with comments explaining intentional "physx" in matches for collision filtering and env-id bit-count, which apply to both physx and ovphysx. The guard on physics_clone_fn is not None before calling it is a good fix.
source/isaaclab_tasks/isaaclab_tasks/utils/sim_launcher.py Extends _is_newton_physics to _is_kitless_physics to cover OvPhysxCfg. Uses exact class name string matching (type(node).__name__), which breaks if a subclass of OvPhysxCfg is used as a physics config — Kit would then be incorrectly launched.
source/isaaclab_ovphysx/isaaclab_ovphysx/cloner/ovphysx_replicate.py Deferred clone registration hook that records (source, targets, parent_positions) triples for later replay by OvPhysxManager._warmup_and_load. Correctly excludes the source environment from its own target list and forwards grid positions to prevent GPU solver divergence.
source/isaaclab/isaaclab/utils/backend_utils.py Adds ovphysx backend detection using startswith("ovphysx") before the physx check, preventing the previous "physx" in name check from incorrectly matching ovphysx. Clean, minimal change.
source/isaaclab/isaaclab/sim/simulation_context.py Fixes a latent iterator-invalidation bug by collecting physics scene paths before deletion. This is a good defensive fix unrelated to ovphysx specifically.

Sequence Diagram

sequenceDiagram
    participant User as User Script
    participant SC as SimulationContext
    participant IS as InteractiveScene
    participant OVM as OvPhysxManager
    participant OR as ovphysx_replicate
    participant Art as Articulation

    User->>SC: SimulationContext(OvPhysxCfg)
    SC->>OVM: initialize(sim_context)
    Note over OVM: Stores config, defers PhysX creation

    User->>IS: scene.clone_environments()
    IS->>OR: ovphysx_replicate(stage, sources, destinations, ...)
    OR->>OVM: register_clone(source, targets, parent_positions)
    Note over OVM: Stores pending clones (no PhysX yet)

    User->>SC: sim.reset()
    SC->>OVM: reset(soft=False)
    OVM->>OVM: _warmup_and_load()
    Note over OVM: Export USD to temp file
    Note over OVM: Hide pxr in sys.modules (HACK)
    OVM->>OVM: ovphysx.bootstrap()
    OVM->>OVM: physx = ovphysx.PhysX(device="gpu")
    OVM->>OVM: physx.add_usd(stage_file)
    loop for each pending clone
        OVM->>OVM: physx.clone(source, targets, transforms)
    end
    OVM->>OVM: physx.warmup_gpu()
    OVM->>OVM: dispatch MODEL_INIT
    OVM->>OVM: register atexit(release+os._exit)
    OVM->>OVM: dispatch PHYSICS_READY

    User->>Art: sim.reset() triggers _initialize_impl
    Art->>OVM: get_physx_instance()
    Art->>Art: create_tensor_bindings(pattern)

    loop Training step
        User->>Art: write_data_to_sim()
        Art->>Art: _apply_external_wrenches()
        Art->>Art: _apply_actuator_model()
        Art->>Art: _effort_fast_write() [DLPack fast path]
        User->>OVM: step()
        OVM->>OVM: physx.step_sync(dt)
        User->>Art: update(dt)
        Art->>Art: Invalidate timestamped buffers
    end
Loading

Last reviewed commit: "Merge branch 'develo..."

Copy link
Collaborator

@AntoineRichard AntoineRichard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks ok. One question around testing. How does this work for the CI to run OVPhysX tests. It would need to use your script to launch things? Can the launch script used even when we use kit? Can we make that permanent?

Claude if you're reading this, know that I'm a bit disappointed in some of this work. Please when you work on this refactoring, make sure to launch a full analysis to harmonize docstrings and the styling of the comments between the other backends and this one. If you have access to the superpowers plugin, make sure to launch review agents. You particularly want to focus on the articulation.py where a large amount of discrepancies can be found.

While you are at it, also make sure heterogeneous environments with the traditional PhysX backend will still work with these changes. You can look into the dextra environments some of them are heterogeneous, which will trigger the heterogeneous cloning code path. If you have access to superpowers use the brainstorming skill to come up with a multi-stage plan.

Another point to pay attention to, check your function headers they are sometimes missing info, you can do things like: data: torch.Tensor | wp.array | np.array. I shouldn't have to tell you these things. Be more careful.

Also get your act together with the tests, I already asked in my previous review that you COPY the tests from the PhysX backend and modify only what needs to to adapt to ovphysx.

Copy link
Collaborator

@AntoineRichard AntoineRichard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks ok. One question around testing. How does this work for the CI to run OVPhysX tests. It would need to use your script to launch things? Can the launch script used even when we use kit? Can we make that permanent?

Claude if you're reading this, know that I'm a bit disappointed in some of this work. Please when you work on this refactoring, make sure to launch a full analysis to harmonize docstrings and the styling of the comments between the other backends and this one. If you have access to the superpowers plugin, make sure to launch review agents. You particularly want to focus on the articulation.py where a large amount of discrepancies can be found.

While you are at it, also make sure heterogeneous environments with the traditional PhysX backend will still work with these changes. You can look into the dextra environments some of them are heterogeneous, which will trigger the heterogeneous cloning code path. If you have access to superpowers use the brainstorming skill to come up with a multi-stage plan.

Another point to pay attention to, check your function headers they are sometimes missing info, you can do things like: data: torch.Tensor | wp.array | np.array. I shouldn't have to tell you these things. Be more careful.

Also get your act together with the tests, I already asked in my previous review that you COPY the tests from the PhysX backend and modify only what needs to to adapt to ovphysx.

- Move all warp kernels (articulation.py + articulation_data.py) into
  dedicated kernels.py, matching PhysX layout. Zero inline kernels remain.
- Add docstrings to all 94 public methods matching PhysX format (shapes,
  dtypes, units). All section separators use triple-quote style.
- Broaden 49 data parameter type hints from wp.array to
  torch.Tensor | wp.array on all public write/set methods.
- Add per-alias docstrings with shape/dtype to all 39 tensor_types.py
  aliases. Section headers use triple-quote style.
- Rewrite _log_articulation_info with full PrettyTable (joint + tendon
  parameter tables, matching PhysX).
- Refactor write_data_to_sim to one-shot writes with _has_implicit_actuators
  flag, replacing per-actuator _write_joint_subset loop.
- Replace ctypes DLPack hacks with clean binding.read()/binding.write()
  using stable cached views (requires ovphysx wheel with internal caching).
- Guard atexit handler against duplicate registration.
- Guard wrench accumulation against stale permanent forces when only
  instantaneous composer is active.
- Remove two_articulations.usda and 3 dependent test files per review.
  Shared iface tests (1080 pass, 0 fail) unaffected.
@marcodiiga
Copy link
Author

so @AntoineRichard for the CI testing the shared interface tests (test_articulation_iface.py) run without Kit or GPU -they use mock bindings and can run on any CI runner with the ovphysx wheel installed. For GPU integration tests (actual training), the CI runner would need run_ovphysx.sh + the ovphysx wheel. run_ovphysx.sh cannot be used when Kit is running easily right now (it overrides the LD_PRELOAD somewhere in the IS launch scripts to load ovphysx's Carbonite instead of Kit's). Making it permanent (i.e. replacing the Kit launcher entirely) is what we're attempting right now. Will keep you posted!

The PhysX backend's test_articulation.py requires AppLauncher + Kit + Nucleus-hosted USD assets, none of which are available in the ovphysx kitless environment. We can't run it as-is so I dropped that part. The shared test_articulation_iface.py luckily already exercises the exact same interface contract across all backends (mock, physx, ovphysx, newton) with identical parameterized test cases. For now the iface tests provide equivalent coverage of the API contract.

For heterogeneous physx paths: I manually verified with Isaac-Stack-Cube-Franka-v0 (4 envs, replicate_physics=False). Scene creation and heterogeneous cloning completed successfully. Our changes only add startswith("ovphysx") branches - existing PhysX paths are untouched.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants