Device Management in Multi-GPU systems, v2 #182

Open
wants to merge 41 commits into main

Conversation

adenzler-nvidia
Collaborator

Replaces #130 after discussions about best practices.

All the API functions now use wp.ScopedDevice. This has the benefit of restoring the previously active context after the function returns, which avoids issues if you're doing other GPU work in between MJWarp simulations.

We use the device passed to put_model as the source of truth. So this does not enable any fancy multi-GPU execution; it only makes sure that you use the same GPU for everything MJWarp does.
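
For concreteness, a minimal sketch of the pattern (the device attribute on the returned model and the SimpleNamespace stand-ins are hypothetical, not the actual MJWarp types):

from types import SimpleNamespace

import warp as wp


def put_model(mjm, device=None):
    # Resolve the device once; it becomes the source of truth for everything else.
    device = wp.get_device(device)
    with wp.ScopedDevice(device):
        qpos0 = wp.array(mjm.qpos0, dtype=wp.float32)  # model arrays land on `device`
    return SimpleNamespace(qpos0=qpos0, device=device)  # stand-in for the real Model


def step(m, d):
    # Every API function pins its work to the model's device and restores
    # the previously active device/context when it returns.
    with wp.ScopedDevice(m.device):
        pass  # launch kernels here; they all run on m.device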

I opted for a test that checks that all the API functions have a ScopedDevice block; I couldn't figure out a better way to test this.
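
Roughly, such a check could look like the following sketch (the mujoco_warp import name and the exact assertion are assumptions, not necessarily the test in this PR):

import inspect

import mujoco_warp as mjwarp  # assumed import name


def test_api_functions_use_scoped_device():
    # Hypothetical check: every public top-level API function should contain
    # a `with wp.ScopedDevice(...)` block somewhere in its source.
    for name, fn in inspect.getmembers(mjwarp, inspect.isfunction):
        if name.startswith("_"):
            continue
        assert "wp.ScopedDevice" in inspect.getsource(fn), (
            f"{name} does not use wp.ScopedDevice"
        )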

Also, I'm really sorry for whoever has to review this, but basically all the changes are indentation changes.

@erikfrey
Collaborator

@adenzler-nvidia sorry to ask a really dumb question, but what is the use case here?

At first glance, I would have expected something like wp.ScopedDevice() to be a user concept, e.g. I would expect the user (not us) to do something like:

device = ...
m = io.put_model(..., device=device)
d = io.put_data(..., device=device)
with wp.ScopedDevice(device):
    mjwarp.step(m, d)

@adenzler-nvidia
Collaborator Author

Valid question - we could certainly push this entirely to the user. Our own experience as users tells us that it's very easy to forget, though, and strange things start happening once you have systems with multiple GPUs and other parts of your workflow also doing GPU work.

The trickiness is mostly in how Warp selects the default device. This default device can change if any other part of your system changes the currently bound CUDA context, and at that point you either pay the price for data migration/remote access or some things stop working entirely. We also need to make sure we're not messing with the currently bound context of other code on the system, by restoring the existing state once we're done.
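
As a small illustration of that save/restore behavior, here's a sketch assuming a machine with at least two CUDA devices ("cuda:1" is just an example):

import warp as wp

wp.init()
print(wp.get_device())             # whatever device is currently active, e.g. cuda:0

with wp.ScopedDevice("cuda:1"):    # assumes a second GPU is present
    a = wp.zeros(8, dtype=wp.float32)
    print(a.device)                # cuda:1 - allocations inside the block follow it

print(wp.get_device())             # the previously active device is restored on exit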

Given that MJWarp is likely going to be used in tandem with ML workloads, rendering, user code that also uses Warp, etc., this introduces a bit of safety for everyone.

We definitely want to reconsider this once we really think about multi-GPU with MJWarp. Right now this just follows the best practices we use in all of our other Warp code. That being said - I'm also not a fan of enforcing ScopedDevice on all interface functions, and the API could be nicer if you only had to specify the device for the model and it were inferred for the data. But that would mean passing the model to put_data.
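
A rough sketch of what that alternative signature could look like (hypothetical, not the current API; the SimpleNamespace is a stand-in for the real Data struct):

from types import SimpleNamespace

import warp as wp


def put_data(mjm, mjd, m):
    # Hypothetical variant: put_data takes the Model and inherits its device,
    # so the user only specifies the device once, in put_model.
    with wp.ScopedDevice(m.device):
        qpos = wp.array(mjd.qpos, dtype=wp.float32)  # allocated on m.device
    return SimpleNamespace(qpos=qpos, device=m.device)  # stand-in for the real Data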

Curious to hear opinions though.

@btaba
Collaborator

btaba commented Apr 25, 2025

@adenzler-nvidia what's holding things back from considering multi-GPU in the shorter term? Coming from JAX and some of our recent work in MuJoCo land, multi-GPU has been seamless to use and critically important for our research velocity.

@adenzler-nvidia
Collaborator Author

Nothing specific - happy to talk about multi-GPU. Interested in how you guys have been using it so far.

I see this PR as a first stepping stone to make sure we get the device right for a single GPU, but we can go for more immediately after that. The main point is to move from implicit device selection to something explicit.

@adenzler-nvidia
Collaborator Author

Plan after offline discussion with @erikfrey:

  • Let's abandon this PR and gather requirements for multi-GPU use cases immediately. The goal is not to implement it in the beta, but to get a head start on the changes/additions we need in Warp to make sure we're ready for this ASAP.
