
Model batching #109


Conversation

@adenzler-nvidia

Sharing a WIP for model batching. Closes #63.

This currently contains the following changes:

  • move the nworld parameter to the Model
  • add an "expand_fields" set to put_model so we know which arrays to tile
  • all other arrays get stride 0 in the first dimension, i.e. they are constant across all worlds

The set of arrays that can be made per-world is very much a best guess; I tried not to include anything topology-relevant, but I might have missed some.
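
For illustration, here is the stride-0 idea in plain NumPy (not the actual put_model code; names are hypothetical): broadcasting to a leading world dimension gives a zero-copy view whose first-dimension stride is 0, so every world reads the same underlying data.

```python
import numpy as np

nworld = 4
geom_margin = np.array([0.0, 0.01, 0.0])  # per-geom data, shared by all worlds

# Broadcasting to a leading world dimension gives a zero-copy view whose
# first-dimension stride is 0: every world reads the same underlying memory.
batched = np.broadcast_to(geom_margin, (nworld, geom_margin.shape[0]))
print(batched.strides)  # (0, 8) - stride 0 in the world dimension
print(batched[2, 1])    # 0.01, identical for every worldid
```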

@erikfrey commented Apr 5, 2025

Whew this is gonna be a monster of a PR :-) thank you so much for taking this on.

Two design considerations:

  1. Model is pretty heavy - in JAX we deal with this by letting the user explicitly choose which arrays to expand, and the others are left unbatched. See here, for example, where we produce a Model with only 5 batched fields:

https://github.com/google-deepmind/mujoco_playground/blob/main/mujoco_playground/_src/locomotion/g1/randomize.py#L92

I think this one is important.

  2. Less important than 1, but: could we allow a Model batch size different from nworld? I'd keep those two concepts separate. Let's say nworld is 4 and nmodel (or whatever we call it) is 2; then:

| Model Id | Data Id |
|----------|---------|
| 0        | 0       |
| 1        | 1       |
| 0        | 2       |
| 1        | 3       |

WDYT of these two design factors?

@erikfrey commented Apr 6, 2025

Oh! I just looked at the code and I see you have (1) handled already, very cool. So it's more about (2) - let me know what you think.

@adenzler-nvidia

Yeah, I think we got (1) covered nicely with the stride-0 arrays - still a few wonky things that I need to iron out, but I think the approach works.

For (2) - I didn't think of that, but it's certainly possible. Do you have a JAX example handy showing how you usually do this? I can see it being a bit weird for kernel writers, as there might be two different batching indices at that point, which might not be obvious from the get-go. I would prefer to avoid an indirect lookup (like `modelid = d.modelid[worldid]`) in every kernel, but if we can get the model index from the world index with a calculation, it should be fine. Depends a bit on the requirements here.

@eric-heiden

Can we assume nworld is always a multiple of nmodel, i.e. there is always a constant number N >= 1 of states per model? In that case we could just have `modelid = worldid % nmodel`.

@erikfrey commented Apr 7, 2025

@eric-heiden Sure, I think we could, but wouldn't `modelid = worldid % nmodel` work even if nworld is not a perfect multiple of nmodel?

@adenzler-nvidia One scenario I'm thinking of, which I think we'll want quite soon, is domain randomizing the objects in a scene, e.g. if we want to train a "grasp anything" type policy - actually forcing the user to have 8k objects on the model (or whatever nworld happens to be) might be prohibitive, not just for the user to supply but also for what we can populate on device.

Generally speaking, would it be safe to query the shape of the model array in question for the item to retrieve? So something like:

```python
marginid = worldid % m.geom_margin.shape[0]
margin = m.geom_margin[marginid, ...]
```
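
For concreteness, a hedged sketch of how that shape query could look inside a Warp kernel - the kernel and field names are illustrative, not the actual mujoco_warp code. Warp arrays expose their shape inside kernels, so the wrap-around needs no extra parameter:

```python
import warp as wp

@wp.kernel
def read_margin(
    geom_margin: wp.array2d(dtype=float),  # first dim: this field's batch size
    out: wp.array2d(dtype=float),          # (nworld, ngeom)
):
    worldid, geomid = wp.tid()
    # Wrap the world index into whatever batch size this field was given;
    # a field with first dim 1 is then simply constant across all worlds.
    modelid = worldid % geom_margin.shape[0]
    out[worldid, geomid] = geom_margin[modelid, geomid]

# launched over all worlds and geoms, e.g.:
# wp.launch(read_margin, dim=(nworld, ngeom), inputs=[margins, out])
```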

@adenzler-nvidia

There is an obvious need for running heterogeneous environments, for sure. I don't think it makes a lot of sense to implement that in this PR, as we still need to figure out how best to do it with an eye on performance. It also depends a lot on what exactly is randomized - if all the objects have the same tree topology, that is a different thing than suddenly having a different tree for each world. For example, just changing the collision geometry of a free-jointed object is going to be easy, but having two different robots is a completely different beast.

We need to think about that not only in terms of worlds, but rather in terms of which axis we should parallelize over. If we go fully heterogeneous, I think there needs to be a compilation step that reorders some of the subtrees so we can extract as much parallelism as possible. The tricky part at that point is how we remap all the parameters, whether that is an object id, a world id, or a model id. And then we need to think about de-duplication - how to make sure we're limiting memory usage.

I think we can separate API- and implementation-level concerns here a bit, but we need to be clear on the requirements to avoid driving ourselves into a corner. On top of that, we need to come up with something that still makes it possible to stay sane while developing and debugging the engine.

@erikfrey commented Apr 8, 2025

Oh, definitely agree we should not implement it now. My suggestion is exactly about avoiding driving ourselves into a corner, as you put it - if we explicitly tie nworld to both Model and Data, we may have to undo it later, possibly leading to wailing and gnashing of teeth among our users.

That's why I'm suggesting the option that seems to impose the fewest API assumptions we may have to undo later, which is to just query the array shape for the Model field in question and use that - is there somewhere that might bite us?

@adenzler-nvidia

Makes sense - I'll test drive the shape lookup today.

So in the end you would avoid having any world/model id parameter on the Model side altogether, and just allow any size in the first dimension? I'll have to test-drive how well this works from an API point of view.

My biggest concern is that it might become unnecessarily hard for us developers to figure out which part of the model has which size, etc. If we could somehow get the model arrays to wrap around, that would be cool.

@erikfrey commented Apr 9, 2025

> So in the end you would avoid having any world/model id parameter on the Model side altogether, and just allow any size in the first dimension? I'll have to test-drive how well this works from an API point of view.

Just for now until there's more clarity on what parameters we'll want that encompass all our model batching use cases.

@adenzler-nvidia

Went ahead and implemented the modulo approach here: https://github.com/adenzler-nvidia/mujoco_warp/tree/dev/adenzler/modulo-experiments

Roughly, what I did is have all the "expandable" model fields use shape 1 and stride 0 in the first dim, which is equivalent to the proposal above. But as a user you can then replace that with an array of any size in the first dim to get more or less arbitrary model->world mappings, and the kernels just do the modulo calculation (sketched below).
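
A minimal sketch of the user-facing side of that branch, assuming the defaults described above (names and sizes illustrative, not the branch's actual API):

```python
import numpy as np
import warp as wp

nmodel, ngeom = 2, 3  # illustrative sizes

# An expandable field starts out with shape (1, ngeom) and stride 0 in the
# first dim (shared by all worlds). To batch it, swap in an array with any
# first-dim size; kernels then wrap with worldid % nmodel as sketched above.
margins = np.random.uniform(0.0, 0.01, size=(nmodel, ngeom)).astype(np.float32)
geom_margin = wp.array(margins)  # shape (nmodel, ngeom) on device
```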

Interestingly, the modulo approach is quite a bit (~2%) faster on the humanoid. What is even more confusing is that the change in runtime is in the solver, which isn't even touched by any of the changes here. I would have expected the opposite, given that we do more computation during array indexing, which happens often and can be one of those silent performance killers. So right now my gut feeling is that I have a bug somewhere else that makes the solver terminate earlier or otherwise do less work. Need to figure that out.

That being said, I think the approach can work. My main reservations are:

  • it's going to be hard for devs to figure out when to do the modulo indexing - I'm pretty sure we're going to mess this up all the time.
  • it's not having a big effect on perf right now, but we're bottlenecked by, and suffering the consequences of, launching tons of tiny kernels with almost-empty threads. I'm a bit worried that a change like this is something we cannot reverse later if it starts becoming an issue, because it's part of the API. We also likely won't ever know for sure that it is an issue, because it won't show up as one big block in a profile, but rather as small inefficiencies scattered all over the place.

I'm a bit out of ideas but will try to explore some more. What's clear to me is that we need to find ways of not paying the price for the complexities of all kinds of batching if you don't need that complexity in the first place. Maybe some clever use of `wp.static` is important here.

@erikfrey

I hear you that these changes introduce more opportunities for bugs. Maybe between this PR and #148 it's worth thinking through some helper interfaces to minimize the surface area and verbosity. What do you think?

@adenzler-nvidia

Reading through this again, I think it's time to make a decision. It seems clear that we do not want modelid to be tied to worldid, which makes a lot of sense.

So the remaining question is whether to allow different sizes for different model parameters. I'm personally torn on that one - I like the idea of having one modelid that we can calculate upfront and then use for all model fields; it's simple. On the other hand, I can see the benefits of having different sizes, but I'm worried about the price of looking up shapes all the time.

I think I can make both approaches work, though, with some helpers. The goal should be to make the easy use case (nmodel == nworld) usable without too many restrictions.

WDYT?

@btaba commented Apr 25, 2025

@adenzler-nvidia pointed me to take a look at this PR. Here are my high-level thoughts:

  1. I strongly suspect `expand_fields` will not play nicely with our JAX workflow, but I could be wrong (I'd need to go through the motions). The modulo PR seems like the better approach.
  2. If there's a concern about maintainability with the modulo approach, we can override `__getitem__` on model fields (see the sketch below).

If there isn't a big rush to get this PR in, I would wait for a real use case to battle-test the implementation (i.e. JAX interop with MJX and domain randomization hooked up, which is more or less imminent).
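
For the maintainability point in 2, a minimal host-side sketch of such an override (hypothetical helper, plain Python; Warp kernels would still need the explicit modulo):

```python
class BatchedField:
    """Hypothetical wrapper hiding the modulo lookup behind plain indexing."""

    def __init__(self, arr):
        self._arr = arr  # first dim is the model batch, any size >= 1

    def __getitem__(self, idx):
        worldid, *rest = idx if isinstance(idx, tuple) else (idx,)
        # Wrap the world index into this field's actual batch size.
        return self._arr[(worldid % self._arr.shape[0], *rest)]

# e.g. BatchedField(margins)[worldid, geomid] behaves like a size-nworld
# array regardless of the field's real first-dim size.
```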

@adenzler-nvidia

Heads-up: I'm currently working on the next version in a new branch, and will retarget this PR as soon as I have all the tests passing. Going for a modulo version with helpers.

So the expand_fields approach is dead - I agree it's unlikely to play well with JAX. Happy to test-drive this with a real use case; let me know when you have something ready.

@adenzler-nvidia

New PR: #195

@adenzler-nvidia commented Apr 28, 2025

Plan after offline discussion with @erikfrey:

  • let's forget about the modulo indexing and just have all the batched fields be size nworld, with the user repeating data for now (see the sketch after this list) - but only expand fields that need different data for different model batches.
  • use the kernel analyzer to enforce correctness around which fields can be expanded and which can't.
  • wait on this until #148 ([WIP] API changes to address multiple issues) is merged.
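
A minimal sketch of the "user repeats data" part of this plan (sizes and names illustrative):

```python
import numpy as np
import warp as wp

nworld, nmodel, ngeom = 4, 2, 3  # illustrative sizes

# Draw nmodel distinct randomizations, then repeat them out to nworld rows so
# every batched field is simply size nworld in its first dimension.
margins = np.random.uniform(0.0, 0.01, size=(nmodel, ngeom))
tiled = np.tile(margins, (nworld // nmodel, 1)).astype(np.float32)  # (nworld, ngeom)
geom_margin = wp.array(tiled)  # kernels index it directly as [worldid, geomid]
```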

@adenzler-nvidia

Closing in favor of #231.
