Can you be more specific about what is 5x slower? How are you measuring this? Have you isolated compilation time from runtime? (See https://docs.jax.dev/en/latest/benchmarking.html#benchmarking-jax-code for details.)
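The pattern the linked page describes can be sketched like this: time the first call (which includes tracing and XLA compilation) separately from warmed-up calls, and use `block_until_ready()` so JAX's asynchronous dispatch doesn't end the timing early. `step` here is an illustrative stand-in, not code from the repo.

```python
# Sketch: separate compilation time from steady-state runtime when
# benchmarking a jitted function. `step` is a hypothetical stand-in.
import time

import jax
import jax.numpy as jnp


@jax.jit
def step(x):
    return jnp.sin(x) ** 2 + jnp.cos(x) ** 2


x = jnp.arange(1024.0)

# First call: traces the Python function and compiles it with XLA.
t0 = time.perf_counter()
step(x).block_until_ready()
first_call = time.perf_counter() - t0

# Warmed-up calls hit the compilation cache; block_until_ready() makes
# sure the asynchronous dispatch has finished before the clock stops.
t0 = time.perf_counter()
for _ in range(100):
    step(x).block_until_ready()
steady_state = (time.perf_counter() - t0) / 100

print(f"first call: {first_call:.4f}s, steady state: {steady_state:.6f}s")
```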

-
tl;dr
DDPG is a value-based RL algorithm that needs a replay buffer to store experiences (obs, action, reward, etc.), implemented as a circular buffer in my case. Every training step, one step of experience is pushed into the buffer. Due to the purely functional nature of jax.jit, a large replay buffer can make pushing experience expensive in both time and memory. So it is useful to donate the buffer state arrays to ensure an in-place modification, which saves a lot of runtime.
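As a minimal sketch of that pattern (with hypothetical names — `push`, `storage`, `cursor` — not the author's actual code): donating the big storage array lets XLA reuse its memory for the output instead of allocating a fresh array and copying. Note that donation is a hint; XLA may still copy if it cannot alias the buffer.

```python
# Sketch of a donated circular-buffer push; names are hypothetical.
from functools import partial

import jax
import jax.numpy as jnp

CAPACITY = 100_000  # large enough that a per-step copy would hurt
OBS_DIM = 8


# donate_argnums=0 donates the big storage array, allowing XLA to reuse
# its memory for the output instead of allocating a new one and copying.
@partial(jax.jit, donate_argnums=0)
def push(storage, cursor, obs):
    storage = storage.at[cursor].set(obs)  # write one experience
    cursor = (cursor + 1) % CAPACITY       # circular-buffer wraparound
    return storage, cursor


storage = jnp.zeros((CAPACITY, OBS_DIM))
cursor = jnp.array(0)
storage, cursor = push(storage, cursor, jnp.ones(OBS_DIM))
```

After the call, the old `storage` reference must not be reused — donation invalidates it — so the result is rebound to the same name.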
But I found that jitting the outermost function train_one_step doesn't save any runtime (almost the same); it seems the compiler neglects the donation decorators and decides to allocate a new buffer state array and copy the whole old buffer. I failed to validate this: I inspected the jaxpr, but it carries no information about whether large memory is copied. What improves the performance significantly (5x) is to split train_one_step into rollout_and_push and update_model. Specifically, donate the large buffer_state argument of the rollout_and_push function, which is the critical point for ensuring an under-the-hood in-place update of the large buffer. So why does jitting everything together degrade performance? My guess, from XLA's perspective:
code snippet
If you jit the outermost function train_one_step, training is 5x slower!
train_one_step jitted: 37.45s
rollout_and_push and update_model jitted separately: 7.8s
benchmark code:
function definition:
Complete code
https://github.com/zzhixin/jaxrl-learning/tree/b2a9d1ab64a413ed1cfcb03352f1c6c3a2b2be4f
Just run the benchmark_ddpg.py file.
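On the validation point above: the jaxpr indeed won't show whether a donated input was aliased, but the compiled HLO will. A hedged sketch using jit's ahead-of-time API (`lower` / `compile` / `as_text`); the exact annotation text varies across JAX/XLA versions and backends, and on some backends donation may simply be ignored with a warning.

```python
# Sketch: inspect the compiled HLO (not the jaxpr) to see whether a
# donated input was actually aliased to the output. `push` is a
# hypothetical stand-in, not the repo's code.
from functools import partial

import jax
import jax.numpy as jnp


@partial(jax.jit, donate_argnums=0)
def push(storage, obs):
    return storage.at[0].set(obs)


storage = jnp.zeros((4096, 8))
obs = jnp.ones(8)

compiled = push.lower(storage, obs).compile()
hlo_text = compiled.as_text()

# If donation took effect, the entry computation carries an
# input_output_alias annotation tying the donated input to the output.
print("input_output_alias" in hlo_text)
```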
Comments
Generally, one would expect performance no worse when jitting the outer function than when jitting only the inner functions separately. Unless I did something stupid, this case shows that jitting the outer function can sometimes lead to worse performance.
I suppose this is not a new discovery, but the performance drop still surprised me. In theory, such degradation from jitting the outer function strongly suggests the compiler could be made smarter once the failure mode is identified. In other words, if, for such cases, the compiler were smart enough to effectively jit the pieces separately even when one blindly jits the outer function, the user experience would be more consistent: one could follow the principle of "jit the outermost function as much as possible" with more confidence.
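For concreteness, the two arrangements being compared can be skeletonized as follows. The function bodies are placeholders, not the author's DDPG code; only the jit and donation boundaries matter here.

```python
# Hedged skeleton of the two jit arrangements; bodies are placeholders.
import jax
import jax.numpy as jnp

CAPACITY = 1024


def rollout_and_push(buffer_state, cursor, obs):
    buffer_state = buffer_state.at[cursor].set(obs)  # push one experience
    return buffer_state, (cursor + 1) % CAPACITY


def update_model(params, buffer_state):
    return params + 1e-3 * buffer_state.mean()  # stand-in for a real update


def train_one_step(params, buffer_state, cursor, obs):
    buffer_state, cursor = rollout_and_push(buffer_state, cursor, obs)
    params = update_model(params, buffer_state)
    return params, buffer_state, cursor


# Arrangement A (slow in the report): one outer jit, with the big
# buffer donated at the outer boundary.
train_outer = jax.jit(train_one_step, donate_argnums=1)

# Arrangement B (fast in the report): the two stages jitted separately,
# with the big buffer donated exactly where it is rewritten.
push_jit = jax.jit(rollout_and_push, donate_argnums=0)
update_jit = jax.jit(update_model)
```

Both arrangements compute the same values; the question in this thread is which one lets XLA actually honor the donation of buffer_state.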
Questions
system info
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz