
[Bug Report] delayed_buffer and its internal circular_buffer performance #4274

@iansseijelly

Description


Describe the bug

This is not a correctness bug but a performance issue; addressing it could improve training throughput.

While profiling the training process, we found a non-trivial amount of time spent in _apply_actuator_model, which traces down to the compute function of the delayed PD actuator. This profile was taken with the sampling Python profiler py-spy, so it gives an application-level, fairly coarse-grained view of where time is spent.

[py-spy profile screenshot: time spent in _apply_actuator_model]

We created a standalone MVP with just the delayed actuators and profiled it with the torch profiler. The biggest CPU time slice goes to excessive calls to aten::item, an artifact of the max_length property, which should be a plain integer but is stored as a tensor on the GPU. The second biggest contributor is the calls to torch.any, which should evaluate to false once the policy warms up, since resets are infrequent.
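To make the pattern concrete, below is a minimal sketch of what we believe is happening. The class BufferSketch and its members (_max_len_gpu, _num_pushes, _storage) are hypothetical illustrations, not the actual CircularBuffer code: reading a scalar such as max_length back from a GPU tensor with .item(), or branching in Python on the result of torch.any(...), forces a device-to-host copy on every call, whereas caching the scalar as a plain Python int and replacing the data-dependent branch with an unconditional masked write keeps the whole step on the device.

import torch

class BufferSketch:
    """Hypothetical stand-in for the circular buffer; only the sync pattern matters here."""

    def __init__(self, max_len: int, batch_size: int, num_feats: int, device: str):
        # anti-pattern: the scalar max length lives on the GPU as a tensor
        self._max_len_gpu = torch.tensor(max_len, device=device)
        # possible fix: cache it as a plain Python int at construction time
        self._max_len = int(max_len)
        self._num_pushes = torch.zeros(batch_size, dtype=torch.long, device=device)
        self._storage = torch.zeros(max_len, batch_size, num_feats, device=device)

    def append_slow(self, data: torch.Tensor):
        # aten::item: reading the GPU scalar back triggers a Memcpy DtoH every call
        max_len = int(self._max_len_gpu.item())
        slot = self._num_pushes % max_len
        # aten::any + aten::is_nonzero: the Python `if` needs the boolean on the host,
        # so it synchronizes even when the condition is false after warm-up
        if torch.any(self._num_pushes == 0):
            self._storage[:] = data  # back-fill the whole history on the first push
        batch_ids = torch.arange(data.shape[0], device=data.device)
        self._storage[slot, batch_ids] = data
        self._num_pushes += 1

    def append_fast(self, data: torch.Tensor):
        # the cached Python int keeps the modulo on-device, with no host round-trip
        slot = self._num_pushes % self._max_len
        # unconditional masked write instead of a data-dependent Python branch
        first = (self._num_pushes == 0).view(1, -1, 1)
        self._storage = torch.where(first, data.unsqueeze(0), self._storage)
        batch_ids = torch.arange(data.shape[0], device=data.device)
        self._storage[slot, batch_ids] = data
        self._num_pushes += 1

For reference, the 300 compute calls in the profile below (100 iterations x 3 buffers) produce 1200 aten::item and 600 aten::any calls, i.e. roughly four host round-trips per call, which lines up with this kind of pattern.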
Attached is the complete torch profile for an unoptimized run:

Time taken: 317.0001810067333 ms
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            aten::copy_         0.60%       1.838ms         1.55%       4.721ms       5.245us       1.489ms        17.98%       1.489ms       1.655us           900  
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us       1.489ms        17.98%       1.489ms       1.655us           900  
                                              aten::any         1.14%       3.469ms         1.76%       5.377ms       8.962us       1.394ms        16.82%       1.394ms       2.323us           600  
void at::native::reduce_kernel<512, 1, at::native::R...         0.00%       0.000us         0.00%       0.000us       0.000us       1.394ms        16.82%       1.394ms       2.323us           600  
                                            aten::index         0.76%       2.318ms         1.38%       4.205ms      14.017us       1.241ms        14.97%       1.241ms       4.135us           300  
void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.241ms        14.97%       1.241ms       4.135us           300  
                                             aten::item         0.31%     943.349us        89.39%     273.006ms     227.505us       0.000us         0.00%       1.188ms       0.990us          1200  
                              aten::_local_scalar_dense         0.81%       2.460ms        89.08%     272.062ms     226.719us       1.185ms        14.31%       1.188ms       0.990us          1200  
                         Memcpy DtoH (Device -> Pinned)         0.00%       0.000us         0.00%       0.000us       0.000us       1.185ms        14.31%       1.185ms       0.988us          1200  
                                          aten::minimum         0.37%       1.116ms         0.63%       1.935ms       6.451us     799.949us         9.66%     799.949us       2.666us           300  
void at::native::unrolled_elementwise_kernel<at::nat...         0.00%       0.000us         0.00%       0.000us       0.000us     799.949us         9.66%     799.949us       2.666us           300  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us     682.763us         8.24%     682.763us       1.138us           600  
                                               aten::eq         0.84%       2.574ms         1.35%       4.112ms       6.853us     681.452us         8.23%     681.452us       1.136us           600  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     681.452us         8.23%     681.452us       1.136us           600  
                                              aten::sub         0.72%       2.191ms         1.20%       3.656ms       6.093us     663.177us         8.01%     663.177us       1.105us           600  
                                       aten::is_nonzero         0.17%     512.324us        45.53%     139.053ms     231.755us       0.000us         0.00%     570.631us       0.951us           600  
                                        aten::remainder         0.52%       1.581ms         0.78%       2.367ms       7.890us     487.948us         5.89%     487.948us       1.626us           300  
void at::native::vectorized_elementwise_kernel<2, at...         0.00%       0.000us         0.00%       0.000us       0.000us     487.948us         5.89%     487.948us       1.626us           300  
                                            aten::clone         0.21%     651.118us         0.96%       2.918ms       9.727us       0.000us         0.00%     463.562us       1.545us           300  
                                             aten::add_         0.28%     859.122us         0.53%       1.619ms       5.398us     342.886us         4.14%     342.886us       1.143us           300  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 305.406ms
Self CUDA time total: 8.284ms

This is the benchmark used:

import torch
import torch.profiler as prof
import time

from delayed_buffer import DelayBuffer

HISTORY_LENGTH = 10
BATCH_SIZE = 4096
DEVICE = "cuda:0"

if __name__ == "__main__":
    # three independent delay buffers sharing the same history length and batch size
    buffer1 = DelayBuffer(HISTORY_LENGTH+1, BATCH_SIZE, DEVICE)
    buffer2 = DelayBuffer(HISTORY_LENGTH+1, BATCH_SIZE, DEVICE)
    buffer3 = DelayBuffer(HISTORY_LENGTH+1, BATCH_SIZE, DEVICE)
    data = torch.randn(BATCH_SIZE, 100).to(DEVICE)
    # Warm-up runs
    for _ in range(5):
        buffer1.compute(data)
        buffer2.compute(data)
        buffer3.compute(data)
    torch.cuda.synchronize()
    print("Warm-up done")

    with prof.profile(
        activities=[prof.ProfilerActivity.CPU, prof.ProfilerActivity.CUDA],
        record_shapes=True) as p:
        start_time = time.perf_counter()
        for _ in range(100):  # 100 iterations x 3 buffers = 300 compute calls, matching the call counts above
            buffer1.compute(data)
            buffer2.compute(data)
            buffer3.compute(data)
        end_time = time.perf_counter()
    print(f"Time taken: {(end_time - start_time)*1000} ms")
    p.export_chrome_trace("trace.json")
    print(p.key_averages().table(sort_by="cuda_time_total", row_limit=20))
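As an aside, the implicit synchronizations can also be surfaced without the profiler by enabling PyTorch's CUDA sync debug mode around a few iterations of the compute loop. This is purely an optional diagnostic, not part of the original benchmark; buffer1 and data refer to the objects defined in the script above.

# optional diagnostic: warn on every operation that implicitly synchronizes with
# the host, e.g. the aten::item calls behind `.item()` and `if torch.any(...)`
torch.cuda.set_sync_debug_mode("warn")
for _ in range(10):
    buffer1.compute(data)
torch.cuda.set_sync_debug_mode("default")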

Steps to reproduce

Run the standalone benchmark above (three DelayBuffer instances, batch size 4096, history length 10, 100 timed iterations) and inspect the torch profiler table for the aten::item and aten::any calls.

System Info

Describe the characteristic of your environment:

  • Commit: [e.g. 8f3b9ca]
  • Isaac Sim Version: [e.g. 5.0, this can be obtained by cat ${ISAACSIM_PATH}/VERSION]
  • OS: [e.g. Ubuntu 22.04]
  • GPU: [e.g. RTX 5090]
  • CUDA: [e.g. 12.8]
  • GPU Driver: [e.g. 553.05, this can be seen by using nvidia-smi command.]


Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have checked that the issue is not in running Isaac Sim itself and is related to the repo

Acceptance Criteria

Add the criteria for which this task is considered done. If not known at issue creation time, you can add this once the issue is assigned.

