Description
Describe the bug
This is not a correctness bug but rather a performance issue that, if addressed, could improve training throughput.
While profiling the training process, we found a non-trivial amount of time spent in _apply_actuator_model, which traces down to the compute function of the delayed PD actuator. This profiling was done with the sampling Python profiler py-spy, so it gives an application-level view of where time is spent and is rather coarse-grained.
We created a standalone minimal reproduction with just the delayed actuators and profiled it with the torch profiler. The biggest CPU time slice goes to excessive calls to aten::item, an artifact of the max_length property being stored as a tensor on the GPU when it should be a plain Python integer. The second biggest contributor is the calls to torch.any, which should evaluate to false on most steps once the policy warms up, since resets become infrequent.
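For illustration, here is a minimal sketch of the pattern (hypothetical names, assuming a CUDA device; this is not the actual DelayBuffer code): reading a scalar that lives on the GPU through .item() forces a device-to-host copy and a sync on every call, while resolving it once to a plain Python int keeps the hot path free of syncs.

import torch

device = "cuda:0"

# Pattern seen in the profile: a value that is conceptually an int is kept on the GPU,
# so every read goes through aten::item / aten::_local_scalar_dense (a blocking DtoH copy).
max_length_gpu = torch.tensor(10, device=device)

def compute_with_sync(data: torch.Tensor) -> torch.Tensor:
    k = max_length_gpu.item()  # blocks until the scalar is copied back to the host
    return data[:, :k]

# Possible direction for a fix (hypothetical): resolve the value to a Python int once,
# outside the hot path, so compute() never has to synchronize for it.
max_length_int = int(max_length_gpu.item())

def compute_without_sync(data: torch.Tensor) -> torch.Tensor:
    return data[:, :max_length_int]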
Attached is the complete torch profile for an unoptimized run:
Time taken: 317.0001810067333 ms
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::copy_ 0.60% 1.838ms 1.55% 4.721ms 5.245us 1.489ms 17.98% 1.489ms 1.655us 900
Memcpy DtoD (Device -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 1.489ms 17.98% 1.489ms 1.655us 900
aten::any 1.14% 3.469ms 1.76% 5.377ms 8.962us 1.394ms 16.82% 1.394ms 2.323us 600
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 1.394ms 16.82% 1.394ms 2.323us 600
aten::index 0.76% 2.318ms 1.38% 4.205ms 14.017us 1.241ms 14.97% 1.241ms 4.135us 300
void at::native::index_elementwise_kernel<128, 4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.241ms 14.97% 1.241ms 4.135us 300
aten::item 0.31% 943.349us 89.39% 273.006ms 227.505us 0.000us 0.00% 1.188ms 0.990us 1200
aten::_local_scalar_dense 0.81% 2.460ms 89.08% 272.062ms 226.719us 1.185ms 14.31% 1.188ms 0.990us 1200
Memcpy DtoH (Device -> Pinned) 0.00% 0.000us 0.00% 0.000us 0.000us 1.185ms 14.31% 1.185ms 0.988us 1200
aten::minimum 0.37% 1.116ms 0.63% 1.935ms 6.451us 799.949us 9.66% 799.949us 2.666us 300
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 799.949us 9.66% 799.949us 2.666us 300
void at::native::vectorized_elementwise_kernel<2, at... 0.00% 0.000us 0.00% 0.000us 0.000us 682.763us 8.24% 682.763us 1.138us 600
aten::eq 0.84% 2.574ms 1.35% 4.112ms 6.853us 681.452us 8.23% 681.452us 1.136us 600
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 681.452us 8.23% 681.452us 1.136us 600
aten::sub 0.72% 2.191ms 1.20% 3.656ms 6.093us 663.177us 8.01% 663.177us 1.105us 600
aten::is_nonzero 0.17% 512.324us 45.53% 139.053ms 231.755us 0.000us 0.00% 570.631us 0.951us 600
aten::remainder 0.52% 1.581ms 0.78% 2.367ms 7.890us 487.948us 5.89% 487.948us 1.626us 300
void at::native::vectorized_elementwise_kernel<2, at... 0.00% 0.000us 0.00% 0.000us 0.000us 487.948us 5.89% 487.948us 1.626us 300
aten::clone 0.21% 651.118us 0.96% 2.918ms 9.727us 0.000us 0.00% 463.562us 1.545us 300
aten::add_ 0.28% 859.122us 0.53% 1.619ms 5.398us 342.886us 4.14% 342.886us 1.143us 300
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 305.406ms
Self CUDA time total: 8.284ms
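The aten::any, aten::is_nonzero, and part of the aten::item rows come from evaluating a GPU boolean reduction inside a Python if, which also synchronizes with the device on every call. A small sketch of that pattern and a possible mitigation (the flag name is hypothetical, not part of the current API):

import torch

device = "cuda:0"
reset_mask = torch.zeros(4096, dtype=torch.bool, device=device)

# Using a GPU reduction as a Python condition launches the reduction kernel (aten::any)
# and then syncs to read the result (aten::is_nonzero -> aten::item -> DtoH copy),
# even though the answer is almost always False after warm-up.
if torch.any(reset_mask):
    reset_ids = reset_mask.nonzero(as_tuple=False).squeeze(-1)

# Possible mitigation (hypothetical): let the caller pass a host-side flag indicating
# whether anything was reset this step, so the hot path can skip the GPU check entirely.
any_reset_this_step = False  # maintained on the CPU by the reset logic
if any_reset_this_step and torch.any(reset_mask):
    reset_ids = reset_mask.nonzero(as_tuple=False).squeeze(-1)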
This is the benchmark used:
import torch
import torch.profiler as prof
import time

from delayed_buffer import DelayBuffer

HISTORY_LENGTH = 10
BATCH_SIZE = 4096
DEVICE = "cuda:0"

if __name__ == "__main__":
    buffer1 = DelayBuffer(HISTORY_LENGTH + 1, BATCH_SIZE, DEVICE)
    buffer2 = DelayBuffer(HISTORY_LENGTH + 1, BATCH_SIZE, DEVICE)
    buffer3 = DelayBuffer(HISTORY_LENGTH + 1, BATCH_SIZE, DEVICE)
    data = torch.randn(BATCH_SIZE, 100).to(DEVICE)

    # Warm-up runs
    for _ in range(5):
        buffer1.compute(data)
        buffer2.compute(data)
        buffer3.compute(data)
    torch.cuda.synchronize()
    print("Warm-up done")

    with prof.profile(
        activities=[prof.ProfilerActivity.CPU, prof.ProfilerActivity.CUDA],
        record_shapes=True,
    ) as p:
        start_time = time.perf_counter()
        for i in range(100):
            buffer1.compute(data)
            buffer2.compute(data)
            buffer3.compute(data)
        end_time = time.perf_counter()

    print(f"Time taken: {(end_time - start_time) * 1000} ms")
    p.export_chrome_trace("trace.json")
    print(p.key_averages().table(sort_by="cuda_time_total", row_limit=20))
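Since the table above is sorted by CUDA time while most of the cost is CPU-side blocking (Self CPU total 305 ms vs. Self CUDA total 8.3 ms), sorting the same summary by self CPU time makes the aten::item / aten::_local_scalar_dense stalls stand out more clearly; a possible extra line at the end of the script:

print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))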
Steps to reproduce
The standalone benchmark above is the minimal example: running it with the DelayBuffer implementation produces the profile shown in the description.
System Info
Describe the characteristic of your environment:
- Commit: [e.g. 8f3b9ca]
- Isaac Sim Version: [e.g. 5.0, this can be obtained by cat ${ISAACSIM_PATH}/VERSION]
- OS: [e.g. Ubuntu 22.04]
- GPU: [e.g. RTX 5090]
- CUDA: [e.g. 12.8]
- GPU Driver: [e.g. 553.05, this can be seen by using the nvidia-smi command.]
Additional context
Add any other context about the problem here.
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have checked that the issue is not in running Isaac Sim itself and is related to the repo
Acceptance Criteria
Add the criteria for which this task is considered done. If not known at issue creation time, you can add this once the issue is assigned.
- Criteria 1
- Criteria 2