Description
Describe the bug
This is not a correctness bug but rather a performance issue that, if addressed, could improve training throughput.
While profiling the training process, we found a non-trivial amount of time spent in _apply_actuator_model, which traces down to the compute function of the delayed PD actuator. This profiling was done with the sampling Python profiler py-spy, so it gives an application-level view of where time is spent and is rather coarse-grained.
We created a standalone minimal reproduction with just the delayed actuators and profiled it with the torch profiler. The biggest CPU time slice goes to excessive calls to aten::item, an artifact of the max_length property being stored as a tensor on the GPU when it should be a plain Python integer. The second biggest contributor is the calls to torch.any, which should evaluate to false on most steps once the policy warms up, since resets become infrequent.
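For illustration, here is a minimal sketch of the pattern (hypothetical names, assuming a CUDA device; this is not the actual DelayBuffer code): reading a scalar that lives on the GPU through .item() forces a device-to-host copy and a sync on every call, while resolving it once to a plain Python int keeps the hot path free of syncs.

import torch

device = "cuda:0"

# Pattern seen in the profile: a value that is conceptually an int is kept on the GPU,
# so every read goes through aten::item / aten::_local_scalar_dense (a blocking DtoH copy).
max_length_gpu = torch.tensor(10, device=device)

def compute_with_sync(data: torch.Tensor) -> torch.Tensor:
    k = max_length_gpu.item()  # blocks until the scalar is copied back to the host
    return data[:, :k]

# Possible direction for a fix (hypothetical): resolve the value to a Python int once,
# outside the hot path, so compute() never has to synchronize for it.
max_length_int = int(max_length_gpu.item())

def compute_without_sync(data: torch.Tensor) -> torch.Tensor:
    return data[:, :max_length_int]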
Attached is the complete torch profile for an unoptimized run:
Time taken: 317.0001810067333 ms
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::copy_ 0.60% 1.838ms 1.55% 4.721ms 5.245us 1.489ms 17.98% 1.489ms 1.655us 900
Memcpy DtoD (Device -> Device) 0.00% 0.000us 0.00% 0.000us 0.000us 1.489ms 17.98% 1.489ms 1.655us 900
aten::any 1.14% 3.469ms 1.76% 5.377ms 8.962us 1.394ms 16.82% 1.394ms 2.323us 600
void at::native::reduce_kernel<512, 1, at::native::R... 0.00% 0.000us 0.00% 0.000us 0.000us 1.394ms 16.82% 1.394ms 2.323us 600
aten::index 0.76% 2.318ms 1.38% 4.205ms 14.017us 1.241ms 14.97% 1.241ms 4.135us 300
void at::native::index_elementwise_kernel<128, 4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 1.241ms 14.97% 1.241ms 4.135us 300
aten::item 0.31% 943.349us 89.39% 273.006ms 227.505us 0.000us 0.00% 1.188ms 0.990us 1200
aten::_local_scalar_dense 0.81% 2.460ms 89.08% 272.062ms 226.719us 1.185ms 14.31% 1.188ms 0.990us 1200
Memcpy DtoH (Device -> Pinned) 0.00% 0.000us 0.00% 0.000us 0.000us 1.185ms 14.31% 1.185ms 0.988us 1200
aten::minimum 0.37% 1.116ms 0.63% 1.935ms 6.451us 799.949us 9.66% 799.949us 2.666us 300
void at::native::unrolled_elementwise_kernel<at::nat... 0.00% 0.000us 0.00% 0.000us 0.000us 799.949us 9.66% 799.949us 2.666us 300
void at::native::vectorized_elementwise_kernel<2, at... 0.00% 0.000us 0.00% 0.000us 0.000us 682.763us 8.24% 682.763us 1.138us 600
aten::eq 0.84% 2.574ms 1.35% 4.112ms 6.853us 681.452us 8.23% 681.452us 1.136us 600
void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 681.452us 8.23% 681.452us 1.136us 600
aten::sub 0.72% 2.191ms 1.20% 3.656ms 6.093us 663.177us 8.01% 663.177us 1.105us 600
aten::is_nonzero 0.17% 512.324us 45.53% 139.053ms 231.755us 0.000us 0.00% 570.631us 0.951us 600
aten::remainder 0.52% 1.581ms 0.78% 2.367ms 7.890us 487.948us 5.89% 487.948us 1.626us 300
void at::native::vectorized_elementwise_kernel<2, at... 0.00% 0.000us 0.00% 0.000us 0.000us 487.948us 5.89% 487.948us 1.626us 300
aten::clone 0.21% 651.118us 0.96% 2.918ms 9.727us 0.000us 0.00% 463.562us 1.545us 300
aten::add_ 0.28% 859.122us 0.53% 1.619ms 5.398us 342.886us 4.14% 342.886us 1.143us 300
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 305.406ms
Self CUDA time total: 8.284ms
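The aten::any, aten::is_nonzero, and part of the aten::item rows come from evaluating a GPU boolean reduction inside a Python if, which also synchronizes with the device on every call. A small sketch of that pattern and a possible mitigation (the flag name is hypothetical, not part of the current API):

import torch

device = "cuda:0"
reset_mask = torch.zeros(4096, dtype=torch.bool, device=device)

# Using a GPU reduction as a Python condition launches the reduction kernel (aten::any)
# and then syncs to read the result (aten::is_nonzero -> aten::item -> DtoH copy),
# even though the answer is almost always False after warm-up.
if torch.any(reset_mask):
    reset_ids = reset_mask.nonzero(as_tuple=False).squeeze(-1)

# Possible mitigation (hypothetical): let the caller pass a host-side flag indicating
# whether anything was reset this step, so the hot path can skip the GPU check entirely.
any_reset_this_step = False  # maintained on the CPU by the reset logic
if any_reset_this_step and torch.any(reset_mask):
    reset_ids = reset_mask.nonzero(as_tuple=False).squeeze(-1)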
This is the benchmark used:
import torch
import torch.profiler as prof
import time

from delayed_buffer import DelayBuffer

HISTORY_LENGTH = 10
BATCH_SIZE = 4096
DEVICE = "cuda:0"

if __name__ == "__main__":
    buffer1 = DelayBuffer(HISTORY_LENGTH + 1, BATCH_SIZE, DEVICE)
    buffer2 = DelayBuffer(HISTORY_LENGTH + 1, BATCH_SIZE, DEVICE)
    buffer3 = DelayBuffer(HISTORY_LENGTH + 1, BATCH_SIZE, DEVICE)
    data = torch.randn(BATCH_SIZE, 100).to(DEVICE)

    # Warm-up runs
    for _ in range(5):
        buffer1.compute(data)
        buffer2.compute(data)
        buffer3.compute(data)
    torch.cuda.synchronize()
    print("Warm-up done")

    with prof.profile(
        activities=[prof.ProfilerActivity.CPU, prof.ProfilerActivity.CUDA],
        record_shapes=True,
    ) as p:
        start_time = time.perf_counter()
        for i in range(100):
            buffer1.compute(data)
            buffer2.compute(data)
            buffer3.compute(data)
        end_time = time.perf_counter()

    print(f"Time taken: {(end_time - start_time) * 1000} ms")
    p.export_chrome_trace("trace.json")
    print(p.key_averages().table(sort_by="cuda_time_total", row_limit=20))
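Since the table above is sorted by CUDA time while most of the cost is CPU-side blocking (Self CPU total 305 ms vs. Self CUDA total 8.3 ms), sorting the same summary by self CPU time makes the aten::item / aten::_local_scalar_dense stalls stand out more clearly; a possible extra line at the end of the script:

print(p.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))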
Steps to reproduce
The standalone benchmark above is the minimal example: running it with the DelayBuffer implementation produces the profile shown in the description.
System Info
Describe the characteristic of your environment:
- Commit: [e.g. 8f3b9ca]
- Isaac Sim Version: [e.g. 5.0, this can be obtained by cat ${ISAACSIM_PATH}/VERSION]
- OS: [e.g. Ubuntu 22.04]
- GPU: [e.g. RTX 5090]
- CUDA: [e.g. 12.8]
- GPU Driver: [e.g. 553.05, this can be seen by using the nvidia-smi command.]
Additional context
Add any other context about the problem here.
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have checked that the issue is not in running Isaac Sim itself and is related to the repo
Acceptance Criteria
Add the criteria for which this task is considered done. If not known at issue creation time, you can add this once the issue is assigned.
- Criteria 1
- Criteria 2