Copied from flashinfer-ai#2187
flashinfer.testing.bench_gpu_time_with_cupti today supports flushing the L2 cache before each run, but bench_gpu_time_with_cuda_event does not.
As a result, cold-L2 microbenchmarks are only possible in environments where cupti-python is installed, which is not widespread today (it requires CUDA 13).
It would be helpful to implement a rotating buffer scheme, indexing input[round % N] and output[round % N], where N is chosen from the L2 cache size so that every round reads and writes buffers that are no longer resident in L2.
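A minimal sketch of the rotation logic described above. It is deliberately pure Python (no torch, no GPU) so the indexing and buffer-count math stand on their own; in an actual implementation the buffer pool would hold CUDA tensors and `run` would launch the kernel being timed with CUDA events. The names `num_rotating_buffers` and `bench_rounds` are hypothetical, not part of the flashinfer API.

```python
def num_rotating_buffers(l2_cache_bytes: int, buffer_bytes: int) -> int:
    """Choose N so that cycling through N buffers evicts each one from L2
    before it is reused: the pool's total footprint must exceed L2."""
    # +1 makes the working set strictly larger than the L2 cache;
    # at least 2 buffers are needed for any rotation at all.
    return max(2, l2_cache_bytes // buffer_bytes + 1)


def bench_rounds(run, inputs, outputs, n_rounds):
    """Invoke run(inp, out) n_rounds times, selecting buffers with
    round % N so that no buffer is touched twice within one L2's worth
    of traffic -- i.e. each round starts from a cold L2."""
    n = len(inputs)
    for r in range(n_rounds):
        run(inputs[r % n], outputs[r % n])


# Example sizing: a 50 MiB L2 (H100-class, an assumed figure) with
# 8 MiB per input buffer needs a pool of 7 rotating buffers.
n = num_rotating_buffers(50 * 1024 * 1024, 8 * 1024 * 1024)
```

Sizing N from the device's L2 capacity (e.g. `torch.cuda.get_device_properties(...)`, if available) rather than hard-coding it would keep the scheme portable across GPUs with different cache sizes.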