Description
Hi, I've been trying to use BenchmarkDotNet to profile the memory usage improvements in a new version of ComputeSharp, but I'm struggling to make sense of the reported memory usage. I'm wondering whether I might be doing something wrong, or whether there are some issues/caveats with [MemoryDiagnoser], as the reported memory allocations seem a bit off.
All the code and results below are from the investigation/bdn branch in the ComputeSharp repo, for reference.
Repro steps
- Add the CI nuget.config file for BenchmarkDotNet as explained in this comment
- Clone the repo, checkout to investigation/bdn
- Build ComputeSharp.Benchmark in Release, run the benchmark as usual with dotnet ComputeSharp.Benchmark
Details
Running that benchmark gives me the following:
Benchmark results:
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-TWEPVA(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 24.785 ms, StdErr = 0.005 ms (0.02%), N = 13, StdDev = 0.019 ms
Min = 24.739 ms, Q1 = 24.777 ms, Median = 24.789 ms, Q3 = 24.794 ms, Max = 24.820 ms
IQR = 0.018 ms, LowerFence = 24.750 ms, UpperFence = 24.821 ms
ConfidenceInterval = [24.763 ms; 24.808 ms] (CI 99.9%), Margin = 0.023 ms (0.09% of Mean)
Skewness = -0.63, Kurtosis = 3.75, MValue = 2
-------------------- Histogram --------------------
[24.728 ms ; 24.831 ms) | @@@@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Job-TWEPVA : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Toolchain=5.0
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.79 ms | 0.023 ms | 0.019 ms | - | - | - | 45 B |
I was confused about those 45 B of allocations (after the initial warmup, the benchmark should in theory do no allocations at all). So I ran the VS memory profiler to have a look (to reproduce, just uncomment that #define PROFILER in the main file of the benchmark project).
With that, I got the following:
VS reports no allocations at all while running the benchmark code in a loop, so I'm very confused about those 45 B reported by BDN. I know that the [MemoryDiagnoser] has a reported accuracy of 99.5%, but I figured the difference between no allocations at all and 45 bytes could be considered not within that threshold? 🤔
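As a rough cross-check that is independent of both tools, the per-invocation allocations could also be measured by hand. The following is only a sketch under a few assumptions: AllocationCheck and BytesPerInvocation are hypothetical names (not part of ComputeSharp or BenchmarkDotNet), the action passed in stands for one benchmark invocation, and GC.GetAllocatedBytesForCurrentThread only counts allocations made on the calling thread, so work done on other threads would not show up.

```csharp
using System;

public static class AllocationCheck
{
    // Hypothetical helper: average number of bytes allocated on the
    // current thread per call to `action`, measured over `iterations` calls.
    public static double BytesPerInvocation(Action action, int iterations)
    {
        // One untimed call first, so JIT compilation and one-time caches
        // are not attributed to the measured loop.
        action();

        long before = GC.GetAllocatedBytesForCurrentThread();

        for (int i = 0; i < iterations; i++)
        {
            action();
        }

        long after = GC.GetAllocatedBytesForCurrentThread();

        return (after - before) / (double)iterations;
    }

    public static void Main()
    {
        // Placeholder workload; in the real project this would be a call
        // to the benchmark method instead.
        int sink = 0;
        double perCall = BytesPerInvocation(() => sink++, 1000);

        Console.WriteLine($"{perCall} B per invocation");
    }
}
```

If a loop like this also reports zero bytes for the real benchmark body, that would suggest the 45 B come from somewhere other than the benchmarked code itself.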
Additional info
If you look at my [GlobalSetup] method (specifically, these lines), you will see I'm manually doing a whole bunch of warmup iterations and GC collections from there. Turns out that:
- If I remove those warmup iterations completely and let BenchmarkDotNet handle that, the benchmark results go completely off the rails, and I get 1 KB of reported memory allocations (?!).
- If I only do a single benchmark invocation as warmup (without that loop and without also calling GC.Collect), the reported allocations are more in line with the results above, but still a bit worse (58 B instead of 45 B).
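For readers who don't want to open the branch, the setup described above has roughly this shape. This is a simplified sketch and not the actual code from the repo: the class name, the warmup count, the buffer, and the RunOnce body are all placeholders standing in for the real GPU workload.

```csharp
using System;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
public class DnnBenchmarkSketch
{
    // Placeholder data; the real benchmark works with GPU buffers instead.
    private float[] buffer;

    [GlobalSetup]
    public void Setup()
    {
        buffer = new float[1024];

        // Manual warmup: the first invocation is the expensive one
        // (shader generation and compilation), later ones hit the cache.
        // The iteration count here is an arbitrary placeholder.
        for (int i = 0; i < 16; i++)
        {
            RunOnce();
        }

        // Force collections so that garbage from the warmup is gone
        // before BenchmarkDotNet starts measuring.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
    }

    [Benchmark]
    public void GpuWithNoTemporaryBuffers() => RunOnce();

    // Placeholder for the real workload; does not allocate.
    private void RunOnce()
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = buffer[i] * 0.5f + 1f;
        }
    }
}
```

Point 1 above corresponds to dropping the loop and the GC calls from Setup entirely, and point 2 to keeping a single RunOnce call with no GC calls.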
Benchmark results for point 1:
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-HQFLUQ(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 18.801 ms, StdErr = 0.034 ms (0.18%), N = 15, StdDev = 0.133 ms
Min = 18.618 ms, Q1 = 18.725 ms, Median = 18.746 ms, Q3 = 18.893 ms, Max = 19.014 ms
IQR = 0.168 ms, LowerFence = 18.472 ms, UpperFence = 19.145 ms
ConfidenceInterval = [18.658 ms; 18.943 ms] (CI 99.9%), Margin = 0.143 ms (0.76% of Mean)
Skewness = 0.38, Kurtosis = 1.69, MValue = 2
-------------------- Histogram --------------------
[18.547 ms ; 19.085 ms) | @@@@@@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Job-HQFLUQ : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Toolchain=5.0
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 18.80 ms | 0.143 ms | 0.133 ms | - | - | - | 1 KB |
Benchmark results for point 2:
// * Summary *
BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
[Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Job-OZIUOW : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Toolchain=5.0
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.82 ms | 0.068 ms | 0.063 ms | - | - | - | 58 B |
I should note that the initial run of the benchmark is particularly heavy, as the library needs to generate and compile a GPU shader. Further invocations use the cached data and are much faster (as in, going down from over 1 s to about 24 ms in this benchmark). So I guess running the benchmark at least once in the [GlobalSetup] helps BenchmarkDotNet ignore that first big outlier while running the actual benchmarks. What I still don't get, though:
- Why is the memory reporting apparently incorrect and not in line with VS' memory profiler? That is, even when I have [GlobalSetup] do a whole bunch of warmup iterations and GC collections, for good measure. What are those 45 B?
- Why does the reported memory usage seem influenced by what I do in [GlobalSetup]?
Thanks! 😄