[MemoryDiagnoser] inaccurate and influenced by [GlobalSetup] work #1599

Closed as not planned
@Sergio0694

Hi, I've been trying to use BenchmarkDotNet to profile the memory usage improvements in a new version of ComputeSharp, but I'm struggling to make sense of the reported numbers. I'm wondering whether I'm doing something wrong or whether there are some issues or caveats with [MemoryDiagnoser], as the reported memory allocations seem a bit off.

For reference, all the code and results below are from the investigation/bdn branch in the ComputeSharp repo.

Repro steps

  • Add the CI nuget.config file for BenchmarkDotNet as explained in this comment
  • Clone the repo and check out the investigation/bdn branch
  • Build ComputeSharp.Benchmark in Release, run the benchmark as usual with dotnet ComputeSharp.Benchmark

Details

Running that benchmark gives me the following:

Benchmark results:
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-TWEPVA(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 24.785 ms, StdErr = 0.005 ms (0.02%), N = 13, StdDev = 0.019 ms
Min = 24.739 ms, Q1 = 24.777 ms, Median = 24.789 ms, Q3 = 24.794 ms, Max = 24.820 ms
IQR = 0.018 ms, LowerFence = 24.750 ms, UpperFence = 24.821 ms
ConfidenceInterval = [24.763 ms; 24.808 ms] (CI 99.9%), Margin = 0.023 ms (0.09% of Mean)
Skewness = -0.63, Kurtosis = 3.75, MValue = 2
-------------------- Histogram --------------------
[24.728 ms ; 24.831 ms) | @@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-TWEPVA : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.79 ms | 0.023 ms | 0.019 ms |     - |     - |     - |      45 B |

I was confused about those 45 B of allocations (after the initial warmup, the benchmark should, in theory, allocate nothing). So I ran the VS memory profiler to have a look (just uncomment the #define PROFILER line in the main file of the benchmark project).
With that, I got the following:

[Image: VS memory profiler results]

To double-check, I also used dotMemory:

[Image: dotMemory results]

VS reports no allocations at all while running the benchmark code in a loop, so I'm very confused about those 45 B reported by BDN. I know that [MemoryDiagnoser] has a stated accuracy of 99.5%, but I figured the difference between no allocations at all and 45 bytes would not fall within that margin? 🤔
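
A rough way to cross-check the number by hand, independent of any profiler, is to read the runtime's own allocation counter around a loop. The sketch below is only an illustration: `RunWorkload` is a hypothetical stand-in for the real benchmark body, the iteration count is arbitrary, and it assumes .NET Core 3.0 or later, where `GC.GetTotalAllocatedBytes` is available. It is not a description of how BenchmarkDotNet measures allocations internally.

```csharp
using System;

public static class AllocationCheck
{
    // Hypothetical stand-in for the benchmark body; it reuses a preallocated buffer.
    private static readonly float[] Buffer = new float[1024];

    private static void RunWorkload()
    {
        for (int i = 0; i < Buffer.Length; i++)
        {
            Buffer[i] += 1f;
        }
    }

    public static void Main()
    {
        // Warm up once so one-time costs (JIT, caches) are paid before measuring.
        RunWorkload();

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        const int iterations = 1_000; // arbitrary count, chosen for illustration

        // precise: true asks the runtime for an exact, process-wide byte count.
        long before = GC.GetTotalAllocatedBytes(precise: true);

        for (int i = 0; i < iterations; i++)
        {
            RunWorkload();
        }

        long after = GC.GetTotalAllocatedBytes(precise: true);

        double perIteration = (after - before) / (double)iterations;
        Console.WriteLine($"Allocated per iteration: {perIteration:F2} B");
    }
}
```

Because the counter is process-wide, allocations made by background runtime threads during the loop would also show up in the result, so a small non-zero value does not by itself prove that the workload allocates.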

Additional info

If you look at my [GlobalSetup] method (specifically, these lines), you will see I'm manually doing a whole bunch of warmup iterations and GC collections from there. Turns out that:

  1. If I remove those warmup iterations completely and let BenchmarkDotNet handle that, the benchmark results go completely off the rails, and I get 1 KB of reported memory allocations (?!).
  2. If I only do a single benchmark invocation as warmup (without that loop and without calling GC.Collect), the reported allocations are more in line with the results above, but still a bit worse (58 B instead of 45 B).
Benchmark results for point 1:
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-HQFLUQ(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 18.801 ms, StdErr = 0.034 ms (0.18%), N = 15, StdDev = 0.133 ms
Min = 18.618 ms, Q1 = 18.725 ms, Median = 18.746 ms, Q3 = 18.893 ms, Max = 19.014 ms
IQR = 0.168 ms, LowerFence = 18.472 ms, UpperFence = 19.145 ms
ConfidenceInterval = [18.658 ms; 18.943 ms] (CI 99.9%), Margin = 0.143 ms (0.76% of Mean)
Skewness = 0.38, Kurtosis = 1.69, MValue = 2
-------------------- Histogram --------------------
[18.547 ms ; 19.085 ms) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-HQFLUQ : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 18.80 ms | 0.143 ms | 0.133 ms |     - |     - |     - |      1 KB |
Benchmark results for point 2:
// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  Job-OZIUOW : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

|                    Method |     Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.82 ms | 0.068 ms | 0.063 ms |     - |     - |     - |      58 B |
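
The three [GlobalSetup] variants compared above (warmup loop plus GC collections, a single warmup call, and no manual warmup) can be sketched roughly as follows. This is a simplified illustration and not the actual ComputeSharp benchmark: `RunWorkload`, the buffer, the iteration count of 16 and the exact GC call sequence are hypothetical stand-ins, and only the attributes and `BenchmarkRunner.Run` come from BenchmarkDotNet itself.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class GlobalSetupSketch
{
    // Preallocated so the benchmark body itself should not allocate.
    private readonly float[] buffer = new float[1024];

    // Hypothetical stand-in for the real GPU workload.
    private void RunWorkload()
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] += 1f;
        }
    }

    [GlobalSetup]
    public void Setup()
    {
        // Variant A (reported 45 B): manual warmup loop followed by GC collections.
        for (int i = 0; i < 16; i++) // iteration count is an assumption
        {
            RunWorkload();
        }

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // Variant B (reported 58 B): replace everything above with a single RunWorkload() call.
        // Variant C (reported 1 KB): leave this method empty and rely on BenchmarkDotNet's own warmup.
    }

    [Benchmark]
    public void GpuWithNoTemporaryBuffers() => RunWorkload();
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<GlobalSetupSketch>();
}
```

Switching between the variants only changes the body of `Setup`, so any difference in the Allocated column comes from what runs before the measured iterations and not from the benchmark method itself.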

I should note that the first run of the benchmark is particularly heavy, as the library needs to generate and compile a GPU shader. Further invocations use the cached data and are much faster (going down from over 1 s to about 24 ms in this benchmark). So I guess running the benchmark at least once in [GlobalSetup] helps BenchmarkDotNet ignore that first big outlier while running the actual benchmarks. What I still don't understand, though:

  • Why is the memory reporting apparently incorrect and not in line with the VS memory profiler, even when I have [GlobalSetup] do a whole bunch of warmup iterations and GC collections for good measure? What are those 45 B?
  • Why does the reported memory usage seem influenced by what I do in [GlobalSetup]?

Thanks! 😄
