[MemoryDiagnoser] inaccurate and influenced by [GlobalSetup] work

Hi, I've been trying to use `BenchmarkDotNet` to profile the memory usage improvements in a new version of [`ComputeSharp`](https://github.com/Sergio0694/ComputeSharp), but I'm struggling to make sense of the reported memory usage. I'm wondering whether I'm doing something wrong or whether there are some issues/caveats with `[MemoryDiagnoser]`, as the reported memory allocations seem a bit off.

All the code below and results are from the `investigation/bdn` branch in the `ComputeSharp` repo, for reference.

### Repro steps

- Add the CI `nuget.config` file for `BenchmarkDotNet` as explained in [this comment](https://github.com/dotnet/BenchmarkDotNet/issues/1535#issuecomment-694093258)
- Clone the repo and check out the `investigation/bdn` branch
- Build `ComputeSharp.Benchmark` in Release, run the benchmark as usual with `dotnet ComputeSharp.Benchmark`

### Details

Running that benchmark gives me the following:

<details>
 <summary>Benchmark results (click to expand):</summary>
 

```
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-TWEPVA(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 24.785 ms, StdErr = 0.005 ms (0.02%), N = 13, StdDev = 0.019 ms
Min = 24.739 ms, Q1 = 24.777 ms, Median = 24.789 ms, Q3 = 24.794 ms, Max = 24.820 ms
IQR = 0.018 ms, LowerFence = 24.750 ms, UpperFence = 24.821 ms
ConfidenceInterval = [24.763 ms; 24.808 ms] (CI 99.9%), Margin = 0.023 ms (0.09% of Mean)
Skewness = -0.63, Kurtosis = 3.75, MValue = 2
-------------------- Histogram --------------------
[24.728 ms ; 24.831 ms) | @@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
 [Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
 Job-TWEPVA : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.79 ms | 0.023 ms | 0.019 ms | - | - | - | 45 B |
```
</details>

I was confused about those 45 B of allocations (after the initial warmup, the benchmark should do no allocations, in theory). So I ran the VS memory profiler to have a look (just uncomment the `#define PROFILER` line in the main file of the benchmark project). With that, I got the following:

![image](https://user-images.githubusercontent.com/10199417/99664974-486d7e00-2a69-11eb-8cca-cb804d28764a.png)

<details>
 <summary>To double-check, also used dotMemory (click to expand):</summary>
 

![image](https://user-images.githubusercontent.com/10199417/99676971-7a86dc00-2a79-11eb-81aa-6ec9b9ca6209.png)
</details>

VS reports no allocations at all while running the benchmark code in a loop, so I'm very confused about those 45 B reported by BDN. I know that `[MemoryDiagnoser]` has a reported accuracy of 99.5%, but I figured the difference between no allocations at all and 45 bytes wouldn't fall within that margin? 🤔
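For anyone who wants a third data point besides BDN and the profilers, here is a minimal sketch of a manual cross-check that uses only `GC.GetAllocatedBytesForCurrentThread()` from the BCL. `AllocationProbe`, `MeasureAllocatedBytes` and the placeholder workload are hypothetical names for illustration, not code from the repo, and I'm not assuming this is how `[MemoryDiagnoser]` measures things internally. It also only counts allocations made on the calling thread.

```csharp
using System;

public static class AllocationProbe
{
    // Keeps the sample allocation reachable so the JIT cannot optimize it away.
    private static byte[] sink = Array.Empty<byte>();

    // Returns the bytes allocated on the current thread while running
    // `action` `iterations` times, after one untimed warmup invocation.
    public static long MeasureAllocatedBytes(Action action, int iterations)
    {
        // Run once up front so one-time costs (JIT, static init, caches) are excluded.
        action();

        // Settle the heap before taking the first reading.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        long before = GC.GetAllocatedBytesForCurrentThread();

        for (int i = 0; i < iterations; i++)
        {
            action();
        }

        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    public static void Main()
    {
        // Placeholder workload (assumption): swap in the real benchmark body here.
        const int iterations = 100;
        long total = MeasureAllocatedBytes(() => sink = new byte[1024], iterations);

        Console.WriteLine($"Total: {total} B, per invocation: {(double)total / iterations:F1} B");
    }
}
```

If the real benchmark body reports 0 B here but BDN still reports a non-zero value, that would suggest the extra bytes come from somewhere other than the benchmark method on that thread.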

### Additional info

If you look at my `[GlobalSetup]` method (specifically, [these lines](https://github.com/Sergio0694/ComputeSharp/blob/f307dd26d95be8da3b0d0dea427404ff9d7938ce/samples/ComputeSharp.Benchmark/DnnBenchmark.cs#L97-L111)), you will see I'm manually doing a whole bunch of warmup iterations and GC collections from there. Turns out that:

1. If I remove those warmup iterations completely and let `BenchmarkDotNet` handle that, the benchmark results go completely off the rails, and I get **1 KB** of reported memory allocations (?!).
2. If I only do a single benchmark invocation as warmup (without that loop and without also calling `GC.Collect`), the reported allocations are more in line with the results above, but still a bit worse (58 B instead of 45 B).

<details>
 <summary>Benchmark results for point 1. (click to expand):</summary>
 

```
// * Detailed results *
DnnBenchmark.GpuWithNoTemporaryBuffers: Job-HQFLUQ(Toolchain=5.0)
Runtime = .NET 5.0.0 (5.0.20.51904), X64 RyuJIT; GC = Concurrent Workstation
Mean = 18.801 ms, StdErr = 0.034 ms (0.18%), N = 15, StdDev = 0.133 ms
Min = 18.618 ms, Q1 = 18.725 ms, Median = 18.746 ms, Q3 = 18.893 ms, Max = 19.014 ms
IQR = 0.168 ms, LowerFence = 18.472 ms, UpperFence = 19.145 ms
ConfidenceInterval = [18.658 ms; 18.943 ms] (CI 99.9%), Margin = 0.143 ms (0.76% of Mean)
Skewness = 0.38, Kurtosis = 1.69, MValue = 2
-------------------- Histogram --------------------
[18.547 ms ; 19.085 ms) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
 [Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
 Job-HQFLUQ : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 18.80 ms | 0.143 ms | 0.133 ms | - | - | - | 1 KB |
```
</details>

<details>
 <summary>Benchmark results for point 2. (click to expand):</summary>
 

```
// * Summary *

BenchmarkDotNet=v0.12.1.1466-nightly, OS=Windows 10.0.19041.610 (2004/May2020Update/20H1)
AMD Ryzen 7 2700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=5.0.100
 [Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
 Job-OZIUOW : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Toolchain=5.0

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------------------------- |---------:|---------:|---------:|------:|------:|------:|----------:|
| GpuWithNoTemporaryBuffers | 24.82 ms | 0.068 ms | 0.063 ms | - | - | - | 58 B |
```
</details>

I should note that the initial run of the benchmark is particularly heavy, as the library needs to generate and compile a GPU shader. Further invocations use the cached data and are much faster (as in, going down from over 1 s to about 24 ms in this benchmark). So I guess running the benchmark at least once in `[GlobalSetup]` helps `BenchmarkDotNet` ignore that first big outlier while running the actual benchmarks. What I still don't get, though:

- Why is the memory reporting apparently incorrect and not in line with VS' memory profiler? That is, even when I have `[GlobalSetup]` do a whole bunch of warmup iterations and GC collections, for good measure. What are those 45 B?
- Why does the reported memory usage seem influenced by what I do in `[GlobalSetup]`?

Thanks! 😄
