Description
The Problem
The problem we are facing with our current setup is benchmarks that:
- have a flat distribution when run just once locally, because the memory is allocated just once and BenchmarkDotNet uses the same input for all the iterations in a given run.
- have a multimodal distribution when considering historical data from multiple runs, because the memory alignment changes over time and we don't control it. Example: today we are sorting an aligned array, tomorrow an unaligned one, and the reported times differ.
Goal
The goal of the study was to see how randomizing memory alignment could improve long-term benchmark stability. What might seem surprising is that we don't want a flat distribution when running locally; instead, we want the full picture across all possible scenarios (aligned and unaligned).
The idea comes from the Stabilizer project, which was suggested as a possible solution by @AndyAyersMS a few years ago when we started discussing this problem.
Implementation
The "randomization" has been implemented as a new, optional feature in BenchmarkDotNet in the following way:
- allocate a random-size stack memory at the beginning of every iteration (idea comes from @AndyAyersMS, code) and keep it alive for the iteration period
- between every iteration:
- call the
[GlobalCleanup]
method that should be disposing of all resources (code) - allocate a random-size small byte array (Gen 0 object) (idea comes from @jkotas, code)
- allocate a random-size large byte array (LOH object) (idea comes from @AndyAyersMS, code)
- call the
[GlobalSetup]
method that should be allocating all memory (while both arrays are kept alive) (code)
- call the
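A minimal sketch of this per-iteration flow, assuming a simplified `IBenchmarkCase` interface in place of BenchmarkDotNet's real engine (the interface name and size constants are illustrative assumptions, not the actual internals):

```csharp
using System;

// Illustrative interface standing in for a generated benchmark runner;
// BenchmarkDotNet's real engine is more involved.
public interface IBenchmarkCase
{
    void GlobalSetup();    // allocates all inputs
    void GlobalCleanup();  // disposes of all resources
    void RunWorkload();    // the measured benchmark body
}

public static class RandomizedRunner
{
    private static readonly Random Random = new Random();

    public static void RunIteration(IBenchmarkCase benchmark)
    {
        // Randomize stack alignment: a random-size stackalloc block
        // that stays alive for the whole iteration.
        Span<byte> stackMemory = stackalloc byte[Random.Next(1, 32) * 8];
        stackMemory[0] = 1; // touch it so it is not optimized away

        benchmark.GlobalCleanup(); // release the previous iteration's resources

        // Randomize heap alignment: a small Gen 0 array and a large LOH array
        // (>= 85,000 bytes), both kept alive while the setup reallocates the inputs.
        byte[] gen0 = new byte[Random.Next(32)];
        byte[] loh = new byte[85_000 + Random.Next(32) * 8];

        benchmark.GlobalSetup(); // the real inputs now land at shifted offsets

        benchmark.RunWorkload();

        // Keep the padding arrays alive until the workload has finished.
        GC.KeepAlive(gen0);
        GC.KeepAlive(loh);
    }
}
```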
All the changes required to introduce this feature in BenchmarkDotNet can be seen here.
To make the existing benchmarks work with the new feature, the following changes were required:
- implement missing [GlobalCleanup] methods: add missing cleanups #1586. So far [GlobalSetup] was called only once, and since every benchmark was running in a dedicated process and the OS was cleaning up all resources after process exit, the lack of proper resource cleanup was not a problem.
- move field initialization from constructors to [GlobalSetup] methods to make it possible to reallocate them: Refactor initialization logic to allow for enabling Memory Randomization #1587
- split big setups into smaller, dedicated ones that allocate as little as possible, to increase the chance of having randomized input (by allocating only what we need for the given benchmark right after the random-size arrays): Refactor initialization logic to allow for enabling Memory Randomization #1587
- fix some bugs that occurred when [GlobalSetup] methods were invoked more than once: Refactor initialization logic to allow for enabling Memory Randomization #1587
Methodology
The following methodology was used:
- run existing benchmarks with randomization disabled (default setting) using .NET 5 RTM.
- run existing benchmarks with randomization enabled using .NET 5 RTM and the same hardware and OS (a configuration sketch follows this list).
- use a modified version of ResultsComparer to search for benchmarks that meet the following criteria:
  - the performance has changed (improved or regressed). I've used a threshold of 5% and a 1 ns noise filter.
  - the distribution for randomized results is multimodal (this is the expected randomization effect)
- re-run the benchmarks reported by ResultsComparer and filter out the benchmarks that are simply unstable
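For reference, a hedged sketch of how such a randomization-enabled run can be configured (`WithMemoryRandomization` is the option added by the BenchmarkDotNet PR referenced later in this issue; the config class name is mine):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

// Applied to a benchmark class via [Config(typeof(RandomizationConfig))].
public class RandomizationConfig : ManualConfig
{
    public RandomizationConfig()
    {
        // Opt in to memory randomization for this job only.
        AddJob(Job.Default.WithMemoryRandomization(true));
    }
}
```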
Observations
Managed memory buffers
The Randomization has a very strong effect on benchmarks that use contiguous managed memory buffers like arrays or spans for input and perform some simple operations on them. Example:
(see performance/src/benchmarks/micro/libraries/System.Memory/Span.cs, lines 74 to 75 at 8aed638)
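Since the embedded snippet is not reproduced here, a hedged illustration of a benchmark of that shape (not the exact lines referenced above):

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class SpanExample
{
    private byte[] _array;

    [GlobalSetup]
    public void Setup() => _array = new byte[512]; // alignment differs per allocation

    // Scans a contiguous buffer; the measured time depends on how the
    // underlying array happens to be aligned.
    [Benchmark]
    public int IndexOfValue() => new Span<byte>(_array).IndexOf((byte)1);
}
```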
With Randomization disabled we are always getting a flat distribution:
-------------------- Histogram --------------------
[19.319 ns ; 19.739 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
---------------------------------------------------
With Randomization enabled, the distribution becomes multimodal:
-------------------- Histogram --------------------
[17.531 ns ; 19.318 ns) | @@@@@@@@@
[19.318 ns ; 20.833 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@
[20.833 ns ; 22.428 ns) | @
[22.428 ns ; 23.943 ns) |
[23.943 ns ; 25.662 ns) |
[25.662 ns ; 27.177 ns) | @@@
---------------------------------------------------
This better represents what we can see in the historical data:
The reporting system currently uses the average of all iterations in a given run to represent it as a single point on the chart. If we switch to the median, we should be able to flatten the charts. We can't display every iteration result as a separate point because we run every benchmark multiple times per day (each run gives us 15-20 iteration results) and existing charting libraries can't handle this amount of data: a few hundred data points per day over a period of a year (the current .NET release cycle).
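A toy illustration of why the median flattens such charts while the average does not (the numbers are made up to mimic an aligned/unaligned split):

```csharp
using System;
using System.Linq;

// Most iterations hit the fast (aligned) mode, a few hit the slow (unaligned) one.
double[] iterations = { 19.3, 19.4, 19.4, 19.5, 25.7, 25.8 }; // ns

double average = iterations.Average(); // 21.52 ns, pulled up by the slow mode
double median = iterations.OrderBy(x => x)
                          .ElementAt(iterations.Length / 2); // 19.5 ns (upper median)

Console.WriteLine($"avg={average:F2} ns, median={median:F2} ns");
```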
The more inputs are used, the bigger the difference. The best example is copying one array to another:
Randomization disabled:
-------------------- Histogram --------------------
[87.431 ns ; 89.140 ns) | @@@@@@@@@@@@@
---------------------------------------------------
Randomization enabled:
-------------------- Histogram --------------------
[ 79.009 ns ; 183.239 ns) | @@@@@@@@@@
[183.239 ns ; 287.047 ns) | @
[287.047 ns ; 391.277 ns) | @@@@@@@@
[391.277 ns ; 501.821 ns) | @
---------------------------------------------------
We have a few modes here: none|all arrays aligned and source|destination only aligned.
Stack memory
Allocating random-size stack memory affects CPU-bound benchmarks that don't use any managed memory as an input. An example:
Before this change, a single run with BenchmarkDotNet would always produce a flat distribution:
-------------------- Histogram --------------------
[17.139 ns ; 17.937 ns) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
---------------------------------------------------
But with randomization enabled, the distribution is not flat anymore:
-------------------- Histogram --------------------
[17.313 ns ; 17.859 ns) | @@@@@@@@@@@@@@@@@@@@@@
[17.859 ns ; 18.315 ns) | @@@@@@@@@@
[18.315 ns ; 19.027 ns) | @@@
---------------------------------------------------
And it is much closer to what we can see in the historical data:
The benchmark seems to be stable from a high-level perspective:
But it has a 0.5-1 ns variance when we zoom in:
Not a Silver Bullet
There are some unstable benchmarks, like System.Collections.CreateAddAndClear.Stack, where randomizing memory does not help. We still get a flat distribution, but the reported time changes slightly (most probably because the memory is not perfectly warmed up, or because the benchmark is dependent on code alignment):
Before:
-------------------- Histogram --------------------
[1.846 us ; 1.960 us) | @@@@@@@@@@@@@@@
---------------------------------------------------
After:
-------------------- Histogram --------------------
[2.087 us ; 2.204 us) | @@@@@@@@@@@@@@@
---------------------------------------------------
BenchmarkDotNet
For a given distribution:
-------------------- Histogram --------------------
[ 79.009 ns ; 183.239 ns) | @@@@@@@@@@
[183.239 ns ; 287.047 ns) | @
[287.047 ns ; 391.277 ns) | @@@@@@@@
[391.277 ns ; 501.821 ns) | @
---------------------------------------------------
The [391.277 ns ; 501.821 ns) bucket would typically be recognized by BenchmarkDotNet as an upper outlier and removed. This is why, for benchmarks with randomization enabled, the outlier removal is disabled.
The performance repo is configured to run up to 20 iterations. To get a full picture, we need to run more iterations for benchmarks with randomization enabled.
Moreover, the default BenchmarkDotNet heuristic stops benchmarking when the results are stable or when it has executed 100 iterations. This is why this setting should not be enabled in BenchmarkDotNet by default.
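A hedged sketch of the combination of settings implied here, expressed with BenchmarkDotNet's public Job API (the iteration count is illustrative, not the repo's exact value):

```csharp
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Mathematics;

Job job = Job.Default
    .WithMemoryRandomization(true)            // reallocate inputs at random offsets
    .WithOutlierMode(OutlierMode.DontRemove)  // keep the "slow" alignment buckets
    .WithMaxIterationCount(100);              // more iterations to cover more alignments
```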
Summary
Randomization is not a silver bullet that solves all our problems, and it should not be enabled by default.
It can help a lot with unstable benchmarks that use contiguous managed memory buffers like arrays or spans for input and perform some simple operations on them. Examples:
System.Collections.CopyTo<*>.Array|Span|List
System.Memory.Span<*>.IndexOfAnyThreeValues|SequenceEqual|StartsWith|EndsWith|LastIndexOfValue
System.Collections.Contains<*>.Array|Span|List
System.Memory.SequenceReader.TryReadTo
System.IO.Tests.Perf_FileStream.ReadByte
So far, CPU-bound benchmarks with a 0-1 ns variance between runs have not been unstable enough to cause problems for the auto-filing bot, and I believe that we should not enable this feature for them until it becomes a problem.
Proposed order of actions
@DrewScoggins @AndyAyersMS @kunalspathak @tannergooding if you agree with my findings, then these are the steps that need to be taken to enable the randomization (I can do all of them except the last one):
- merge the PR that adds missing [GlobalCleanup] methods: add missing cleanups #1586
- review and merge the PR that simplifies existing [GlobalSetup] methods and moves initialization logic from constructors and field initializers to the setups: Refactor initialization logic to allow for enabling Memory Randomization #1587
- review and merge the PR that adds Randomization as a feature to BenchmarkDotNet: Memory Randomization BenchmarkDotNet#1587
- refresh, review, and merge the PR that allows for applying BDN attributes per method (not just per class): Allow for Config per method, introduce OS and OSArchitecture filters BenchmarkDotNet#1097 (a usage sketch follows this list)
- enable Randomization for Array- and Span-based benchmarks (only for those where it helps) and update the BDN version
- change the Reporting System to use Medians instead of Averages
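Once per-method configs are possible, the opt-in could look roughly like this (a sketch; the [MemoryRandomization] attribute and the benchmark shape are assumptions based on the PRs above):

```csharp
using BenchmarkDotNet.Attributes;

public class CopyTo
{
    private int[] _source, _destination;

    [GlobalSetup]
    public void Setup()
    {
        _source = new int[2048];
        _destination = new int[2048];
    }

    [Benchmark]
    [MemoryRandomization] // assumed per-method opt-in, instead of enabling it class-wide
    public void Array() => System.Array.Copy(_source, _destination, _source.Length);
}
```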
cc @AndreyAkinshin @billwert @Lxiamail @jeffhandley @danmosemsft