# De-Abstraction
In .NET 10 we hope to further enhance the JIT's ability to remove abstraction overhead from code.
## Stack Allocation Improvements
See #104936
During .NET 10 we would like to implement 2-3 enhancements to stack allocation of ref class instances. Priority may be given to issues that enable array de-abstraction (see below).
- JIT: Extend escape analysis to account for arrays with non-gcref elements #104906
- JIT: initial support for stack allocating arrays of GC type #112250
- JIT: boost inlining when callee unboxes an arg #110596
- JIT: generalize ParseArrayAddress a bit more #112527
## Delegate GDV Improvements
We currently apply guarded devirtualization (GDV) only to instance delegates. We'd like to extend support to static delegates.
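To illustrate the two delegate shapes involved, here is a hypothetical sketch (names and methods are invented for illustration): an instance delegate captures a target object along with the method, while a static delegate has no target object.

```csharp
using System;

class DelegateGdvExample
{
    int _bias = 1;

    int AddBias(int x) => x + _bias;        // instance method
    static int Double(int x) => x * 2;      // static method

    public static int Run()
    {
        var ex = new DelegateGdvExample();
        Func<int, int> instDel = ex.AddBias;   // instance delegate: target object + method
        Func<int, int> statDel = Double;       // static delegate: no target object

        // At invoke sites like these, the JIT can guess the delegate's likely
        // target under a runtime guard (GDV) and inline it. Today that guess
        // is made only for instance delegates; static delegates are not yet
        // handled.
        return instDel(20) + statDel(10);      // 21 + 20 = 41
    }
}
```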
## PGO Improvements
We currently lose a bit of performance in R2R compiled methods with Tiered PGO, because the instrumented version of the method doesn't collect profile data for inlinees.
## Inlining Improvements
We'd like to enable inlining of methods with EH (exception handling).
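As a hypothetical illustration, a method like the following is small and otherwise a good inline candidate, but the try/finally in its body currently blocks the JIT from inlining it into callers:

```csharp
using System;

static class EhInlineExample
{
    // The try/finally (EH) in the body currently prevents this method
    // from being inlined, even though it is otherwise tiny.
    static int ReadWithCleanup(int[] data, int i)
    {
        try
        {
            return data[i];
        }
        finally
        {
            Console.WriteLine("cleanup");   // stand-in for real cleanup work
        }
    }

    public static int Caller(int[] data) => ReadWithCleanup(data, 0);
}
```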
## Array Enumeration De-Abstraction

### Completed Work
- JIT: Support for devirtualizing array interface methods #108153
- unblock cloning of loops where the header is a try begin #108604
- JIT: very simple cloning heuristic #108771
- JIT: Propagate LCL_ADDRs into handlers #109182
- JIT: Propagate LCL_ADDRs into natural loops #109190
- JIT: enable devirtualization/inlining of other array interface methods #109209
- JIT: empty array enumerator opt #109237
- unblock inlining of generics with static fields #109256
- Jit: Conditional Escape Analysis and Cloning #111473
### Todo
- Fix cases where de-abstraction analysis is blocked by chained GDV. One option was to simply inhibit chaining for GDVs where the local being tested is an enumerator var, but in cases where there are back-to-back identical GDVs this often prevents us from realizing the second GDV is redundant (jump threading is likely not powerful enough to clean things up). So a better fix is to try to tolerate chained GDV code. See examples below.
- enable full set of loop opts for enumerator loops (when enumerator is stack allocated & promoted)
- For enumerable types whose empty enumerator has the "wrong" type (like `List<T>`), when PGO shows that the empty case is the common case, GDV will guess for the empty enumerator type, but the allocation will be of a different type, and so conditional escape analysis won't kick in. If there's a choice between an allocated type and some other type in an enumerator GDV, consider guessing for the allocated type. See example below.
- Along those lines, consider generalizing the empty-collection optimization repair so we get full optimization in cases like the above, or at least also special-case `List<T>`. Note this repair happens after stack allocation, since we only want to do it if we stack allocate, so it won't fix the problem above.
## Background
[a more complete writeup is now available here]
The goal of this work is to eliminate, to the best of our ability, the abstraction penalty for cases where an `IEnumerable<T>` is iterated via `foreach`, and the underlying collection is an array (or is very likely to be an array).
In previous releases we've built a number of optimizations that can reduce abstraction overhead. But there is still a lot of room for improvement, especially in cases like the above, where the abstraction pattern involves several abstract objects acting in concert.
## What is the Abstraction Penalty?
Consider the following pair of benchmark methods that both sum up an integer array:
```csharp
static readonly int[] s_ro_array;

public int foreach_static_readonly_array()
{
    int sum = 0;
    foreach (int i in s_ro_array) sum += i;
    return sum;
}

public int foreach_static_readonly_array_via_interface()
{
    IEnumerable<int> e = s_ro_array;
    int sum = 0;
    foreach (int i in e) sum += i;
    return sum;
}
```
These two methods do the exact same computation, yet benchmarking shows the second method takes 4.5x as long as the first (with 512 element arrays, using very early .NET 10 bits incorporating #108604 and #108153):
| Method | Mean | Ratio |
|---|---|---|
| foreach_static_readonly_array | 147.69 ns | 1.00 |
| foreach_static_readonly_array_via_interface | 665.28 ns | 4.50 |
This sort of overhead from an abstract presentation of computation is commonly known as the abstraction penalty.
Note things used to be far worse; .NET 6's ratio here is 12.6.
| Method | Runtime | Mean | Allocated | Ratio |
|---|---|---|---|---|
| foreach_static_readonly_array | .NET 10.0 | 149.5 ns | - | 1.00 |
| foreach_static_readonly_array_via_interface | .NET 10.0 | 665.1 ns | - | 4.45 |
| foreach_static_readonly_array_via_interface | .NET 9.0 | 830.2 ns | 32 B | 5.55 |
| foreach_static_readonly_array_via_interface | .NET 8.0 | 951.5 ns | 32 B | 6.36 |
| foreach_static_readonly_array_via_interface | .NET 6.0 | 1,896.7 ns | 32 B | 12.69 |
## Why is there an abstraction penalty?
The IL generated for `foreach_static_readonly_array_via_interface` is expressed in the shape of the abstract enumeration pattern: first `e.GetEnumerator()` is called on the abstract collection to produce an abstract enumerator, and then the loop iterates via `MoveNext()` and `get_Current()` interface calls on this enumerator, all wrapped in a try/finally to properly dispose the enumerator should an exception arise.
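In rough C# terms, the interface-based `foreach` expands into something like the following (a simplified sketch of the standard pattern; the actual IL differs in detail):

```csharp
IEnumerable<int> e = s_ro_array;
int sum = 0;
IEnumerator<int> en = e.GetEnumerator();   // interface call; allocates an enumerator
try
{
    while (en.MoveNext())                  // interface call per iteration
    {
        int i = en.Current;                // interface call per iteration
        sum += i;
    }
}
finally
{
    en.Dispose();                          // enumerator must be disposed even on exception
}
```

Every step that is a single load or add in the direct-array version is an interface call here, which is what the optimizations below aim to see through.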
Seeing through all this to the actual simple computation going on in the loop requires a surprising amount of optimization machinery. In past releases we've built many of the necessary pieces, and now it's time to get them all working together to remove the remaining overhead.
In particular we need to leverage:
- Tiered compilation, and in particular dynamic PGO
- Guarded Devirtualization
- Object Stack Allocation
- Loop Cloning
- Physical Promotion
More generally the JIT will need to rely on PGO to determine the (likely) underlying type for the collection.
## Why focus on Arrays?
Arrays are the most common and also the simplest collection type. Assuming all goes well, we may try to stretch the optimization to cover `List<T>`.
## What needs to be done?
When the collection is an array, the enumerator is an instance of the ref class `SZGenericArrayEnumerator<T>`. Thanks to #108153 we can devirtualize (under guard) and inline the enumerator constructor, and devirtualize and inline calls on the enumerator. And in some cases we can even stack allocate the enumerator (note that in the table above, .NET 10 no longer has allocations for the benchmarks).
Current inner loop codegen:
```asm
;; foreach_static_readonly_array
align [8 bytes for IG03]
;; size=32 bbWeight=1.00 PerfScore 3.24
G_M1640_IG03: ;; offset=0x0020
add eax, dword ptr [rcx]
add rcx, 4
dec edx
jne SHORT G_M1640_IG03

;; foreach_static_readonly_array_via_interface (rdx is the enumerator object)
;; both enumerator and array access do bounds checks, even though array size is "known"
G_M36467_IG03: ;; offset=0x0053
mov r8, gword ptr [rdx+0x10]
cmp ecx, dword ptr [r8+0x08]
jae SHORT G_M36467_IG09
mov ecx, ecx
add eax, dword ptr [r8+4*rcx+0x10]
;; size=17 bbWeight=513.82 PerfScore 4752.82
G_M36467_IG04: ;; offset=0x0064
mov ecx, dword ptr [rdx+0x08]
inc ecx
mov ebx, dword ptr [rdx+0x0C]
cmp ecx, ebx
jae SHORT G_M36467_IG06
;; size=12 bbWeight=514.82 PerfScore 2831.50
G_M36467_IG05: ;; offset=0x0070
mov dword ptr [rdx+0x08], ecx
mov ecx, dword ptr [rdx+0x08]
cmp ecx, dword ptr [rdx+0x0C]
jb SHORT G_M36467_IG03
```
However, we cannot yet fully optimize the enumeration loop:
- We may fail to prove the enumerator can't escape (and so can't stack allocate)
- Even if we can prove the enumerator can't escape, we may fail to stack allocate (the allocation site may be in a loop)
- Even if we stack allocate the enumerator, we may think it is address exposed (and so fail to promote). There are several sub-problems here:
  - If we're able to learn the enumerator type without GDV (as in the examples above), we run into the complication that the `SZGenericArrayEnumerator` constructor has an optimization for empty arrays: instead of constructing a new enumerator instance, it returns a static instance. So at the enumerator use sites there is some ambiguity about which object is enumerating. For cases where the array length is known this ambiguity gets resolved, but too late in the phase order.
  - If we rely on GDV, then there are three reaching definitions, and (as far as we know) even the unknown collection type could produce an instance of `SZGenericArrayEnumerator`, so all three definitions can reach through the enumerator GDV tests (see JIT: Support for devirtualizing array interface methods #108153 (comment) for a picture and more notes). And we may get confused by the try/finally or try/fault, which will also contain a reference to the enumerator (for GDV).
While these may seem like small problems, the solutions are not obvious. Either we need to disentangle the code paths for each possibility early (basically do an early round of cloning, not just of the enumeration loop but of all the code from the enumerator creation sites to the last use of the enumerator, and possibly the EH regions), or we need to make our escape and address-propagation logic flow-sensitive and contextual and introduce runtime disambiguation for the reaching values (that is, at each enumerator use site, test whether the enumerator is the just-allocated `SZGenericArrayEnumerator` instance that we hope to stack allocate and promote).
The cloning route seems more viable, but it is costly to duplicate all that code and we'll have to do it well in advance of knowing whether the rest of the optimizations pay off. So we might need to be able to undo it if there's no real benefit.