LOH increases its size until OOM #109774

Open

@arekpalinski

Description

I'm from the RavenDB team and would like to share a report about potentially peculiar behavior of managed memory allocations. In particular, it's about the size, utilization, and fragmentation of the Large Object Heap.

We observe the following issue with one of the servers in RavenDB Cloud. I assume it must be specific to the type of queries this instance is handling, as it's not a typical pattern we see elsewhere. The server is crashing due to OOM; at that time we see high managed memory allocation (yellow on the graph below):

[Image: memory usage graph; managed memory allocations shown in yellow]

Nov 05 18:19:32 vmf8905577d5 systemd[1]: ravendb.service: A process of this unit has been killed by the OOM killer.
Nov 05 18:19:35 vmf8905577d5 systemd[1]: ravendb.service: Main process exited, code=killed, status=9/KILL
...
Nov 05 18:19:36 vmf8905577d5 systemd[1]: ravendb.service: Failed with result 'oom-kill'.

The server has a steady load of ~100-150 requests/sec, 24/7. We can see the same pattern of managed allocations. We use the following GC settings:

    "configProperties": {
      "System.GC.Concurrent": true,
      "System.GC.Server": true,
      "System.GC.RetainVM": true,

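For completeness, these configProperties live under runtimeOptions in the application's runtimeconfig.json; a minimal sketch with just the three settings listed above (the rest of our configuration is omitted here):

    {
      "runtimeOptions": {
        "configProperties": {
          "System.GC.Concurrent": true,
          "System.GC.Server": true,
          "System.GC.RetainVM": true
        }
      }
    }
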
Our first step was to check what takes up that space, so we collected a dotMemory dump for analysis of the managed memory when it was over 25 GB. We found that most of the memory is taken by the LOH, although out of the 24 GB it occupied, only 136 MB was actually used:

[Image: dotMemory snapshot showing the LOH occupying ~24 GB with only 136 MB used]

When looking at the LOH details in dotMemory, we can see that most of the LOH consists of empty areas (no objects allocated, 0% Utilization, 0% Fragmentation):

[Image: dotMemory LOH details, mostly empty areas with 0% Utilization and 0% Fragmentation]

We do have a mechanism that forces LOH compaction from time to time during a background GC (once it gets bigger than 25% of physical memory):

https://github.com/ravendb/ravendb/blob/80ba404673f502f2db71269ecba1a135e56e7c95/src/Sparrow/LowMemory/LowMemoryNotification.cs#L150-L154

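(For context, the general pattern for forcing LOH compaction on the next blocking gen2 collection looks roughly like the sketch below; this is an illustration only, not the exact code behind the link above.)

    // Illustrative sketch, not RavenDB's implementation:
    using System;
    using System.Runtime;

    static class LohCompactionSketch
    {
        public static void CompactLargeObjectHeap()
        {
            // Ask the GC to compact the LOH during the next blocking gen2 collection...
            GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;

            // ...and trigger that collection explicitly.
            GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
        }
    }
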

As a next step, we tried experimenting with some GC settings, collecting GC info with dotnet-trace, and checking the info provided by GC.GetGCMemoryInfo().
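
As a reference for the numbers quoted later in this report, here is a minimal sketch of how the LOH figures can be read from GC.GetGCMemoryInfo() (generation index 3 of GenerationInfo is the LOH; the "...Humane" fields in the excerpts below are just these values formatted by our own tooling):

    using System;

    static class LohStatsSketch
    {
        public static void PrintLohInfo()
        {
            GCMemoryInfo info = GC.GetGCMemoryInfo();

            // GenerationInfo indices: 0-2 = gen0..gen2, 3 = LOH, 4 = POH
            GCGenerationInfo loh = info.GenerationInfo[3];

            Console.WriteLine($"LOH SizeBefore:          {loh.SizeBeforeBytes / 1024.0 / 1024:F2} MB");
            Console.WriteLine($"LOH SizeAfter:           {loh.SizeAfterBytes / 1024.0 / 1024:F2} MB");
            Console.WriteLine($"LOH FragmentationBefore: {loh.FragmentationBeforeBytes / 1024.0 / 1024:F2} MB");
            Console.WriteLine($"LOH FragmentationAfter:  {loh.FragmentationAfterBytes / 1024.0 / 1024:F2} MB");
        }
    }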

We found that with the settings DOTNET_GCConserveMemory=8 (so the LOH is compacted if it has too much fragmentation) and "System.GC.RetainVM": false, the system behaves stably and we no longer experience the oom-killer. The memory usage graph looks as follows for the last couple of days:

[Image: memory usage graph for the last couple of days with DOTNET_GCConserveMemory=8 and RetainVM disabled]

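For the record, DOTNET_GCConserveMemory=8 is an environment variable, so it can be applied through the service environment (for example via a systemd drop-in; the path below is illustrative), while "System.GC.RetainVM": false simply replaces the true value in the configProperties shown earlier:

    # /etc/systemd/system/ravendb.service.d/gc.conf (illustrative location)
    [Service]
    Environment=DOTNET_GCConserveMemory=8
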

We're still wondering, though, whether the LOH sizes we see there are expected. We don't see an LOH of this size anywhere else. We do know that allocations on the LOH are mostly due to heavy usage of arrays by the Lucene search engine (https://github.com/ravendb/lucenenet).
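
(For context on why those arrays end up on the LOH: objects of 85,000 bytes or more are allocated there. The snippet below is purely illustrative, not Lucene code.)

    using System;

    static class LohAllocationSketch
    {
        public static void Demo()
        {
            // Arrays of 85,000 bytes or more go straight to the LOH.
            var buffer = new byte[85_000];

            // LOH objects are reported as part of generation 2.
            Console.WriteLine(GC.GetGeneration(buffer)); // prints 2
        }
    }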

We have also noticed that when the size of the LOH is very big, the fragmentation reported for it is small (based on the info returned by GC.GetGCMemoryInfo() or from the .nettrace file). For example, we could see 0% fragmentation of the LOH (the induced GCs were forced intentionally during the dotnet-trace recording):

[Image: GC stats from the trace showing 0% LOH fragmentation during the induced GCs]

But shortly after that, we collected a dotMemory snapshot where, same as above, we see an almost empty LOH:

[Image: dotMemory snapshot showing an almost empty LOH]


We also have another dotnet-trace session where we started to collect GC info (--profile gc-verbose) when the LOH was:

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "56.23 MBytes",
"FragmentationBeforeHumane": "56.23 MBytes",
"SizeAfterHumane": "24.44 GBytes",
"SizeBeforeHumane": "24.44 GBytes"

after 1h 20m it was:

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "50.44 MBytes",
"FragmentationBeforeHumane": "50.44 MBytes",
"SizeAfterHumane": "28.99 GBytes",
"SizeBeforeHumane": "28.99 GBytes"

and eventually the size got reduced to ~10 GB (with ~9 GB of fragmentation):

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "9.05 GBytes",
"FragmentationBeforeHumane": "9.05 GBytes",
"SizeAfterHumane": "10.25 GBytes",
"SizeBeforeHumane": "10.25 GBytes"

In PerfView it's seen as:

[Image: PerfView view of the same GC data]

Then we collected a dotMemory snapshot again, and the LOH got reduced to 3.14 GB (due to the GC forced by dotMemory, so the LOH got compacted, as I understand it):

[Image: dotMemory snapshot showing the LOH reduced to 3.14 GB]

We can provide the .nettrace and dotMemory snapshot files privately if that would be helpful.

Reproduction Steps

It's reproduced only on a specific instance. We cannot provide access, but we're able to collect debug assets.

Expected behavior

The LOH does not reach a size of more than 20 GB of memory.

Actual behavior

The LOH reaches more than 20 GB of memory, while its fragmentation and utilization are very low.

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response
