LOH increases its size until OOM #109774

Open

@arekpalinski

Description

I'm from the RavenDB team and would like to share a report about potentially peculiar behavior of managed memory allocations. In particular, it's about the size, utilization, and fragmentation of the Large Object Heap.

We observe the following issue with one of the servers in RavenDB Cloud. I assume it must be specific to the type of queries this instance is handling, as it's not a typical pattern we see elsewhere. The server is crashing due to OOM; at that time we see high managed memory allocation (yellow on the graph below):

[Image: memory usage graph; managed memory allocations shown in yellow]

Nov 05 18:19:32 vmf8905577d5 systemd[1]: ravendb.service: A process of this unit has been killed by the OOM killer.
Nov 05 18:19:35 vmf8905577d5 systemd[1]: ravendb.service: Main process exited, code=killed, status=9/KILL
...
Nov 05 18:19:36 vmf8905577d5 systemd[1]: ravendb.service: Failed with result 'oom-kill'.

The server has a steady load of ~100-150 requests/sec, 24/7. We can see the same pattern of managed allocations. We use the following GC settings:

    "configProperties": {
      "System.GC.Concurrent": true,
      "System.GC.Server": true,
      "System.GC.RetainVM": true,

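For completeness, these configProperties live under runtimeOptions in the application's runtimeconfig.json; a minimal sketch with just the three settings listed above (the rest of our configuration is omitted here):

    {
      "runtimeOptions": {
        "configProperties": {
          "System.GC.Concurrent": true,
          "System.GC.Server": true,
          "System.GC.RetainVM": true
        }
      }
    }
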
Our first step was to check what takes up that space, so we collected a dotMemory dump for analysis of the managed memory when it was over 25 GB. We found that most of the memory is taken by the LOH, although out of the 24 GB it occupied, only 136 MB was actually used:

[Image: dotMemory snapshot showing the LOH occupying ~24 GB with only 136 MB used]

When looking at the LOH details in dotMemory, we can see that most of the LOH consists of empty areas (no objects allocated, 0% Utilization, 0% Fragmentation):

[Image: dotMemory LOH details, mostly empty areas with 0% Utilization and 0% Fragmentation]

We do have a mechanism that forces LOH compaction from time to time during a background GC (once it gets bigger than 25% of physical memory):

https://github.com/ravendb/ravendb/blob/80ba404673f502f2db71269ecba1a135e56e7c95/src/Sparrow/LowMemory/LowMemoryNotification.cs#L150-L154

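(For context, the general pattern for forcing LOH compaction on the next blocking gen2 collection looks roughly like the sketch below; this is an illustration only, not the exact code behind the link above.)

    // Illustrative sketch, not RavenDB's implementation:
    using System;
    using System.Runtime;

    static class LohCompactionSketch
    {
        public static void CompactLargeObjectHeap()
        {
            // Ask the GC to compact the LOH during the next blocking gen2 collection...
            GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;

            // ...and trigger that collection explicitly.
            GC.Collect(2, GCCollectionMode.Forced, blocking: true, compacting: true);
        }
    }
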

As a next step, we tried experimenting with some GC settings, collecting GC info with dotnet-trace, and checking the info provided by GC.GetGCMemoryInfo().
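
As a reference for the numbers quoted later in this report, here is a minimal sketch of how the LOH figures can be read from GC.GetGCMemoryInfo() (generation index 3 of GenerationInfo is the LOH; the "...Humane" fields in the excerpts below are just these values formatted by our own tooling):

    using System;

    static class LohStatsSketch
    {
        public static void PrintLohInfo()
        {
            GCMemoryInfo info = GC.GetGCMemoryInfo();

            // GenerationInfo indices: 0-2 = gen0..gen2, 3 = LOH, 4 = POH
            GCGenerationInfo loh = info.GenerationInfo[3];

            Console.WriteLine($"LOH SizeBefore:          {loh.SizeBeforeBytes / 1024.0 / 1024:F2} MB");
            Console.WriteLine($"LOH SizeAfter:           {loh.SizeAfterBytes / 1024.0 / 1024:F2} MB");
            Console.WriteLine($"LOH FragmentationBefore: {loh.FragmentationBeforeBytes / 1024.0 / 1024:F2} MB");
            Console.WriteLine($"LOH FragmentationAfter:  {loh.FragmentationAfterBytes / 1024.0 / 1024:F2} MB");
        }
    }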

We found that with the settings DOTNET_GCConserveMemory=8 (so the LOH is compacted if it has too much fragmentation) and "System.GC.RetainVM": false, the system behaves stably and we no longer experience the oom-killer. The memory usage graph looks as follows for the last couple of days:

[Image: memory usage graph for the last couple of days with DOTNET_GCConserveMemory=8 and RetainVM disabled]

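For the record, DOTNET_GCConserveMemory=8 is an environment variable, so it can be applied through the service environment (for example via a systemd drop-in; the path below is illustrative), while "System.GC.RetainVM": false simply replaces the true value in the configProperties shown earlier:

    # /etc/systemd/system/ravendb.service.d/gc.conf (illustrative location)
    [Service]
    Environment=DOTNET_GCConserveMemory=8
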

We're still wondering, though, whether the LOH sizes we see there are expected. We don't see an LOH of this size anywhere else. We do know that allocations on the LOH are mostly due to heavy usage of arrays by the Lucene search engine (https://github.com/ravendb/lucenenet).
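
(For context on why those arrays end up on the LOH: objects of 85,000 bytes or more are allocated there. The snippet below is purely illustrative, not Lucene code.)

    using System;

    static class LohAllocationSketch
    {
        public static void Demo()
        {
            // Arrays of 85,000 bytes or more go straight to the LOH.
            var buffer = new byte[85_000];

            // LOH objects are reported as part of generation 2.
            Console.WriteLine(GC.GetGeneration(buffer)); // prints 2
        }
    }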

We have also noticed that when the size of the LOH is very big, the fragmentation reported for it is small (based on the info returned by GC.GetGCMemoryInfo() or from the .nettrace file). For example, we could see 0% fragmentation of the LOH (the induced GCs were forced intentionally during the dotnet-trace recording):

[Image: GC stats from the trace showing 0% LOH fragmentation during the induced GCs]

But shortly after that, we collected a dotMemory snapshot where, same as above, we see an almost empty LOH:

[Image: dotMemory snapshot showing an almost empty LOH]


We also have another dotnet-trace session where we started to collect GC info (--profile gc-verbose) when the LOH was:

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "56.23 MBytes",
"FragmentationBeforeHumane": "56.23 MBytes",
"SizeAfterHumane": "24.44 GBytes",
"SizeBeforeHumane": "24.44 GBytes"

after 1h 20m it was:

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "50.44 MBytes",
"FragmentationBeforeHumane": "50.44 MBytes",
"SizeAfterHumane": "28.99 GBytes",
"SizeBeforeHumane": "28.99 GBytes"

and eventually the size got reduced to ~10 GB (with ~9 GB of fragmentation):

"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "9.05 GBytes",
"FragmentationBeforeHumane": "9.05 GBytes",
"SizeAfterHumane": "10.25 GBytes",
"SizeBeforeHumane": "10.25 GBytes"

In PerfView it's seen as:

[Image: PerfView view of the same GC data]

Then we collected a dotMemory snapshot again, and the LOH got reduced to 3.14 GB (due to the GC forced by dotMemory, so the LOH got compacted, as I understand it):

[Image: dotMemory snapshot showing the LOH reduced to 3.14 GB]

We can provide the .nettrace and dotMemory snapshot files privately if that would be helpful.

Reproduction Steps

It's reproduced only on a specific instance. We cannot provide access, but we're able to collect debug assets.

Expected behavior

The LOH does not reach a size of more than 20 GB of memory.

Actual behavior

The LOH reaches more than 20 GB of memory, while its fragmentation and utilization are very low.

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response
