Description
I'm from the RavenDB team and would like to share a report about potentially peculiar behavior of managed memory allocations, in particular the size, utilization, and fragmentation of the Large Object Heap (LOH).
We observe the following issue on one of the servers in RavenDB Cloud. I assume it must be specific to the type of queries this instance is handling, as it's not a typical pattern that we see elsewhere. The server is crashing due to OOM; at that time we see high managed memory allocation (yellow on the graph below):
```
Nov 05 18:19:32 vmf8905577d5 systemd[1]: ravendb.service: A process of this unit has been killed by the OOM killer.
Nov 05 18:19:35 vmf8905577d5 systemd[1]: ravendb.service: Main process exited, code=killed, status=9/KILL
...
Nov 05 18:19:36 vmf8905577d5 systemd[1]: ravendb.service: Failed with result 'oom-kill'.
```
The server has a steady load of ~100-150 requests/sec, 24/7, and we see the same pattern of managed allocations every time. We use the following GC settings:
"configProperties": {
"System.GC.Concurrent": true,
"System.GC.Server": true,
"System.GC.RetainVM": true,
Our first step was to check what takes up that space, so we collected a dotMemory dump for analysis of managed memory when it was over 25 GB. We found there that most of the memory is taken by the LOH, although out of the 24 GB it occupied, only 136 MB was actually used:
When looking at the LOH details in dotMemory, we can see that most of the LOH consists of empty areas (no objects allocated, 0% Utilization, 0% Fragmentation).
We do have a mechanism that forces LOH compaction from time to time during background GC (once the LOH gets bigger than 25% of physical memory).
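In essence the forced compaction boils down to the standard CompactOnce pattern (a simplified sketch, not our exact code):

```csharp
using System;
using System.Runtime;

// The LOH is normally swept but not compacted. CompactOnce makes the next
// full blocking GC compact it, after which the mode resets to Default.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true, compacting: true);
```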
As a next step, we tried to experiment with some GC settings, collect GC info using dotnet-trace, and check the info provided by GC.GetGCMemoryInfo().
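For reference, this is roughly how we read the per-generation numbers quoted below (a minimal sketch; in GCMemoryInfo.GenerationInfo the LOH is generation index 3):

```csharp
using System;

GCMemoryInfo info = GC.GetGCMemoryInfo();

// GenerationInfo holds per-generation data for the last GC:
// gen0, gen1, gen2, LOH (index 3), POH (index 4).
GCGenerationInfo loh = info.GenerationInfo[3];

double mb = 1024 * 1024;
Console.WriteLine($"LOH size after last GC:          {loh.SizeAfterBytes / mb:N2} MB");
Console.WriteLine($"LOH fragmentation after last GC: {loh.FragmentationAfterBytes / mb:N2} MB");
Console.WriteLine($"LOH fragmentation ratio:         {(double)loh.FragmentationAfterBytes / loh.SizeAfterBytes:P1}");
```

The traces themselves were collected along the lines of `dotnet-trace collect -p <pid> --profile gc-verbose`.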
We found that with DOTNET_GCConserveMemory=8 (so the LOH is compacted if it has too much fragmentation) and "System.GC.RetainVM": false, the system behaves stably. We no longer experience the oom-killer.
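For reference, one way to set the variable for a systemd-managed service like ours (a sketch; the drop-in path is illustrative):

```ini
# e.g. /etc/systemd/system/ravendb.service.d/gc.conf (illustrative path)
[Service]
Environment=DOTNET_GCConserveMemory=8
```

With those two changes, the memory usage graph looks as follows for the last couple of days: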
Still, we're wondering whether the LOH sizes we see there are expected. We don't see a LOH of this size anywhere else. We do know that the LOH allocations are mostly due to heavy usage of arrays by the Lucene search engine (https://github.com/ravendb/lucenenet).
We have also noticed that when the size of the LOH is very big, the fragmentation reported for it is small (based on the info returned by GC.GetGCMemoryInfo() or from a .nettrace file). For example, we could see 0% fragmentation of the LOH (the Induced GCs were forced intentionally during the dotnet-trace recording):
But shortly after that we collected a dotMemory snapshot where, same as above, we see an almost empty LOH.
We also have another dotnet-trace session (--profile gc-verbose) where we started to collect GC info when the LOH was:
"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "56.23 MBytes",
"FragmentationBeforeHumane": "56.23 MBytes",
"SizeAfterHumane": "24.44 GBytes",
"SizeBeforeHumane": "24.44 GBytes"
After 1h 20m it was:
"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "50.44 MBytes",
"FragmentationBeforeHumane": "50.44 MBytes",
"SizeAfterHumane": "28.99 GBytes",
"SizeBeforeHumane": "28.99 GBytes"
Eventually the size got reduced to 10 GB, of which ~9 GB (≈88%) was fragmentation:
"GenerationName": "Large Object Heap",
"FragmentationAfterHumane": "9.05 GBytes",
"FragmentationBeforeHumane": "9.05 GBytes",
"SizeAfterHumane": "10.25 GBytes",
"SizeBeforeHumane": "10.25 GBytes"
In PerfView it's seen as:
Then we collected a dotMemory snapshot again and the LOH got reduced to 3.14 GB (due to the GC forced by dotMemory, which compacted the LOH, as I understand):
We can provide the .nettrace and dotMemory snapshot files privately if that would be helpful.
Reproduction Steps
It's reproduced only on a specific instance. We cannot provide access, but we're able to collect debug assets.
Expected behavior
The LOH does not reach a size of >20 GB.
Actual behavior
The LOH reaches >20 GB of memory, while its reported fragmentation and utilization are very low.
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response