Description
Problem statement
Currently, when throughput is quite high relative to the available resources, snapshots can run into OOM even when force snapshotting is set. The problem can be reproduced locally, and the cause is that the sort/dedupe step requires a significant amount of memory, roughly 40% on top of the memory already used by the data being snapshotted (see the flamegraph below, marked as (2)). When forcing a snapshot at 70% memory usage (for example), it needs roughly an additional 40% of that 70% to run a successful snapshot.
The flamegraph was produced at a write throughput of 1.63 MB/s, with the server running with 1 GiB max memory, 2 CPU cores, and all the defaults (70% threshold for forcing a snapshot). It OOM'd after the first forced snapshot; the sort/dedupe step needed ~285 MiB to snapshot ~700 MiB of data.
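As a rough back-of-the-envelope check of those numbers (assuming the sort/dedupe overhead scales with the amount of data being snapshotted, which is what the flamegraph suggests):

$$
\frac{285\ \text{MiB}}{700\ \text{MiB}} \approx 0.41,
\qquad
700\ \text{MiB} + 285\ \text{MiB} \approx 985\ \text{MiB} \approx 96\%\ \text{of}\ 1\ \text{GiB},
$$

which leaves almost no headroom for incoming writes and the rest of the process, hence the OOM.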
Proposed Solution
To avoid running into OOMs, we could snapshot more often (by lowering the WAL flush interval or the snapshot size), but that would mean evicting data out of the query buffer, which is not ideal. The current view is that, when snapshotting (see the sketch after this list), we should:
- check the available memory and set an upper bound on how much the sort/dedupe step can use.
- break the rows down into batches that fit within that upper bound.
- run each batch through sort/dedupe, producing smaller parquet files.
- avoid running this step in parallel when forcing a snapshot, as we're already short of memory.
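A minimal sketch of that flow, assuming hypothetical helpers: `available_memory_bytes`, `estimated_row_size`, `sort_dedupe_and_persist`, and the 50% budget fraction are placeholders for illustration, not actual InfluxDB 3 APIs or defaults.

```rust
// Hypothetical sketch of the proposed snapshot batching; the types, helper
// functions, and the memory fraction below are placeholders to illustrate
// the idea, not real InfluxDB 3 APIs or defaults.

struct Row {
    payload: Vec<u8>, // stand-in for the actual buffered row contents
}

/// Placeholder: memory still available to the process (a real implementation
/// would consult the memory pool / configured limits).
fn available_memory_bytes() -> usize {
    256 * 1024 * 1024
}

/// Placeholder per-row size estimate.
fn estimated_row_size(row: &Row) -> usize {
    row.payload.len()
}

/// Placeholder: sort + dedupe one batch and persist it as a (smaller) parquet file.
fn sort_dedupe_and_persist(batch: &[Row]) {
    println!("persisting batch of {} rows", batch.len());
}

/// Fraction of the available memory the sort/dedupe step may use (assumption).
const SORT_DEDUPE_MEM_FRACTION: f64 = 0.5;

/// Snapshot buffered rows without letting sort/dedupe exceed a memory upper bound.
fn snapshot(rows: Vec<Row>, forced: bool) {
    // 1. Check available memory and derive an upper bound for the sort/dedupe step.
    let upper_bound =
        (available_memory_bytes() as f64 * SORT_DEDUPE_MEM_FRACTION) as usize;

    // 2. Break the rows into batches whose estimated size stays within the bound.
    let mut batches: Vec<Vec<Row>> = vec![Vec::new()];
    let mut current_size = 0usize;
    for row in rows {
        let size = estimated_row_size(&row);
        if current_size + size > upper_bound && !batches.last().unwrap().is_empty() {
            batches.push(Vec::new());
            current_size = 0;
        }
        current_size += size;
        batches.last_mut().unwrap().push(row);
    }

    // 3./4. Sort/dedupe each batch into its own, smaller parquet file. A forced
    //       snapshot processes batches sequentially because memory is already
    //       scarce; otherwise batches may run in parallel.
    if forced {
        for batch in &batches {
            sort_dedupe_and_persist(batch);
        }
    } else {
        std::thread::scope(|s| {
            for batch in &batches {
                s.spawn(move || sort_dedupe_and_persist(batch));
            }
        });
    }
}

fn main() {
    let rows: Vec<Row> = (0..1_000)
        .map(|_| Row { payload: vec![0u8; 1024] })
        .collect();
    snapshot(rows, true);
}
```

The key point of the sketch is that the batch boundaries are derived from the memory actually available at snapshot time, so a forced snapshot under pressure degrades into several smaller, sequential sort/dedupe passes instead of one large one.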
There is still a possibility of overwhelming the system and running into OOM if the memory profile changes while the sort/dedupe step is running, for example because of a big increase in throughput. However, that would impact the system as a whole and is not specific to the snapshotting process. In that case it is up to the user to decide whether to increase the resources to accommodate the throughput.
Alternatives considered
- Running with lower WAL flush intervals and/or snapshot size so that snapshots are taken more often, thus avoiding the buildup to OOM. This would mean evicting data out of the queryable buffer, which is not ideal.
- Triggering forced snapshots at a lower memory percentage (moving the default to 50% or even 30%, for example, for higher throughputs). This again means evicting data out of the queryable buffer earlier.
Both of the approaches above trigger snapshots earlier and evict data out of the buffer, and they could potentially produce bigger parquet files than the split files the proposed solution writes when a snapshot is forced. However, they will consistently produce smaller files than the proposed solution does in the normal case, since the proposed solution only falls back to smaller files when forcing a snapshot (or, more precisely, when there isn't enough memory).
Additional context
The perf team has run into this issue, as have others, and the usual workaround for the resource consumption has been to lower the force-snapshot percentage to 50% or 30%.