Feature request: auto flush active memtable when there are many tombstones #13308
Description
If there are lots of deletes for recently written data that is still in the memtable, RocksDB can max out CPU usage when many prefix iterators have to iterate over all the tombstones:
Notice that we don't have tombstones in the SST files, but we do have many tombstones in the active memtable.
This was the root cause of #13191 (comment)
Ideally we want something like CompactOnDeletionCollector but for memtables.
Something very basic is already implemented for range tombstones: https://github.com/facebook/rocksdb/blob/v9.10.0/db/memtable.h#L835-L837 triggers an automatic flush based on the number of range tombstones. We could replicate that for the number of "regular" tombstones, but ideally we would implement semantics similar to CompactOnDeletionCollector and look at the overall ratio of deleted to live keys as well as runs of consecutive tombstones.
@cbi42 what do you think? This seems like a very useful feature, and I'm happy to implement at least the basic version, similar to https://github.com/facebook/rocksdb/blob/v9.10.0/db/memtable.h#L835-L837, if you think my analysis is valid and no other feature in RocksDB already does what I want.
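To make the proposed semantics concrete, here is a rough, standalone sketch of the check a memtable could run on each insert. It is not RocksDB code: the class name `MemtableDeletionTracker`, its parameters, and the `min_entries_for_ratio` guard are all hypothetical. It assumes the CompactOnDeletionCollector-style rule of "at least `deletion_trigger` tombstones within any `window_size` consecutive entries, or an overall tombstone ratio of at least `deletion_ratio`", and it uses an exact sliding window for clarity rather than any bucketed approximation.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical helper, NOT part of RocksDB: decides whether the active
// memtable has accumulated enough tombstones to justify a flush.
class MemtableDeletionTracker {
 public:
  // window_size / deletion_trigger: flush once `deletion_trigger` tombstones
  //   are seen within any `window_size` consecutive entries (0 disables).
  // deletion_ratio: flush once tombstones / total entries reaches this value
  //   (0 disables). min_entries_for_ratio is an assumed guard so that a
  //   nearly empty memtable does not trigger on its first delete.
  MemtableDeletionTracker(std::size_t window_size, std::size_t deletion_trigger,
                          double deletion_ratio,
                          std::uint64_t min_entries_for_ratio)
      : window_size_(window_size),
        deletion_trigger_(deletion_trigger),
        deletion_ratio_(deletion_ratio),
        min_entries_for_ratio_(min_entries_for_ratio) {}

  // Record one entry added to the memtable; is_delete marks a tombstone.
  void Add(bool is_delete) {
    ++total_entries_;
    if (is_delete) ++total_deletes_;
    if (window_size_ == 0 || deletion_trigger_ == 0) return;

    window_.push_back(is_delete);
    if (is_delete) ++deletes_in_window_;
    if (window_.size() > window_size_) {
      // Evict the oldest entry so the window covers the last N entries only.
      if (window_.front()) --deletes_in_window_;
      window_.pop_front();
    }
    // Sticky: once a dense run of tombstones is seen, keep asking for a flush.
    if (deletes_in_window_ >= deletion_trigger_) window_triggered_ = true;
  }

  bool ShouldFlush() const {
    if (window_triggered_) return true;
    if (deletion_ratio_ <= 0.0 || total_entries_ == 0) return false;
    if (total_entries_ < min_entries_for_ratio_) return false;
    return static_cast<double>(total_deletes_) /
               static_cast<double>(total_entries_) >=
           deletion_ratio_;
  }

 private:
  const std::size_t window_size_;
  const std::size_t deletion_trigger_;
  const double deletion_ratio_;
  const std::uint64_t min_entries_for_ratio_;

  std::deque<bool> window_;  // last `window_size_` entries, true = tombstone
  std::size_t deletes_in_window_ = 0;
  bool window_triggered_ = false;
  std::uint64_t total_entries_ = 0;
  std::uint64_t total_deletes_ = 0;
};
```

In a real implementation the result of `ShouldFlush()` would presumably feed the same flush-scheduling path that the range tombstone limit uses today; how that hooks into the memtable is left open here.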
The alternative approach:
If we don't want to bother with this, then in a background thread on an interval (every minute or so) I can read rocksdb.num-deletes-active-mem-table and compare it against rocksdb.num-entries-active-mem-table to calculate the overall deletion/total ratio, similar to what CompactOnDeletionCollector does for SST files, and trigger a manual flush if the ratio is past the threshold. I believe this should work, but it feels odd to do something like this based on RocksDB metrics. Thoughts?
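For reference, the workaround could look roughly like the sketch below. It is written against two callbacks instead of the RocksDB headers so it stays self-contained; the function names, `ratio_threshold`, and `min_entries` are illustrative assumptions, and only the two property names come from the description above.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Callbacks standing in for the DB handle (hypothetical wiring, see below).
using GetIntPropertyFn =
    std::function<bool(const std::string& name, std::uint64_t* value)>;
using FlushFn = std::function<bool()>;

// True when tombstones make up at least `ratio_threshold` of the entries.
// `min_entries` is an assumed guard against flushing tiny memtables.
bool TombstoneRatioExceeded(std::uint64_t num_deletes,
                            std::uint64_t num_entries, double ratio_threshold,
                            std::uint64_t min_entries) {
  if (num_entries == 0 || num_entries < min_entries) return false;
  return static_cast<double>(num_deletes) /
             static_cast<double>(num_entries) >=
         ratio_threshold;
}

// One polling step: read both properties and flush if the ratio is too high.
// Returns true only if a flush was requested and reported success.
bool MaybeFlushActiveMemtable(const GetIntPropertyFn& get_int_property,
                              const FlushFn& flush, double ratio_threshold,
                              std::uint64_t min_entries) {
  std::uint64_t num_deletes = 0;
  std::uint64_t num_entries = 0;
  if (!get_int_property("rocksdb.num-deletes-active-mem-table",
                        &num_deletes) ||
      !get_int_property("rocksdb.num-entries-active-mem-table",
                        &num_entries)) {
    return false;  // Property unavailable: do nothing this round.
  }
  if (!TombstoneRatioExceeded(num_deletes, num_entries, ratio_threshold,
                              min_entries)) {
    return false;
  }
  return flush();
}
```

With a real DB handle, `get_int_property` would wrap `db->GetIntProperty(name, value)` and `flush` would wrap `db->Flush(rocksdb::FlushOptions()).ok()`, with the background thread calling `MaybeFlushActiveMemtable` on each interval. The two property reads are not atomic with respect to writes, so the ratio is only approximate.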