Releases · facebook/rocksdb
RocksDB 8.3.2
8.3.2 (2023-06-14)
Bug Fixes
- Reduced cases of illegally using Env::Default() during static destruction by never destroying the internal PosixEnv itself (except for builds checking for memory leaks). (#11538)
8.3.1 (2023-06-07)
Performance Improvements
- Fixed unnecessarily high read QPS during DB::Open() when reading files created prior to #11406, especially when reading many small files (size < 52 MB) during DB::Open() and a partitioned filter or index is used.
8.3.0 (2023-05-19)
New Features
- Introduced a new option `block_protection_bytes_per_key`, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287).
- Added `JemallocAllocatorOptions::num_arenas`. Setting `num_arenas > 1` may mitigate mutex contention in the allocator, particularly in scenarios where block allocations commonly bypass jemalloc tcache.
- Improve the operational safety of publishing a DB or SST files to many hosts by using different block cache hash seeds on different hosts. The exact behavior is controlled by the new option `ShardedCacheOptions::hash_seed`, which also documents the solved problem in more detail.
- Introduced a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows FIFO compaction to compact files to different temperatures based on key age (#11428). A configuration sketch follows this list.
- Added a new ticker stat to count how many times RocksDB detected a corruption while verifying a block checksum: `BLOCK_CHECKSUM_MISMATCH_COUNT`.
- New statistic `rocksdb.file.read.db.open.micros` that measures read time of block-based SST tables or blob files during db open.
- New statistics tickers for various iterator seek behaviors and relevant filtering, as `*_LEVEL_SEEK_*`. (#11460)
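Below is a minimal configuration sketch for two of the options above, assuming RocksDB 8.3 headers; the `{temperature, age-in-seconds}` aggregate shape of `FileTemperatureAge` and all field values are assumptions to verify against your build.

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Sketch (RocksDB >= 8.3 assumed): enable per key-value block protection
// and FIFO temperature-age thresholds. Values are arbitrary examples.
int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Per key-value integrity protection for in-memory blocks in block cache.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_protection_bytes_per_key = 8;  // 0 = off; 1, 2, 4, or 8
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // FIFO compaction: compact files older than ~1 day to kWarm.
  options.compaction_style = rocksdb::kCompactionStyleFIFO;
  options.compaction_options_fifo.file_temperature_age_thresholds = {
      {rocksdb::Temperature::kWarm, /*age in seconds*/ 24 * 60 * 60}};

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_8_3_demo", &db);
  delete db;
  return s.ok() ? 0 : 1;
}
```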
Public API Changes
- EXPERIMENTAL: Added a new API `DB::ClipColumnFamily` to clip the keys in a CF to a certain range. It physically deletes all keys outside the range, including tombstones.
- Added `MakeSharedCache()` construction functions to the various cache options objects, and deprecated the `NewWhateverCache()` functions with long parameter lists. A sketch follows this list.
- Changed the meaning of various Bloom filter stats (prefix vs. whole key), with iterator-related filtering only being tracked in the new `*_LEVEL_SEEK_*` stats. (#11460)
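For illustration, a sketch of the new construction style, combining `MakeSharedCache()` with the per-host `ShardedCacheOptions::hash_seed` option from the feature list above; the option values are arbitrary examples.

```cpp
#include <memory>
#include "rocksdb/cache.h"

// Sketch (RocksDB >= 8.3 assumed): build a block cache from an options
// object instead of the deprecated long-parameter NewLRUCache() overloads.
std::shared_ptr<rocksdb::Cache> MakeBlockCache() {
  rocksdb::LRUCacheOptions opts;
  opts.capacity = 512 << 20;  // 512 MiB, arbitrary
  opts.num_shard_bits = 6;
  // Use different hash seeds on hosts sharing published SST files; the
  // value here is an arbitrary per-host example.
  opts.hash_seed = 42;
  return opts.MakeSharedCache();
}
```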
Behavior changes
- For x86, CPU features are no longer detected at runtime nor in build scripts, but in source code using common preprocessor defines. This will likely unlock some small performance improvements on some newer hardware, but could hurt performance of the kCRC32c checksum, which is no longer the default, on some "portable" builds. See PR #11419 for details.
Bug Fixes
- Delete an empty WAL file on DB open if the log number is less than the min log number to keep
- Delete temp OPTIONS file on DB open if there is a failure to write it out or rename it
Performance Improvements
- Improved the I/O efficiency of prefetching SST metadata by recording more information in the DB manifest. Opening files written with previous versions will still rely on heuristics for how much to prefetch (#11406).
RocksDB 8.1.1
8.1.1 (2023-04-06)
Bug Fixes
- In the DB::VerifyFileChecksums API, ensure that file system reads of SST files are equal to the readahead_size in ReadOptions, if specified. Previously, each read was 2x the readahead_size.
8.1.0 (2023-03-18)
Behavior changes
- Compaction output file cutting logic now considers range tombstone start keys. For example, the SST partitioner may now receive a PartitionRequest for range tombstone start keys.
- If the async_io ReadOption is specified for MultiGet or NewIterator on a platform that doesn't support IO uring, the option is ignored and synchronous IO is used.
Bug Fixes
- Fixed an issue for backward iteration when user defined timestamp is enabled in combination with BlobDB.
- Fixed a couple of cases where a Merge operand encountered during iteration wasn't reflected in the `internal_merge_count` PerfContext counter.
- Fixed a bug in CreateColumnFamilyWithImport()/ExportColumnFamily() which did not support range tombstones (#11252).
- Fixed a bug where a column family excluded from an atomic flush contained unflushed data that should have been included in the atomic flush (i.e., data with seqno less than the max seqno of the atomic flush), leading to potential data loss in the excluded column family when `WriteOptions::disableWAL == true` (#11148).
New Features
- Add statistics rocksdb.secondary.cache.filter.hits, rocksdb.secondary.cache.index.hits, and rocksdb.secondary.cache.data.hits
- Added a new PerfContext counter `internal_merge_point_lookup_count` which tracks the number of Merge operands applied while serving point lookup queries.
- Add new statistics rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit}
- Add support for SecondaryCache with HyperClockCache (`HyperClockCacheOptions` inherits the `secondary_cache` option from `ShardedCacheOptions`)
- Add new db properties `rocksdb.cf-write-stall-stats`, `rocksdb.db-write-stall-stats` and APIs to examine them in a structured way. In particular, users of `GetMapProperty()` with property `kCFWriteStallStats`/`kDBWriteStallStats` can now use the functions in `WriteStallStatsMapKeys` to find stats in the map. A sketch follows this list.
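A sketch of examining the structured write stall stats, assuming RocksDB 8.1 and an already-open `db`; here the whole map is simply dumped rather than queried through the `WriteStallStatsMapKeys` helpers.

```cpp
#include <iostream>
#include <map>
#include <string>
#include "rocksdb/db.h"

// Sketch (RocksDB >= 8.1 assumed): dump the per-CF write stall stats map.
// WriteStallStatsMapKeys provides helpers for looking up specific
// cause/condition keys; iterating the whole map is shown for simplicity.
void DumpWriteStallStats(rocksdb::DB* db) {
  std::map<std::string, std::string> stats;
  if (db->GetMapProperty(rocksdb::DB::Properties::kCFWriteStallStats,
                         &stats)) {
    for (const auto& [key, value] : stats) {
      std::cout << key << " = " << value << "\n";
    }
  }
}
```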
Public API Changes
- Changed various functions and features in `Cache` that are mostly relevant to custom implementations or wrappers. In particular, asynchronous lookup functionality has moved from `Lookup()` to a new `StartAsyncLookup()` function.
RocksDB 7.10.2
7.10.2 (2023-02-10)
Bug Fixes
- Fixed a bug in DB open/recovery from a compressed WAL caused by incorrect handling of certain record fragments with the same offset within a WAL block.
7.10.1 (2023-02-01)
Bug Fixes
- Fixed a data race on `ColumnFamilyData::flush_reason` caused by concurrent flushes.
- Fixed `DisableManualCompaction()` and `CompactRangeOptions::canceled` to cancel compactions even when they are waiting on conflicting compactions to finish.
- Fixed a bug in which a successful `GetMergeOperands()` could transiently return `Status::MergeInProgress()`.
- Return the correct error (Status::NotSupported()) to MultiGet callers when the ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was returned when the actual failure was lack of async IO support.
7.10.0 (2023-01-23)
Behavior changes
- Make best-efforts recovery verify SST unique ID before Version construction (#10962)
- Introduced `epoch_number` and sort L0 files by `epoch_number` instead of `largest_seqno`. `epoch_number` represents the order in which a file was flushed or ingested/imported. A compaction output file is assigned the minimum `epoch_number` among its input files'. For L0, a larger `epoch_number` indicates a newer L0 file.
Bug Fixes
- Fixed a regression in iterator where range tombstones after `iterate_upper_bound` are processed.
- Fixed a memory leak in MultiGet with the async_io read option, caused by IO errors during table file open.
- Fixed a bug where multi-level FIFO compaction deletes one file in non-L0 even when `CompactionOptionsFIFO::max_table_files_size` is not exceeded, present since #10348 or 7.8.0.
- Fixed a bug caused by `DB::SyncWAL()` affecting `track_and_verify_wals_in_manifest`. Without the fix, applications may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#10892).
- Fixed a BackupEngine bug in which RestoreDBFromLatestBackup would fail if the latest backup was deleted and another valid backup was available.
- Fixed L0 file misorder corruption caused by ingesting files whose seqnos overlap with memtable entries', through the introduction of `epoch_number`. Before the fix, `force_consistency_checks=true` may catch the corruption before it is exposed to readers, in which case writes returning `Status::Corruption` would be expected. Also replaced the previous incomplete fix (#5958) for the same corruption with this new and more complete fix.
- Fixed a bug in LockWAL() leading to re-locking the mutex (#11020).
- Fixed a heap use-after-free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.
- Fixed a heap use-after-free in async scan prefetching when dictionary compression is enabled, in which case a sync read of the compression dictionary gets mixed with async prefetching.
- Fixed a data race bug where `CompactRange()` under `change_level=true` acts on a range overlapping an ongoing file ingestion for level compaction. This either results in overlapping file range corruption at a certain level, caught by `force_consistency_checks=true`, or potentially two identical keys, both with seqno 0, in two different levels (i.e., new data ends up in a lower/older level). The latter is caught by an assertion in debug builds but goes silent and results in reads returning wrong results in release builds. This fix is general, so it also replaces previous fixes to a similar problem for `CompactFiles()` (#4665), general `CompactRange()`, and auto compaction (commits 5c64fb6 and 87dfc1d).
- Fixed a bug in compaction output cutting where small output files were produced because TTL file-cutting states were not being updated (#11075).
New Features
- When an SstPartitionerFactory is configured, CompactRange() now automatically selects for compaction any files overlapping a partition boundary that is in the compaction range, even if no actual entries are in the requested compaction range. With this feature, manual compaction can be used to (re-)establish SST partition points when SstPartitioner changes, without a full compaction.
- Add a BackupEngine feature to exclude files from a backup that are known to be backed up elsewhere, using `CreateBackupOptions::exclude_files_callback`. To restore the DB, the excluded files must be provided in alternative backup directories using `RestoreOptions::alternate_dirs`.
Public API Changes
- Substantial changes have been made to the Cache class to support internal development goals. Direct use of Cache class members is discouraged and further breaking modifications are expected in the future. SecondaryCache has some related changes and implementations will need to be updated. (Unlike Cache, SecondaryCache is still intended to support user implementations, and disruptive changes will be avoided.) (#10975)
- Add `MergeOperationOutput::op_failure_scope` for merge operator users to control the blast radius of merge operator failures. Existing merge operator users do not need to make any changes to preserve the old behavior.
RocksDB 8.0.0
8.0.0 (2023-02-19)
Behavior changes
- `ReadOptions::verify_checksums=false` disables checksum verification for more reads of non-`CacheEntryRole::kDataBlock` blocks.
- In the case of scans with async_io enabled, if posix doesn't support IO uring, a Status::NotSupported error is returned to the user. Previously that error was swallowed and reads switched to synchronous reads.
Bug Fixes
- Fixed a data race on `ColumnFamilyData::flush_reason` caused by concurrent flushes.
- Fixed an issue in `Get` and `MultiGet` when user-defined timestamps are enabled in combination with BlobDB.
- Fixed some atypical behaviors for `LockWAL()` such as allowing concurrent/recursive use and not expecting `UnlockWAL()` after a non-OK result. See API comments.
- Fixed a feature interaction bug where for blobs `GetEntity` would expose the blob reference instead of the blob value.
- Fixed `DisableManualCompaction()` and `CompactRangeOptions::canceled` to cancel compactions even when they are waiting on conflicting compactions to finish.
- Fixed a bug in which a successful `GetMergeOperands()` could transiently return `Status::MergeInProgress()`.
- Return the correct error (Status::NotSupported()) to MultiGet callers when the ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was returned when the actual failure was lack of async IO support.
- Fixed a bug in DB open/recovery from a compressed WAL caused by incorrect handling of certain record fragments with the same offset within a WAL block.
Feature Removal
- Remove RocksDB Lite.
- The feature block_cache_compressed is removed. Statistics related to it are removed too.
- Remove deprecated Env::LoadEnv(). Use Env::CreateFromString() instead.
- Remove deprecated FileSystem::Load(). Use FileSystem::CreateFromString() instead.
- Removed the deprecated version of these utility functions and the corresponding Java bindings: `LoadOptionsFromFile`, `LoadLatestOptions`, `CheckOptionsCompatibility`.
- Remove the FactoryFunc from the LoadObject method in the Customizable helper methods.
Public API Changes
- Moved rarely-needed Cache class definition to new advanced_cache.h, and added a CacheWrapper class to advanced_cache.h. Minor changes to SimCache API definitions.
- Completely removed the following deprecated/obsolete statistics: the tickers `BLOCK_CACHE_INDEX_BYTES_EVICT`, `BLOCK_CACHE_FILTER_BYTES_EVICT`, `BLOOM_FILTER_MICROS`, `NO_FILE_CLOSES`, `STALL_L0_SLOWDOWN_MICROS`, `STALL_MEMTABLE_COMPACTION_MICROS`, `STALL_L0_NUM_FILES_MICROS`, `RATE_LIMIT_DELAY_MILLIS`, `NO_ITERATORS`, `NUMBER_FILTERED_DELETES`, `WRITE_TIMEDOUT`, `BLOB_DB_GC_NUM_KEYS_OVERWRITTEN`, `BLOB_DB_GC_NUM_KEYS_EXPIRED`, `BLOB_DB_GC_BYTES_OVERWRITTEN`, `BLOB_DB_GC_BYTES_EXPIRED`, `BLOCK_CACHE_COMPRESSION_DICT_BYTES_EVICT`, as well as the histograms `STALL_L0_SLOWDOWN_COUNT`, `STALL_MEMTABLE_COMPACTION_COUNT`, `STALL_L0_NUM_FILES_COUNT`, `HARD_RATE_LIMIT_DELAY_COUNT`, `SOFT_RATE_LIMIT_DELAY_COUNT`, `BLOB_DB_GC_MICROS`, and `NUM_DATA_BLOCKS_READ_PER_LEVEL`. Note that as a result, the C++ enum values of the still-supported statistics have changed. Developers are advised not to rely on the actual numeric values.
- Deprecated IngestExternalFileOptions::write_global_seqno and changed its default to false. This option only needs to be set to true to generate a DB compatible with RocksDB versions before 5.16.0.
- Removed deprecated APIs `GetColumnFamilyOptionsFrom{Map|String}(const ColumnFamilyOptions&, ..)`, `GetDBOptionsFrom{Map|String}(const DBOptions&, ..)`, `GetBlockBasedTableOptionsFrom{Map|String}(const BlockBasedTableOptions& table_options, ..)` and `GetPlainTableOptionsFrom{Map|String}(const PlainTableOptions& table_options, ..)`.
- Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.
Build Changes
- The `make` build now builds a shared library by default instead of a static library. Use `LIB_MODE=static` to override.
New Features
- Compaction filters are now supported for wide-column entities by means of the `FilterV3` API. See the comments of the API for more details.
- Added `do_not_compress_roles` to `CompressedSecondaryCacheOptions` to disable compression for certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default.
- Added a new `MultiGetEntity` API that enables batched wide-column point lookups. See the API comments for more details, and the sketch after this list.
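A usage sketch, assuming RocksDB 8.0 and an already-open `db`; the exact `MultiGetEntity` overload (column family plus parallel key/result/status arrays) should be checked against rocksdb/db.h for your version.

```cpp
#include <vector>
#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

// Sketch (RocksDB >= 8.0 assumed): batched wide-column point lookups.
void LookupEntities(rocksdb::DB* db) {
  std::vector<rocksdb::Slice> keys{"user:1", "user:2"};
  std::vector<rocksdb::PinnableWideColumns> results(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());
  db->MultiGetEntity(rocksdb::ReadOptions(), db->DefaultColumnFamily(),
                     keys.size(), keys.data(), results.data(),
                     statuses.data());
  for (size_t i = 0; i < keys.size(); ++i) {
    if (!statuses[i].ok()) continue;
    for (const rocksdb::WideColumn& col : results[i].columns()) {
      // col.name() / col.value() are Slices into memory pinned by results[i].
      (void)col;
    }
  }
}
```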
RocksDB 7.9.2
7.9.2 (2022-12-21)
Bug Fixes
- Fixed a heap use after free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.
7.9.1 (2022-12-08)
Bug Fixes
- Fixed a regression in iterator where range tombstones after `iterate_upper_bound` are processed.
- Fixed a memory leak in MultiGet with the async_io read option, caused by IO errors during table file open.
Behavior changes
- Make best-efforts recovery verify SST unique ID before Version construction (#10962)
7.9.0 (2022-11-21)
Performance Improvements
- Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).
Bug Fixes
- Fixed a memory corruption error in scans with async_io enabled. The corruption happened when an IOError during a read produced an empty buffer and a buffer already undergoing an async read was submitted for reading again.
- Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as `Status::Corruption` in the case of `force_consistency_checks=true` (default). It affects use cases that enable both parallel flush (`max_background_flushes > 1` or `max_background_jobs >= 8`) and a non-default memtable count (`max_write_buffer_number > 2`).
- Fixed an issue where the `READ_NUM_MERGE_OPERANDS` ticker was not updated when the base key-value or tombstone was read from an SST file.
- Fixed a memory safety bug when using a SecondaryCache with `block_cache_compressed`. `block_cache_compressed` no longer attempts to use SecondaryCache features.
- Fixed a regression in scans with async_io. During seek, valid buffers were getting cleared, causing a regression.
- Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.
New Features
- Add basic support for user-defined timestamp to Merge (#10819).
- Add stats for ReadAsync time spent and async read errors.
- Basic support for the wide-column data model is now available. Wide-column entities can be stored using the `PutEntity` API, and retrieved using `GetEntity` and the new `columns` API of iterator. For compatibility, the classic APIs `Get` and `MultiGet`, as well as iterator's `value` API, return the value of the anonymous default column of wide-column entities; also, `GetEntity` and iterator's `columns` return any plain key-values in the form of an entity which only has the anonymous default column. `Merge` (and `GetMergeOperands`) currently also apply to the default column; any other columns of entities are unaffected by `Merge` operations. Note that some features like compaction filters, transactions, user-defined timestamps, and the SST file writer do not yet support wide-column entities; also, there is currently no `MultiGet`-like API to retrieve multiple entities at once. We plan to gradually close the above gaps and also implement new features like column-level operations (e.g. updating or querying only certain columns of an entity). A sketch follows this list.
- Marked HyperClockCache as a production-ready alternative to LRUCache for the block cache. HyperClockCache greatly improves hot-path CPU efficiency under high parallel load or high contention, with some documented caveats and limitations. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load.
- Add periodic diagnostics to info_log (LOG file) for the HyperClockCache block cache if performance is degraded by a bad `estimated_entry_charge` option.
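A minimal `PutEntity`/`GetEntity` sketch, assuming RocksDB 7.9 and an already-open `db`; key and column names are arbitrary examples.

```cpp
#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

// Sketch (RocksDB >= 7.9 assumed): write and read back a wide-column entity.
rocksdb::Status WriteAndReadEntity(rocksdb::DB* db) {
  rocksdb::WideColumns columns{{"name", "alice"}, {"city", "nyc"}};
  rocksdb::Status s = db->PutEntity(
      rocksdb::WriteOptions(), db->DefaultColumnFamily(), "user:1", columns);
  if (!s.ok()) return s;

  rocksdb::PinnableWideColumns result;
  s = db->GetEntity(rocksdb::ReadOptions(), db->DefaultColumnFamily(),
                    "user:1", &result);
  // result.columns() holds the entity's columns. A classic Get() of the
  // same key would return only the anonymous default column (absent here).
  return s;
}
```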
Public API Changes
- Marked `block_cache_compressed` as a deprecated feature. Use SecondaryCache instead.
- Added a `SecondaryCache::InsertSaved()` API, with a default implementation depending on `Insert()`. Some implementations might need to add a custom implementation of `InsertSaved()`. (Details in API comments.)
RocksDB 7.8.3
7.8.3 (2022-11-29)
- Revert an internal change in 7.8.0 associated with some memory usage churn.
7.8.2 (2022-11-27)
Behavior changes
- Make best-efforts recovery verify SST unique ID before Version construction (#10962)
- Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as `Status::Corruption` in the case of `force_consistency_checks=true` (default). It affects use cases that enable both parallel flush (`max_background_flushes > 1` or `max_background_jobs >= 8`) and a non-default memtable count (`max_write_buffer_number > 2`).
- Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.
Bug Fixes
- Fixed a regression in scans with async_io. During seek, valid buffers were getting cleared, causing a regression.
- Fixed a performance regression in iterator where range tombstones after `iterate_upper_bound` are processed.
7.8.1 (2022-11-02)
Bug Fixes
- Fixed a memory corruption error in scans with async_io enabled. The corruption happened when an IOError during a read produced an empty buffer and a buffer already undergoing an async read was submitted for reading again.
7.8.0 (2022-10-22)
New Features
- `DeleteRange()` now supports user-defined timestamp.
- Provide support for async_io with tailing iterators when ReadOptions.tailing is enabled during scans.
- Tiered Storage: allow data moving up from the last level to the penultimate level if the input level is penultimate level or above.
- Added `DB::Properties::kFastBlockCacheEntryStats`, which is similar to `DB::Properties::kBlockCacheEntryStats`, except it returns cached (stale) values in more cases to reduce overhead.
- FIFO compaction now supports migrating from a multi-level DB via DB::Open(). During the migration phase, the FIFO compaction picker:
  - picks the SST file with the smallest starting key in the bottom-most non-empty level.
  - Note that during the migration phase, the file purge order will only be an approximation of "FIFO", as files in lower levels might sometimes contain newer keys than files in upper levels.
- Added an option `ignore_max_compaction_bytes_for_input` to ignore the max_compaction_bytes limit when adding files to be compacted from the input level. This should help reduce write amplification. The option is enabled by default.
- Tiered Storage: allow data to move up from the last level even in a last-level-only compaction, as long as the penultimate level is empty.
- Add a new option IOOptions.do_not_recurse that can be used by underlying file systems to skip recursing through subdirectories and list only files in the GetChildren API.
- Add option `preserve_internal_time_seconds` to preserve the time information for the latest data, which can be used to determine the age of data when `preclude_last_level_data_seconds` is enabled. The time information is attached to SSTs in the table property `rocksdb.seqno.time.map`, which can be parsed by the ldb or sst_dump tools. A sketch follows this list.
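A configuration sketch for the time-tracking options above, assuming RocksDB 7.8; the durations are arbitrary examples.

```cpp
#include "rocksdb/options.h"

// Sketch (RocksDB >= 7.8 assumed): keep ~1 hour of write-time information
// so the age of recent data can be estimated, and keep the newest 10
// minutes of data out of the last (cold) level in tiered storage.
void ConfigureTimeTracking(rocksdb::Options& options) {
  options.preserve_internal_time_seconds = 60 * 60;    // arbitrary
  options.preclude_last_level_data_seconds = 10 * 60;  // arbitrary
}
```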
Bug Fixes
- Fix a bug in io_uring_prep_cancel in the AbortIO API for posix, which expects sqe->addr to match the submitted read request; the wrong parameter was being passed.
- Fixed a regression, introduced in #10449, in iterator performance when the entire DB is a single memtable. The fix is in #10705 and #10716.
- Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
- Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).
- Fixed a bug causing manual flush with `flush_opts.wait=false` to stall when the database has stopped all writes (#10001).
- Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinning (#10770).
- Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).
- Fixed a memory safety bug in experimental HyperClockCache (#10768)
- Fixed some cases where `ldb update_manifest` and `ldb unsafe_remove_sst_file` were not usable because they required the DB files to match the existing manifest state (before updating the manifest to match a desired state).
Performance Improvements
- Try to align compaction output file boundaries to those of the next level, which can reduce compaction load by more than 10% for the default level compaction. The feature is enabled by default; to disable it, set `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size` to false (see the sketch after this list). As a side effect, it can create SSTs larger than target_file_size (capped at 2x target_file_size) or smaller files.
- Improved RoundRobin TTL compaction to move the compaction cursor the same way as normal RoundRobin compaction.
- Fixed a small CPU regression caused by making UserComparatorWrapper Customizable, because Customizable itself has a small CPU overhead for initialization.
- Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).
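A minimal sketch of opting out of the new file-cutting behavior, assuming RocksDB 7.8:

```cpp
#include "rocksdb/options.h"

// Sketch (RocksDB >= 7.8 assumed): disable dynamic compaction output file
// sizing when strictly bounded SST sizes matter more than compaction load.
void DisableDynamicFileSize(rocksdb::Options& options) {
  options.level_compaction_dynamic_file_size = false;
}
```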
Behavior Changes
- Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).
Public API changes
- Make kXXH3 checksum the new default, because it is faster on common hardware, especially with kCRC32c affected by a performance bug in some versions of clang (#9891). DBs written with this new setting can be read by RocksDB 6.27 and newer.
- Refactor the classes, APIs and data structures for block cache tracing to allow a user-provided trace writer to be used. Introduced an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user-provided TraceWriter. More details in include/rocksdb/block_cache_trace_writer.h.
RocksDB 7.7.8
7.7.8 (2022-11-27)
Bug Fixes
- Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as `Status::Corruption` in the case of `force_consistency_checks=true` (default). It affects use cases that enable both parallel flush (`max_background_flushes > 1` or `max_background_jobs >= 8`) and a non-default memtable count (`max_write_buffer_number > 2`).
- Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.
- Fixed a regression in iterator where range tombstones after `iterate_upper_bound` are processed.
7.7.7 (2022-11-15)
Bug Fixes
- Fixed a regression in scans with async_io. During seek, valid buffers were getting cleared, causing a regression.
7.7.6 (2022-11-03)
Bug Fixes
- Fixed a memory corruption error in scans with async_io enabled. The corruption happened when an IOError during a read produced an empty buffer and a buffer already undergoing an async read was submitted for reading again.
7.7.5 (2022-10-28)
Bug Fixes
- Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).
7.7.4 (2022-10-28)
Bug Fixes
- Fixed a case of calling malloc_usable_size on the result of operator new[].
RocksDB 7.7.3
RocksDB 7.7.2
7.7.2 (2022-10-05)
Bug Fixes
- Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinniung (#10770).
- Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).
Behavior Changes
- Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).
7.7.1 (2022-09-26)
Bug Fixes
- Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
- Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).
7.7.0 (2022-09-18)
Bug Fixes
- Fixed a hang when an operation such as `GetLiveFiles` or `CreateNewBackup` is asked to trigger and wait for memtable flush on a read-only DB. Such indirect requests for memtable flush are now ignored on a read-only DB.
- Fixed a bug where `FlushWAL(true /* sync */)` (used by `GetLiveFilesStorageInfo()`, which is used by checkpoint and backup) could cause parallel writes at the tail of a WAL file to never be synced.
- Fix periodic_task being unable to re-register the same task type, which may cause `SetOptions()` to fail to update periodic task times like `stats_dump_period_sec`, `stats_persist_period_sec`.
- Fixed a bug in the rocksdb.prefetched.bytes.discarded stat. It was counting the prefetch buffer size rather than the actual number of bytes discarded from the buffer.
- Fix a bug where the directory containing CURRENT could be left unsynced after CURRENT is updated to point to the latest MANIFEST, which leads to a risk of unsynced data loss of CURRENT.
- Update rocksdb.multiget.io.batch.size stat in non-async MultiGet as well.
- Fix a bug in key range overlap checking with concurrent compactions when user-defined timestamp is enabled. User-defined timestamps should be EXCLUDED when checking if two ranges overlap.
- Fixed a bug where the blob cache prepopulating logic did not consider the secondary cache (see #10603).
- Fixed the rocksdb.num.sst.read.per.level, rocksdb.num.index.and.filter.blocks.read.per.level and rocksdb.num.level.read.per.multiget stats in the MultiGet coroutines
- Fix a bug in io_uring_prep_cancel in the AbortIO API for posix, which expects sqe->addr to match the submitted read request; the wrong parameter was being passed.
- Fixed a regression, introduced in #10449, in iterator performance when the entire DB is a single memtable. The fix is in #10705 and #10716.
Public API changes
- Add `rocksdb_column_family_handle_get_id`, `rocksdb_column_family_handle_get_name` to get the id and name of a column family in the C API.
- Add a new stat rocksdb.async.prefetch.abort.micros to measure time spent waiting for async prefetch reads to abort.
Java API Changes
- Add CompactionPriority.RoundRobin.
- Revert to using the default metadata charge policy when creating an LRU cache via the Java API.
Behavior Change
- DBOptions::verify_sst_unique_id_in_manifest is now an on-by-default feature that verifies SST file identity whenever they are opened by a DB, rather than only at DB::Open time.
- Right now, when the option migration tool (OptionChangeMigration()) migrates to FIFO compaction, it compacts all the data into one single SST file and moves it to L0. This might create a problem for some users: the giant file may soon be deleted to satisfy max_table_files_size, and might cause the DB to be almost empty. We changed the behavior so that the files are cut smaller, but these files might not follow the data insertion order. With the change, after the migration, migrated data might not be dropped in insertion order by FIFO compaction.
- When a block is first found in `CompressedSecondaryCache`, we just insert a dummy block into the primary cache and don't erase the block from `CompressedSecondaryCache`. A standalone handle is returned to the caller. Only if the block is found again in `CompressedSecondaryCache` before the dummy block is evicted do we erase the block from `CompressedSecondaryCache` and insert it into the primary cache.
- When a block is first evicted from the primary cache to `CompressedSecondaryCache`, we just insert a dummy block in `CompressedSecondaryCache`. Only if it is evicted again before the dummy block is evicted from the cache is it treated as a hot block and inserted into `CompressedSecondaryCache`.
- Improved the estimation of memory used by cached blobs by taking into account the size of the object owning the blob value and also the allocator overhead if `malloc_usable_size` is available (see #10583).
- Blob values now have their own category in the cache occupancy statistics, as opposed to being lumped into the "Misc" bucket (see #10601).
- Change the optimize_multiget_for_io experimental ReadOptions flag to default on.
New Features
- RocksDB does internal auto prefetching if it notices 2 sequential reads when readahead_size is not specified. A new option `num_file_reads_for_auto_readahead` is added to BlockBasedTableOptions which indicates after how many sequential reads internal auto prefetching should start (default is 2).
- Added new perf context counters `block_cache_standalone_handle_count`, `block_cache_real_handle_count`, `compressed_sec_cache_insert_real_count`, `compressed_sec_cache_insert_dummy_count`, `compressed_sec_cache_uncompressed_bytes`, and `compressed_sec_cache_compressed_bytes`.
- Memory for blobs which are to be inserted into the blob cache is now allocated using the cache's allocator (see #10628 and #10647).
- HyperClockCache is an experimental, lock-free Cache alternative for the block cache that offers much improved CPU efficiency under high parallel load or high contention, with some caveats. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load. A sketch follows this list.
- `CompressedSecondaryCacheOptions::enable_custom_split_merge` is added for enabling the custom split and merge feature, which splits the compressed value into chunks so that they may better fit jemalloc bins.
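A construction sketch for the experimental HyperClockCache, assuming RocksDB 7.7; capacity and `estimated_entry_charge` are arbitrary examples (a badly chosen charge degrades performance, per the 7.9.0 diagnostics note above).

```cpp
#include <memory>
#include "rocksdb/cache.h"
#include "rocksdb/table.h"

// Sketch (RocksDB >= 7.7 assumed): build a HyperClockCache block cache.
// estimated_entry_charge should approximate the average block size.
std::shared_ptr<rocksdb::Cache> MakeHyperClockCache() {
  rocksdb::HyperClockCacheOptions opts(
      /*_capacity=*/size_t{1} << 30,          // 1 GiB, arbitrary
      /*_estimated_entry_charge=*/32 * 1024,  // ~average block size
      /*_num_shard_bits=*/-1);                // auto
  return opts.MakeSharedCache();
}

// Usage: plug into BlockBasedTableOptions::block_cache as usual.
void UseAsBlockCache(rocksdb::BlockBasedTableOptions& table_options) {
  table_options.block_cache = MakeHyperClockCache();
}
```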
Performance Improvements
- Iterator performance is improved for `DeleteRange()` users. Internally, iterators now skip to the end of a range tombstone when possible, instead of looping through each key and checking individually whether a key is range-deleted.
- Eliminated some allocations and copies in the blob read path. Also, `PinnableSlice` now only points to the blob value and pins the backing resource (cache entry or buffer) in all cases, instead of containing a copy of the blob value. See #10625 and #10647.
- In the case of scans with async_io enabled, a few optimizations have been added to issue more asynchronous requests in parallel in order to avoid synchronous prefetching.
- `DeleteRange()` users should see improvement in get/iterator performance from the mutable memtable (see #10547).
7.6.0 (2022-08-19)
New Features
- Added `prepopulate_blob_cache` to ColumnFamilyOptions. If enabled, prepopulate warm/hot blobs which are already in memory into the blob cache at the time of flush. On a flush, the blob that is in memory (in memtables) gets flushed to the device. If using Direct IO, additional IO is incurred to read this blob back into memory again, which is avoided by enabling this option. This further helps if the workload exhibits high temporal locality, where most of the reads go to recently written data. This also helps in the case of a remote file system, since it involves network traffic and higher latencies.
- Support using a secondary cache with the blob cache. When creating a blob cache, the user can set a secondary blob cache by configuring `secondary_cache` in LRUCacheOptions.
- Charge memory usage of the blob cache when the backing cache of the blob cache and the block cache are different. If an operation reserving memory for the blob cache exceeds the available space left in the block cache at some point (i.e., causing a cache full under `LRUCacheOptions::strict_capacity_limit` = true), creation will fail with `Status::MemoryLimit()`. To opt in to this feature, enable charging `CacheEntryRole::kBlobCache` in `BlockBasedTableOptions::cache_usage_options`.
- Improved subcompaction range partitioning so that it is likely to be more even. More even distribution of subcompactions will improve compaction throughput for some workloads. All input files' index blocks are sampled for some anchor key points, from which we pick positions to partition the input range. This introduces some CPU overhead in the compaction preparation phase, if subcompaction is enabled, but it should be a small fraction of the CPU usage of the whole compaction process. This also brings a behavior change: the subcompaction number is much more likely to be maxed out than before.
- Add CompactionPri::kRoundRobin, a compaction picking mode that cycles through all the files with a compact cursor in a round-robin manner. This feature is available since 7.5.
- Provide support for subcompactions with user-defined timestamps.
- Added an option `memtable_protection_bytes_per_key` that turns on memtable per key-value checksum protection. Each memtable entry is suffixed by a checksum that is computed during writes and verified in reads/compaction. Detected corruption is logged and the corruption status is returned to the user.
- Added a blob-specific cache priority level - bottom level. Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. The user can specify the new option `low_pri_pool_ratio` in `LRUCacheOptions` to configure the ratio of capacity reserved for low-priority cache entries (the remaining ratio is the space reserved for the bottom level), or configure the new argument `low_pri_pool_ratio` in `NewLRUCache()` to achieve the same effect. A configuration sketch follows this list.
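A sketch tying together a few of the options above, assuming RocksDB 7.6; all sizes and ratios are arbitrary examples.

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/options.h"

// Sketch (RocksDB >= 7.6 assumed): blob cache with a low-pri pool, plus
// memtable per key-value checksum protection.
void ConfigureBlobCaching(rocksdb::Options& options) {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 256 << 20;       // 256 MiB, arbitrary
  cache_opts.high_pri_pool_ratio = 0.5;  // arbitrary
  cache_opts.low_pri_pool_ratio = 0.2;   // remainder feeds the bottom pool
  options.enable_blob_files = true;
  options.blob_cache = rocksdb::NewLRUCache(cache_opts);

  // Memtable per key-value checksums: 0 disables; 1, 2, 4, or 8 bytes.
  options.memtable_protection_bytes_per_key = 8;
}
```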
Public API changes
- Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.
- `CompactRangeOptions::exclusive_manual_compaction` is now false by default. This ensures RocksDB does not introduce artificial parallelism limitations by default.
- Tiered Storage: changed `bottommost_temperature` to `last_level_temperature`. The old option name is kept only for migration; please use the new option. The behavior is changed to apply the temperature to the `last_level` SST files only.
- Added a new experimental ReadOptions flag called optimize_multiget_for_io, which when set attempts to reduce MultiGet latency by spawning coroutines for keys in multiple levels.
Bug Fixes
- Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to `FSDirectory::Fsync` or `FSDirectory::Close` after the first `FSDirectory::Close`; also, valgrind could report a call to `close()` with `fd=-1`.)
- Fix a bug where `GenericRateLimiter` could revert the bandwidth set dynamically using `SetBytesPerSecond()` when a user configures a structure enclosing it, e.g., using `GetOptionsFromString()` to configure an `Options` that references an existing `RateLimiter` object.
- Fix race conditions in `GenericRateLimiter`.
- Fix a bug in `FIFOCompactionPicker::PickTTLCompaction` where the total_size calculation might cause underflow.
- Fix a data race bug in the hash linked list memtable. With this bug, a read request might temporarily miss an old record in the memtable in a race condition with the hash bucket.
- Fix a bug where `best_efforts_recovery` may fail to open the db with mmap read.
- Fixed a bug where blobs read during compaction would pollute the cache.
- Fixed a data race in LRUCache when used with a secondary_cache.
- Fixed a bug where blobs read by iterators would be inserted into the cache even with the `fill_cache` read option set to false.
- Fixed a segfault caused by `AllocateData()` in `CompressedSecondaryCache::SplitValueIntoChunks()` and `MergeChunksIntoValueTest`.
- Fixed a bug in BlobDB where a mix of inlined and blob values could result in an incorrect value being passed to the compaction filter (see #10391).
- Fixed a memory leak bug in stress tests caused by `FaultInjectionSecondaryCache`.
Behavior Change
- Added a checksum handshake during the copying of decompressed WAL fragments. This, together with #9875, #10037, #10212, #10114 and #10319, provides end-to-end integrity protection for the write batch during recovery.
- To minimize the internal fragmentation caused by the variable size of the compressed blocks in `CompressedSecondaryCache`, the original block is split according to the jemalloc bin size in `Insert()` and then merged back in `Lookup()`.
- PosixLogger is removed and by default EnvLogger will be used for info logging. The behavior of the two loggers should be very similar when using the default Posix Env.
- Remove [min|max]_timestamp from VersionEdit for now, since they are not tracked in MANIFEST anyway but consume two empty std::strings (up to 64 bytes) for each file. Should they be added back in the future, we should store them more compactly.
- Improve the universal tiered storage compaction picker to avoid extra major compactions triggered by size amplification. If `preclude_last_level_data_seconds` is enabled, the size amplification is calculated within non-last_level data only, which skips the last level and uses the penultimate level as the size base.
- If an error is hit when writing to a file (append, sync, etc.), RocksDB is stricter about not issuing further operations to it, except closing the file, with exceptions for some WAL file operations in the error recovery path.
- A `WriteBufferManager` constructed with `allow_stall == false` will no longer trigger write stalls implicitly by thrashing until the memtable count limit is reached. Instead, a column family can continue accumulating writes while that CF is flushing, which means memory may increase. Users who prefer stalling writes must now explicitly set `allow_stall == true`. See the sketch after this list.
- Add `CompressedSecondaryCache` into the stress tests.
- Block cache keys have changed, which will cause any persistent caches to miss between versions.
Performance Improvements
- Instead of constructing `FragmentedRangeTombstoneList` during every read operation, it is now constructed once and stored in immutable memtables. This improves the speed of querying range tombstones from immutable memtables.
- When using iterators with the integrated BlobDB implementation, blob cache handles are now released immediately when the iterator's position changes.
- MultiGet can now do more IO in parallel by reading data blocks from SST files in multiple levels, if the optimize_multiget_for_io ReadOption flag is set.