Releases: facebook/rocksdb
RocksDB 9.5.2
9.5.2 (2024-08-13)
Bug Fixes
- Fix a race condition in pessimistic transactions that could allow multiple transactions with the same name to be registered simultaneously, resulting in a crash or other unpredictable behavior.
Public API Changes
- Add ticker stats to count file read retries due to checksum mismatch
9.5.1 (2024-08-02)
Bug Fixes
- Made `DestroyDB` support slow deletion when it is configured in `SstFileManager`. The slow deletion is subject to the configured `rate_bytes_per_sec`, but not to the `max_trash_db_ratio`.
9.5.0 (2024-07-19)
Public API Changes
- Introduced a new C API function `rocksdb_writebatch_iterate_cf` for column-family-aware iteration over the contents of a WriteBatch.
- Added support for ingesting SST files generated by a DB instead of `SstFileWriter`. This can be enabled with the experimental option `IngestExternalFileOptions::allow_db_generated_files`.
Behavior Changes
- When calculating the total log size for the `log_size_for_flush` argument in the `CreateCheckpoint` API, the size of archived logs is no longer included, to avoid unnecessary flushes.
Bug Fixes
- Fix a major bug in which an iterator using prefix filtering and SeekForPrev might miss data when the DB is using `whole_key_filtering=false` and `partition_filters=true`.
- Fixed a bug where `OnErrorRecoveryBegin()` is not called before auto recovery starts.
- Fixed a bug where an event listener reads ErrorHandler's `bg_error_` member without holding the db mutex (#12803).
- Fixed a bug in handling MANIFEST write errors that caused the latest valid MANIFEST file to get deleted, resulting in the DB being unopenable.
- Fixed a race between error recovery due to manifest sync or write failure and external SST file ingestion. Both attempt to write a new manifest file, which causes an assertion failure.
Performance Improvements
- Fix an issue where compactions were opening table files and reading table properties while holding db mutex_.
- Reduce unnecessary filesystem queries and DB mutex acquires in creating backups and checkpoints.
RocksDB 9.4.0
9.4.0 (2024-06-23)
New Features
- Added a `CompactForTieringCollectorFactory` to auto-trigger compaction for the tiering use case.
- Optimistic transactions and pessimistic transactions with the WriteCommitted policy now support the `GetEntityForUpdate` API.
- Added a new "count" command to the ldb repl shell. By default, it prints a count of keys in the database from start to end. The options `--from=` and/or `--to=` can be specified to limit the range.
- Added `rocksdb_writebatch_update_timestamps` and `rocksdb_writebatch_wi_update_timestamps` to the C API.
- Added `rocksdb_iter_refresh` to the C API.
- Added `rocksdb_writebatch_create_with_params` and `rocksdb_writebatch_wi_create_with_params` to create a WriteBatch and WriteBatchWithIndex with all options in the C API.
Public API Changes
- Deprecated the names `LogFile` and `VectorLogPtr` in favor of the new names `WalFile` and `VectorWalPtr`.
- Introduced a new universal compaction option, `CompactionOptionsUniversal::max_read_amp`, which allows users to define a limit on the number of sorted runs separately from the trigger for compaction (`level0_file_num_compaction_trigger`) (#12477).
Behavior Changes
- Inactive WALs are immediately closed upon being fully synced rather than in a background thread. This is to ensure LinkFile() is not called on files still open for write, which might not be supported by some FileSystem implementations. This should not be a performance issue, but an opt-out is available with the new DB option `background_close_inactive_wals`.
Bug Fixes
- Fix a rare case in which a hard-linked WAL in a Checkpoint is not fully synced (so might lose data on power loss).
- Fixed the output of the `ldb dump_wal` command for `PutEntity` records so it prints the key and correctly resets the hexadecimal formatting flag after printing the wide-column entity.
- Fixed an issue where `PutEntity` records were handled incorrectly while rebuilding transactions during recovery.
- Various read operations could ignore various ReadOptions that might be relevant. Fixed many such cases, which can result in behavior changes but a better reflection of the specified options.
Performance Improvements
- Improved write throughput to the memtable when there is a large number of concurrent writers and `allow_concurrent_memtable_write=true` (#12545).
RocksDB 9.3.1
9.3.1 (2024-05-25)
Bug Fixes
- [internal only] Build script improvement
9.3.0 (2024-05-17)
New Features
- Optimistic transactions and pessimistic transactions with the WriteCommitted policy now support the `GetEntity` API.
- Added a new `Iterator` property, "rocksdb.iterator.is-value-pinned", for checking whether the `Slice` returned by `Iterator::value()` can be used until the `Iterator` is destroyed.
- Optimistic transactions and WriteCommitted pessimistic transactions now support the `MultiGetEntity` API.
- Optimistic transactions and pessimistic transactions with the WriteCommitted policy now support the `PutEntity` API. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added later.
Public API Changes
- Exposed block-based metadata cache options via the C API.
- Exposed compaction pri via the C API.
- Add a kAdmPolicyAllowAll option to TieredAdmissionPolicy that admits all blocks evicted from the primary block cache into the compressed secondary cache.
Behavior Changes
- CompactRange() with change_level=true on a CF with FIFO compaction will return Status::NotSupported().
- External file ingestion with FIFO compaction will always ingest to L0.
Bug Fixes
- Fixed a bug for databases using `DBOptions::allow_2pc == true` (all `TransactionDB`s except `OptimisticTransactionDB`) that have exactly one column family. Due to a missing WAL sync, attempting to open the DB could have returned a `Status::Corruption` with a message like "SST file is ahead of WALs".
- Fixed a bug in CreateColumnFamilyWithImport() where, if multiple CFs are imported, files' epoch numbers were not reset, so L0 files could have overlapping key ranges but the same epoch number.
- Fixed race conditions between user overwrites and reads on the same key when `ColumnFamilyOptions::inplace_update_support == true`.
- Fixed a bug where `CompactFiles()` can compact files whose ranges conflict with other ongoing compactions when `preclude_last_level_data_seconds > 0` is used.
- Fixed a false positive `Status::Corruption` reported when reopening a DB that used `DBOptions::recycle_log_file_num > 0` and `DBOptions::wal_compression != kNoCompression`.
- While the WAL is locked with LockWAL(), some operations like Flush() and IngestExternalFile() are now blocked as they should have been.
- Fixed a bug causing stale memory access when using the TieredSecondaryCache with an NVM secondary cache and a file system that supports returning an FS-allocated buffer for MultiRead (`FSSupportedOps::kFSBuffer` is set).
RocksDB 9.2.1
9.2.1 (2024-05-03)
Public API Changes
- Add a kAdmPolicyAllowAll option to TieredAdmissionPolicy that admits all blocks evicted from the primary block cache into the compressed secondary cache.
9.2.0 (2024-05-01)
New Features
- Added two options, `deadline` and `max_size_bytes`, for CacheDumper to exit early.
- Added a new API `GetEntityFromBatchAndDB` to `WriteBatchWithIndex` that can be used for wide-column point lookups with read-your-own-writes consistency. Similarly to `GetFromBatchAndDB`, the API can combine data from the write batch with data from the underlying database if needed. See the API comments for more details.
- [Experimental] Introduced two new cross-column-family iterators: CoalescingIterator and AttributeGroupIterator. The CoalescingIterator enables users to iterate over multiple column families and access their values and columns. During this iteration, if the same key exists in more than one column family, the keys in the later column family will overshadow the previous ones. The AttributeGroupIterator allows users to gather wide columns per column family and create attribute groups while iterating over keys across all CFs.
- Added a new API `MultiGetEntityFromBatchAndDB` to `WriteBatchWithIndex` that can be used for batched wide-column point lookups with read-your-own-writes consistency. Similarly to `MultiGetFromBatchAndDB`, the API can combine data from the write batch with data from the underlying database if needed. See the API comments for more details.
- Added a `SstFileReader::NewTableIterator` API to support programmatically reading an SST file as a raw table file.
- Added an option, `wait_for_purge`, to `WaitForCompactOptions` to make the `WaitForCompact()` API wait for background purge to complete.
Public API Changes
- DeleteRange() will return NotSupported() if row_cache is configured since they don't work together in some cases.
- Deprecated `CompactionOptions::compression` since `CompactionOptions`' API for configuring compression was incomplete, unsafe, and likely unnecessary.
- Using `OptionChangeMigration()` to migrate from non-FIFO to FIFO compaction with `Options::compaction_options_fifo.max_table_files_size > 0` can cause the whole DB to be dropped right after migration if the migrated data is larger than `max_table_files_size`.
Behavior Changes
- Enabling `BlockBasedTableOptions::block_align` is now incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`.
- Changed the default value of `CompactionOptions::compression` to `kDisableCompressionOption`, which means the compression type is determined by the `ColumnFamilyOptions`.
- `BlockBasedTableOptions::optimize_filters_for_memory` is now set to true by default. When `partition_filters=false`, this could lead to somewhat increased average RSS memory usage by the block cache, but this "extra" usage is within the allowed memory budget and should make memory usage more consistent (by minimizing internal fragmentation for more kinds of blocks).
- Dump all keys for the cache dumper impl if `SetDumpFilter()` is not called.
- `CompactRange()` with `CompactRangeOptions::change_level = true` and `CompactRangeOptions::target_level = 0` that ends up moving more than 1 file from non-L0 to L0 will return `Status::Aborted()`.
- On distributed file systems that support file-system-level checksum verification and reconstruction reads, RocksDB will now retry a file read if the initial read fails RocksDB block-level or record-level checksum verification. This applies to MANIFEST file reads when the DB is opened, and to SST file reads at all times.
Bug Fixes
- Fix a bug causing `VerifyFileChecksums()` to return false-positive corruption under `BlockBasedTableOptions::block_align=true`.
- Provide a consistent view of the database across the column families for the `NewIterators()` API.
- Fixed a feature interaction bug for `DeleteRange()` together with `ColumnFamilyOptions::memtable_insert_with_hint_prefix_extractor`. The impact of this bug would likely be corruption or crashing.
- Fixed a hang in `DisableManualCompactions()` where compactions waiting to be scheduled due to conflicts would not be canceled promptly.
- Fixed a regression when `ColumnFamilyOptions::max_successive_merges > 0` where the CPU overhead for deciding whether to merge could have increased unless the user had set the option `ColumnFamilyOptions::strict_max_successive_merges`.
- Fixed a bug in `MultiGet()` and `MultiGetEntity()` together with blob files (`ColumnFamilyOptions::enable_blob_files == true`). An error looking up one of the keys could cause the results to be wrong for other keys for which the statuses were `Status::OK`.
- Fixed a bug where wrong padded bytes were used to generate the file checksum and `DataVerificationInfo::checksum` upon file creation.
- Correctly implemented the move semantics of `PinnableWideColumns`.
- Fixed a bug when recycle_log_file_num in DBOptions is changed from 0 to non-zero when a DB is reopened. On a subsequent reopen, if a log file created when recycle_log_file_num==0 was reused previously, is alive, and is empty, we could end up inserting stale WAL records into the memtable.
- Fixed a bug where obsolete files' deletion during DB::Open is not rate-limited by `SstFileManager`'s slow deletion feature even if it is configured.
RocksDB 9.1.1
9.1.1 (2024-04-17)
Bug Fixes
- Fixed Java `SstFileMetaData` to prevent throwing `java.lang.NoSuchMethodError`.
- Fixed a regression when `ColumnFamilyOptions::max_successive_merges > 0` where the CPU overhead for deciding whether to merge could have increased unless the user had set the option `ColumnFamilyOptions::strict_max_successive_merges`.
RocksDB 9.1.0
9.1.0 (2024-03-22)
New Features
- Added an option, `GetMergeOperandsOptions::continue_cb`, to give users the ability to end `GetMergeOperands()`'s lookup process before all merge operands are found.
- Added sanity checks for ingesting external files that currently check if the user key comparator used to create the file is compatible with the column family's user key comparator.
- Support ingesting external files for column families that have user-defined timestamps in memtable only enabled.
- On file systems that support storage-level data checksums and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read.
- Some enhancements and fixes to experimental Temperature handling features, including a new `default_write_temperature` CF option and opening an `SstFileWriter` with a temperature.
- `WriteBatchWithIndex` now supports wide-column point lookups via the `GetEntityFromBatch` API. See the API comments for more details.
- Implemented experimental features: the API `Iterator::GetProperty("rocksdb.iterator.write-time")` to allow users to get data's approximate write unix time, and writing data with a specific write time via the `WriteBatch::TimedPut` API.
Public API Changes
- Best-effort recovery (`best_efforts_recovery == true`) may now be used together with atomic flush (`atomic_flush == true`). The all-or-nothing recovery guarantee for atomically flushed data will be upheld.
- Removed the deprecated option `bottommost_temperature`, already replaced by `last_level_temperature`.
- Added new PerfContext counters for block cache bytes read: block_cache_index_read_byte, block_cache_filter_read_byte, block_cache_compression_dict_read_byte, and block_cache_read_byte.
- Deprecated the experimental Remote Compaction APIs StartV2() and WaitForCompleteV2() and introduced Schedule() and Wait(). The new APIs do essentially the same thing as the old APIs. They allow taking an externally generated unique id to wait for remote compaction to complete.
- For the API `WriteCommittedTransaction::GetForUpdate`, if the column family enables user-defined timestamps, it was mandated that the argument `do_validate` cannot be false, and UDT-based validation had to be done with a user-set read timestamp. This is updated to make the UDT-based validation optional if the user sets `do_validate` to false and does not set a read timestamp. With this, `GetForUpdate` skips UDT-based validation and it is the user's responsibility to enforce the UDT invariant. So DO NOT skip this UDT-based validation if you do not have a way to enforce the UDT invariant. Ways to enforce the invariant on the user side include managing a monotonically increasing timestamp, committing transactions in a single thread, etc.
- Defined a new PerfLevel, `kEnableWait`, to measure time spent by user threads blocked in RocksDB other than on a mutex, such as a write thread waiting to be added to a write group or a write thread delayed or stalled.
- `RateLimiter`'s API no longer requires the burst size to be the refill size. Users of `NewGenericRateLimiter()` can now provide the burst size in `single_burst_bytes`. Implementors of `RateLimiter::SetSingleBurstBytes()` need to adapt their implementations to match the changed API doc.
- Added `write_memtable_time` to the newly introduced PerfLevel `kEnableWait`.
Behavior Changes
- `RateLimiter`s created by `NewGenericRateLimiter()` no longer modify the refill period when `SetSingleBurstBytes()` is called.
- Merge writes will only keep the merge operand count within `ColumnFamilyOptions::max_successive_merges` when the key's merge operands are all found in memory, unless `strict_max_successive_merges` is explicitly set.
Bug Fixes
- Fixed `kBlockCacheTier` reads to return `Status::Incomplete` when I/O is needed to fetch a merge chain's base value from a blob file.
- Fixed `kBlockCacheTier` reads to return `Status::Incomplete` on a table cache miss rather than incorrectly returning an empty value.
- Fixed a data race in WalManager that may affect how frequently PurgeObsoleteWALFiles() runs.
- Re-enable the recycle_log_file_num option in DBOptions for kPointInTimeRecovery WAL recovery mode, which was previously disabled due to a bug in the recovery logic. This option is incompatible with WriteOptions::disableWAL. A Status::InvalidArgument() will be returned if disableWAL is specified.
Performance Improvements
- Java API `multiGet()` variants now take advantage of the underlying batched `multiGet()` performance improvements.
Before:
```
Benchmark                          (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize)  Mode Cnt    Score    Error Units
MultiGetBenchmarks.multiGetList10        no_column_family      10000        16            100          64 thrpt  25 6315.541 ±  8.106 ops/s
MultiGetBenchmarks.multiGetList10        no_column_family      10000        16            100        1024 thrpt  25 6975.468 ± 68.964 ops/s
```
After:
```
Benchmark                          (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize)  Mode Cnt    Score    Error Units
MultiGetBenchmarks.multiGetList10        no_column_family      10000        16            100          64 thrpt  25 7046.739 ± 13.299 ops/s
MultiGetBenchmarks.multiGetList10        no_column_family      10000        16            100        1024 thrpt  25 7654.521 ± 60.121 ops/s
```
RocksDB 9.0.1
9.0.1 (2024-04-11)
Bug Fixes
- Fixed CMake Javadoc and source jar builds
- Fixed Java `SstFileMetaData` to prevent throwing `java.lang.NoSuchMethodError`
RocksDB 8.11.4
8.11.4 (2024-04-09)
Bug Fixes
- Fixed CMake Javadoc build
- Fixed Java `SstFileMetaData` to prevent throwing `java.lang.NoSuchMethodError`
RocksDB 9.0.0
9.0.0 (2024-02-16)
New Features
- Provide support for FSBuffer for point lookups. Also added support for scans and compactions that don't go through prefetching.
- Made `SstFileWriter` create SST files without persisting user-defined timestamps when the `Option.persist_user_defined_timestamps` flag is set to false.
- Added support for user-defined timestamps in the APIs `DeleteFilesInRanges` and `GetPropertiesOfTablesInRange`.
. - Mark wal_compression feature as production-ready. Currently only compatible with ZSTD compression.
Public API Changes
- Allow setting Stderr logger via C API
- Declare one Get and one MultiGet variant as pure virtual, and make all the other variants non-overridable. The methods required to be implemented by derived classes of DB allow returning timestamps. It is up to the implementation to check and return an error if timestamps are not supported. The non-batched MultiGet APIs are reimplemented in terms of batched MultiGet, so callers might see a performance improvement.
- Exposed the mode option for the Rate Limiter via the C API.
- Removed deprecated option `access_hint_on_compaction_start`.
- Removed deprecated option `ColumnFamilyOptions::check_flush_compaction_key_order`.
- Removed the default `WritableFile::GetFileSize` and `FSWritableFile::GetFileSize` implementations that return 0 and made them pure virtual, so that subclasses are enforced to explicitly provide an implementation.
- Removed deprecated option `ColumnFamilyOptions::level_compaction_dynamic_file_size`.
- Removed tickers with typos: "rocksdb.error.handler.bg.errro.count", "rocksdb.error.handler.bg.io.errro.count", "rocksdb.error.handler.bg.retryable.io.errro.count".
- Removed the force mode for the `EnableFileDeletions` API because it is unsafe with no known legitimate use.
- Removed deprecated option `ColumnFamilyOptions::ignore_max_compaction_bytes_for_input`.
- `sst_dump --command=check` now compares the number of records in a table with `num_entries` in the table properties, and reports corruption if there is a mismatch. The API `SstFileDumper::ReadSequential()` is updated to optionally do this verification. (#12322)
Behavior Changes
- format_version=6 is the new default setting in BlockBasedTableOptions, for more robust data integrity checking. DBs and SST files written with this setting cannot be read by RocksDB versions before 8.6.0.
- Compactions can be scheduled in parallel in an additional scenario: multiple files are marked for compaction within a single column family
- For leveled compaction, RocksDB will try to do intra-L0 compaction if the total L0 size is small compared to Lbase (#12214). Users with atomic_flush=true are more likely to see the impact of this change.
Bug Fixes
- Fixed a data race in `DBImpl::RenameTempFileToOptionsFile`.
- Fixed some perf context statistics errors in the write path, including: missing write_memtable_time in unordered_write; missing write_memtable_time in PipelineWrite when the writer state is STATE_PARALLEL_MEMTABLE_WRITER; and missing write_delay_time when calling DelayWrite in the WriteImplWALOnly function.
- Fixed a bug that can, under rare circumstances, cause MultiGet to return an incorrect result for a duplicate key in a MultiGet batch.
- Fix a bug where older data of an ingested key can be returned for read when universal compaction is used
RocksDB 8.11.3
8.11.3 (2024-02-27)
- Correct CMake Javadoc and source jar builds
8.11.2 (2024-02-16)
- Update zlib to 1.3.1 for Java builds
8.11.1 (2024-01-25)
Bug Fixes
- Fix a bug where older data of an ingested key can be returned for read when universal compaction is used
- Apply appropriate rate limiting and priorities in more places.
8.11.0 (2024-01-19)
New Features
- Added new statistics: `rocksdb.sst.write.micros` measures the time of each write to an SST file; `rocksdb.file.write.{flush|compaction|db.open}.micros` measure the time of each write to an SST table (currently only block-based table format) and blob file for flush, compaction, and db open.
Public API Changes
- Added another enumerator, `kVerify`, to the enum class `FileOperationType` in listener.h. Update your `switch` statements as needed.
- Added CompressionOptions to the CompressedSecondaryCacheOptions structure to allow users to specify library-specific options when creating the compressed secondary cache.
- Deprecated several options: `level_compaction_dynamic_file_size`, `ignore_max_compaction_bytes_for_input`, `check_flush_compaction_key_order`, `flush_verify_memtable_count`, `compaction_verify_record_count`, `fail_if_options_file_error`, and `enforce_single_del_contracts`.
- Exposed the options ttl via the C API.
Behavior Changes
- `rocksdb.blobdb.blob.file.write.micros` expands to also measure the time spent writing the header and footer. Therefore the COUNT may be higher and values may be smaller than before. For stacked BlobDB, it no longer measures the time of explicitly flushing the blob file.
- Files will be compacted to the next level if the data age exceeds periodic_compaction_seconds, except for the last level.
- Reduced the compaction debt ratio trigger for scheduling parallel compactions
- For leveled compaction with default compaction pri (kMinOverlappingRatio), files marked for compaction will be prioritized over files not marked when picking a file from a level for compaction.
Bug Fixes
- Fixed a bug where auto_readahead_size combined with IndexType::kBinarySearchWithFirstKey could cause a failure or the iterator landing on a wrong key.
- Fixed some cases in which DB file corruption was detected but ignored on creating a backup with BackupEngine.
- Fixed bugs where `rocksdb.blobdb.blob.file.synced` includes blob files that failed to be synced and `rocksdb.blobdb.blob.file.bytes.written` includes blob bytes that failed to be written.
- Fixed a possible memory leak or crash on a failure (such as an I/O error) in automatic atomic flush of multiple column families.
- Fixed some cases of in-memory data corruption using mmap reads with `BackupEngine`, `sst_dump`, or `ldb`.
- Fixed issues with the experimental `preclude_last_level_data_seconds` option that could interfere with expected data tiering.
- Fixed the handling of the edge case when all existing blob files become unreferenced. Such files are now correctly deleted.