Add corruption guards against hardware bit flips by xingbowang · Pull Request #14354 · facebook/rocksdb

xingbowang · 2026-02-19T20:25:42Z

Summary:
Investigation of a production SST corruption revealed that a single hardware bit flip (bit 32) in a value Slice's size_ field during compaction caused a 5.6MB value to be treated as 4.3GB. The corruption was silent because:

BlockBuilder's varint encoding uses static_cast<uint32_t>(value.size()), which truncated the corrupted size back to the correct value
buffer_.append(value.data(), value.size()) used the full 64-bit corrupted size, appending 4GB of adjacent heap memory into the block
The block checksum was computed over the corrupted data, so it matched
Compression was bypassed (block exceeded kCompressionSizeLimit)
paranoid_file_checks was disabled

This change adds 2 layers of defense:

BlockBuilder uint32 truncation guard: In AddWithLastKeyImpl, detect when value.size() exceeds uint32_t range before the varint encoding silently truncates it. This catches bit flips in bits 32-63 and aborts immediately, preventing the massive buffer_.append from occurring. This applies to both flush and compaction paths since all key-value pairs flow through BlockBuilder::AddWithLastKeyImpl.
Compaction output/input size ratio check: New option max_compaction_output_to_input_ratio (default: 10, 0 to disable) in MutableCFOptions. After compaction, if total output size exceeds this ratio times total input size, return Status::Corruption. This catches cases where corrupted values inflate the output file far beyond what input data justifies. This check applies to compaction only (not flush), since flush writes from a memtable and does not have input files for comparison. However, flush output is still protected by guards Miss Spelling in README #1 and OSX compilation #2.

Test Plan:
Unit test

meta-codesync · 2026-02-20T15:22:20Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D93868104.

Summary: Investigation of a production SST corruption revealed that a single hardware bit flip (bit 32) in a value Slice's size_ field during compaction caused a 5.6MB value to be treated as 4.3GB. The corruption was silent because: - BlockBuilder's varint encoding uses static_cast<uint32_t>(value.size()), which truncated the corrupted size back to the correct value - buffer_.append(value.data(), value.size()) used the full 64-bit corrupted size, appending 4GB of adjacent heap memory into the block - The block checksum was computed over the corrupted data, so it matched - paranoid_file_checks was disabled This change adds 2 layers of defense: 1. **Value size uint32 truncation guard**: In BlockBasedTableBuilder::Add(), detect when value.size() exceeds uint32_t range before passing it to BlockBuilder where the varint encoding would silently truncate it. Returns Status::Corruption so the flush/compaction can abort gracefully without crashing the process. This applies to both flush and compaction paths since all key-value pairs flow through BlockBasedTableBuilder::Add(). 2. **Compaction output/input size ratio check**: New option max_compaction_output_to_input_ratio (default: 10, 0 to disable) in MutableCFOptions. After compaction, if total output size exceeds this ratio times total input size, return Status::Corruption. This catches cases where corrupted values inflate the output file far beyond what input data justifies. This check applies to compaction only (not flush), since flush writes from a memtable and does not have input files for comparison. However, flush output is still protected by guards facebook#1 and facebook#2. Additionally, both the flush loop (db/builder.cc) and the compaction output loop (db/compaction/compaction_outputs.cc) now check builder->status() after each Add() call. Previously, a corruption error set inside the builder was only surfaced when Finish() was called, causing the loop to continue iterating through potentially millions of keys doing wasted work. Test Plan: Unit test

meta-codesync · 2026-02-20T17:19:41Z

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D93868104.

jaykorean · 2026-02-20T23:25:18Z

db/db_impl/db_impl.h


+  // Count consecutive transient in-memory corruption errors (e.g., hardware
+  // bit flips). On first occurrence, allow retry. On second, escalate to fatal.
+  int transient_corruption_retry_count_ = 0;


So if we run into this corruption in one compaction then while retrying if we detect another corruption from another compaction (or flush), we will give up I guess (even though retry may go through for both).

jaykorean · 2026-02-20T23:43:30Z

db/db_impl/db_impl_compaction_flush.cc

-    if (log_io_s.ok()) {
+    if (ShouldRetryTransientCorruption(s, cfd->GetName().c_str())) {
+      // Transient corruption: memtable has been rolled back and
+      // imm_flush_needed is set, so the flush will be rescheduled.


Asking for my own understanding - where are we setting imm_flush_needed back to true? Or do we call RollbackMemtableFlush() in this flow?

jaykorean · 2026-02-20T23:44:07Z

include/rocksdb/advanced_options.h

+  // Default: 10 (output cannot exceed 10x input size)
+  //
+  // Dynamically changeable through SetOptions() API.
+  uint64_t max_compaction_output_to_input_ratio = 10;


nit: I wonder if we want to prevent users from setting this to 1 inadvertently.

meta-cla bot added the CLA Signed label Feb 19, 2026

xingbowang force-pushed the 2026_02_18_corruption_622561 branch 3 times, most recently from 9bcc95a to b72a991 Compare February 20, 2026 15:20

xingbowang marked this pull request as ready for review February 20, 2026 15:35

xingbowang force-pushed the 2026_02_18_corruption_622561 branch from b72a991 to de62d01 Compare February 20, 2026 17:19

jaykorean reviewed Feb 20, 2026

View reviewed changes

hx235 self-requested a review February 21, 2026 01:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Add corruption guards against hardware bit flips#14354

Add corruption guards against hardware bit flips#14354
xingbowang wants to merge 1 commit intofacebook:mainfrom
xingbowang:2026_02_18_corruption_622561

xingbowang commented Feb 19, 2026 •

edited

Loading

Uh oh!

meta-codesync bot commented Feb 20, 2026

Uh oh!

meta-codesync bot commented Feb 20, 2026

Uh oh!

jaykorean Feb 20, 2026

Uh oh!

jaykorean Feb 20, 2026

Uh oh!

jaykorean Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

xingbowang commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

meta-codesync bot commented Feb 20, 2026

Uh oh!

meta-codesync bot commented Feb 20, 2026

Uh oh!

jaykorean Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

jaykorean Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

jaykorean Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

xingbowang commented Feb 19, 2026 •

edited

Loading