Add corruption guards against hardware bit flips#14354
Add corruption guards against hardware bit flips#14354xingbowang wants to merge 1 commit intofacebook:mainfrom
Conversation
9bcc95a to
b72a991
Compare
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D93868104. |
Summary: Investigation of a production SST corruption revealed that a single hardware bit flip (bit 32) in a value Slice's size_ field during compaction caused a 5.6MB value to be treated as 4.3GB. The corruption was silent because: - BlockBuilder's varint encoding uses static_cast<uint32_t>(value.size()), which truncated the corrupted size back to the correct value - buffer_.append(value.data(), value.size()) used the full 64-bit corrupted size, appending 4GB of adjacent heap memory into the block - The block checksum was computed over the corrupted data, so it matched - paranoid_file_checks was disabled This change adds 2 layers of defense: 1. **Value size uint32 truncation guard**: In BlockBasedTableBuilder::Add(), detect when value.size() exceeds uint32_t range before passing it to BlockBuilder where the varint encoding would silently truncate it. Returns Status::Corruption so the flush/compaction can abort gracefully without crashing the process. This applies to both flush and compaction paths since all key-value pairs flow through BlockBasedTableBuilder::Add(). 2. **Compaction output/input size ratio check**: New option max_compaction_output_to_input_ratio (default: 10, 0 to disable) in MutableCFOptions. After compaction, if total output size exceeds this ratio times total input size, return Status::Corruption. This catches cases where corrupted values inflate the output file far beyond what input data justifies. This check applies to compaction only (not flush), since flush writes from a memtable and does not have input files for comparison. However, flush output is still protected by guards facebook#1 and facebook#2. Additionally, both the flush loop (db/builder.cc) and the compaction output loop (db/compaction/compaction_outputs.cc) now check builder->status() after each Add() call. Previously, a corruption error set inside the builder was only surfaced when Finish() was called, causing the loop to continue iterating through potentially millions of keys doing wasted work. Test Plan: Unit test
b72a991 to
de62d01
Compare
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D93868104. |
|
|
||
| // Count consecutive transient in-memory corruption errors (e.g., hardware | ||
| // bit flips). On first occurrence, allow retry. On second, escalate to fatal. | ||
| int transient_corruption_retry_count_ = 0; |
There was a problem hiding this comment.
So if we run into this corruption in one compaction then while retrying if we detect another corruption from another compaction (or flush), we will give up I guess (even though retry may go through for both).
| if (log_io_s.ok()) { | ||
| if (ShouldRetryTransientCorruption(s, cfd->GetName().c_str())) { | ||
| // Transient corruption: memtable has been rolled back and | ||
| // imm_flush_needed is set, so the flush will be rescheduled. |
There was a problem hiding this comment.
Asking for my own understanding - where are we setting imm_flush_needed back to true? Or do we call RollbackMemtableFlush() in this flow?
| // Default: 10 (output cannot exceed 10x input size) | ||
| // | ||
| // Dynamically changeable through SetOptions() API. | ||
| uint64_t max_compaction_output_to_input_ratio = 10; |
There was a problem hiding this comment.
nit: I wonder if we want to prevent users from setting this to 1 inadvertently.
Summary:
Investigation of a production SST corruption revealed that a single hardware bit flip (bit 32) in a value Slice's size_ field during compaction caused a 5.6MB value to be treated as 4.3GB. The corruption was silent because:
This change adds 2 layers of defense:
BlockBuilder uint32 truncation guard: In AddWithLastKeyImpl, detect when value.size() exceeds uint32_t range before the varint encoding silently truncates it. This catches bit flips in bits 32-63 and aborts immediately, preventing the massive buffer_.append from occurring. This applies to both flush and compaction paths since all key-value pairs flow through BlockBuilder::AddWithLastKeyImpl.
Compaction output/input size ratio check: New option max_compaction_output_to_input_ratio (default: 10, 0 to disable) in MutableCFOptions. After compaction, if total output size exceeds this ratio times total input size, return Status::Corruption. This catches cases where corrupted values inflate the output file far beyond what input data justifies. This check applies to compaction only (not flush), since flush writes from a memtable and does not have input files for comparison. However, flush output is still protected by guards Miss Spelling in README #1 and OSX compilation #2.
Test Plan:
Unit test