[scrubber] phase1: add scrub manager#395
Codecov Report ❌

```
@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
- Coverage   63.15%   55.37%    -7.79%
==========================================
  Files          32       39        +7
  Lines        1900     6852     +4952
  Branches      204      903      +699
==========================================
+ Hits         1200     3794     +2594
- Misses        600     2638     +2038
- Partials     100       420      +320
```
Add comprehensive scrub infrastructure to detect data corruption and inconsistencies across replicas in HomeObject. This is phase 1 of the scrubber implementation.

- Implements deep and shallow scrubbing for PG metadata, shards, and blobs
- Supports periodic and manual scrub triggering modes
- Uses a priority queue (MPMCPriorityQueue) for scrub task scheduling
- Persists scrub metadata using superblocks to track last scrub times
- Coordinates scrub operations across all replicas in a PG

Scrub modes:
1. **Deep Scrub**: full data integrity verification
   - PG metadata validation
   - Shard existence and consistency checks
   - Blob hash verification (reads data and computes checksums)
   - Detects corrupted, missing, and inconsistent data across replicas
2. **Shallow Scrub**: lightweight metadata-only verification
   - Shard existence checks
   - Blob index validation (no data reads)
   - Faster execution for routine checks

RPC protocol:
- FlatBuffer-based serialization for scrub requests and responses
- Leader sends scrub requests to all replicas
- Followers return scrub maps with their local state
- Retry logic with configurable timeouts for reliability

Scrub reports:
- **ShallowScrubReport**: tracks missing shards and blobs per peer
- **DeepScrubReport**: extends the shallow report with:
  - Corrupted blobs/shards with error details
  - Inconsistent blobs (different hashes across replicas)
  - Corrupted PG metadata

Batching:
- Scrubs data in configurable ranges to avoid timeouts
- Shard range: 2M shards per request
- Blob range: based on HDD IOPS for deep scrub, 2M for shallow
- Early cancellation support for graceful shutdown

Tests:
1. **DeepScrubTest**: verifies detection of:
   - Missing blobs on followers
   - Missing shards on followers
   - Corrupted blob data (IO errors)
   - Inconsistent blob hashes across replicas
2. **MPMCPriorityQueue tests**: lock-free queue validation
   - Concurrent push/pop operations
   - Priority ordering verification
   - Thread safety under contention
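As a rough illustration of the scheduling model described above, here is a minimal sketch of a priority-ordered scrub task queue. The real code uses a lock-free MPMCPriorityQueue; the `ScrubTask`/`next_pg` names and the `std::priority_queue` stand-in below are purely illustrative, ordered by task id as the phase-1 queue is.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical sketch (not the PR's actual API): one pending scrub task per
// PG, popped in priority order by the scheduler thread.
struct ScrubTask {
    uint32_t pg_id;
    bool     is_deep;   // deep scrub reads blob data and verifies hashes
    uint64_t task_id;   // monotonically increasing submission id
};

// Smaller task_id first; std::priority_queue pops the "largest" element,
// so the comparison is inverted.
struct TaskLess {
    bool operator()(const ScrubTask& a, const ScrubTask& b) const {
        return a.task_id > b.task_id;
    }
};

using ScrubQueue = std::priority_queue<ScrubTask, std::vector<ScrubTask>, TaskLess>;

// Pop the next PG to scrub.
uint32_t next_pg(ScrubQueue& q) {
    uint32_t pg = q.top().pg_id;
    q.pop();
    return pg;
}
```

Phase 2 is expected to replace the task-id ordering with a richer comparator (e.g. time since last scrub), per the review discussion below.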
xiaoxichen
left a comment
The FBS part:
FlatBuffers handles default and skipped non-present fields, avoiding transmitting those fields on the wire; on read, either the default or null is returned. With this, it is more natural to use a flat structure.
For example
```
table BlobScrubResult {
    key: BlobKey;
    code: ResultCode = 0;
    hash: [ubyte] = null;
}
```
It can be used for both shallow and deep, with negligible overhead.
With this pattern, the FBS can be greatly simplified.
We should have only one request type across the board:
```
ShardScrubReq {
    scrub_info: ScrubInfo;
    start: uint64;
    end: uint64;
    isDeep: bool;
}
```
and probably also one response type (if not, at least one response type per class (blob/shard/pg)).
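A sketch of what a single unified response could look like under this suggestion (the table and field names here are illustrative, not from the PR):

```
// One result type serves both shallow and deep scrub: `hash` stays null
// (and off the wire) for shallow scrub, and is populated only for deep scrub.
table BlobScrubResult {
    key: BlobKey;
    code: ResultCode = 0;
    hash: [ubyte] = null;
}

table BlobScrubResp {
    scrub_info: ScrubInfo;
    results: [BlobScrubResult];
}
```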
```
table BlobScrubReq {
    scrub_info: ScrubInfo;
    start: uint64;
```

uint64 is used for the Req; however, {ShardID, BlobID} is returned in the Response. Can you share more about how this works? It seems like they need to match each other.
When scrubbing blobs in a PG:
1. I get the last blob id, so I know the entire blob range in this PG is [0, last_blob_id) at the moment. I don't care about the shard id here.
2. When handling a blob scrub req for a specific blob range [start, end], I only compare the blob_id in the blob_route_key ({ShardID, BlobID}). If the blob_id is in the range [start, end], this is a blob in the blob range of this scrub req, and the whole blob_route_key is returned, which includes the shard_id.
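A minimal sketch of that comparison (types and names are hypothetical; the real code walks the index btree rather than a vector):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the filtering described above: the index key is
// {shard_id, blob_id}; the scrub request only carries a blob_id range, so the
// handler keeps any key whose blob_id falls in [start, end), regardless of
// which shard it belongs to, and returns the full key.
struct BlobRouteKey {
    uint64_t shard_id;
    uint64_t blob_id;
};

std::vector<BlobRouteKey> filter_range(const std::vector<BlobRouteKey>& index,
                                       uint64_t start, uint64_t end) {
    std::vector<BlobRouteKey> out;
    for (const auto& k : index) {
        if (k.blob_id >= start && k.blob_id < end) out.push_back(k);
    }
    return out;
}
```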
But within the index, the btree is ordered by <shard_id, blob_id>, so there is no natural ordering based on <blob_id>, which means you would need to go through the whole btree for each request.
If this is the case, performance-wise it is not acceptable.
I considered the performance impact of this behavior when writing this code. IMO:
1. Since the btree resides on NVMe SSD, which has extremely low random IO latency and high IOPS, walking through the whole index btree will not take too much time.
2. Scrubbing is not a latency-sensitive op; a little higher latency is acceptable compared to the latency of the RPC call and the IO on HDD. So, on my end, the latency of walking through the whole btree is acceptable.
3. Implementation-wise, using only blob_id to specify the blob range makes the whole logic very simple, straightforward and accurate. Consider how to specify a blob range accurately. For example, suppose the blob range size is 1000 for a scrub req. If we use [{shard_id_1, blob_id_1}, {shard_id_n, blob_id_n}] to specify a blob range, then when the leader creates the blob scrub req, it should:
   1. Exclude all the open shards between shard_id_1 and shard_id_n. If some shard state is wrong (should be sealed but is open; shard meta blk bitrot can lead to this), then we will wrongly skip scrubbing all the blobs of that shard.
   2. Exclude all the sealed shards whose sealed_lsn is larger than the current commit lsn.
   3. Add blobs one by one until the total count reaches the range size. This step will also be very time-consuming, since we need to get the blob count of each shard, which involves a lot of single queries.
   4. The worst case: if there is a shard_id_0 that is missing on the leader, all the blobs in shard_id_0 will never be scrubbed, since the leader cannot see this shard. In this case, a missing shard on the leader can lead to the loss of blob scrubbing for that shard.
And there are still many other cases that would make the process of generating the blob range very complicated.
In your previous calculation, worst case, if a PG is filled up by 8KB blobs (1TB / 8KB = 128 million records), walking through the btree takes ~100s even assuming everything is in memory (a typical CPU core can do 1M O(1) ops per second).
Then, if we think scrubbing is not latency sensitive, and acknowledge the complexity of multiple ranges, can we limit each round trip to within one shard? Yes, it will definitely involve more round-trips, but in 1s we can do 30 rounds without any CPU pressure.
> typical CPU core can do 1M O(1) ops per second

I doubt this; a typical CPU core should be much faster than that, in my mind. Following is the answer from GPT:

For a "typical" modern CPU core, a plain O(1) uint64 comparison (a < b, a == b) can be done on the order of hundreds of millions to a few billions per second, not just 1 million. A practical rule of thumb (single core, ~3 GHz):
- Best case (register-to-register compare, tight loop, branchless): ~1–3 × 10⁹ comparisons/sec/core (billions/sec)
- Common case (streaming through arrays, mostly L1/L2 hits): ~3 × 10⁸ to 1 × 10⁹ comparisons/sec/core
- Worst case (random memory access, frequent cache misses): can drop to ~1 × 10⁷ to 1 × 10⁸ comparisons/sec/core (tens of millions/sec), because you're waiting on memory, not the compare instruction.

So the compare itself is extremely cheap; performance is usually dominated by memory access, branching/predictability, and loop overhead.
> we limit each round trip within one shard

1. In the worst case, if there is only one blob in each shard, we will have a huge number of round-trips for 128 million blobs. So if there are too many small shards, we will have to schedule too many round trips.
2. If one shard is lost (shard meta blk is lost) on the leader, all the blobs in that shard will not be scrubbed.

Actually, this is why I proposed the idea of scrubbing the entire PG vchunk by vchunk in my earlier design. That would solve all the problems we meet here: a vchunk limits the number of shards and blobs to an acceptable batch size in a single scrub req, and the number of round trips for a PG scrub is fixed, since the number of vchunks in a PG is fixed.
The only concern with vchunk scrubbing is random IO on HDD for each single blob:
2GB data in a chunk, 4K (min blob size) random reads, 7200 RPM HDD ~200 IOPS

```
// Total read operations: 2GB / 4KB = 524,288 reads
// Estimated time: 524,288 / 200 = 2,621 seconds ≈ 44 minutes
```

If we sort all blobs in the vchunk by PBA to mimic a sequential read, it seems this would significantly reduce the time consumed? Do we need to change back to vchunk-based scrubbing?
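For what it's worth, re-running the arithmetic above as a quick sanity check: 2GB / 4KB is 524,288 reads, and at 200 IOPS that is ~2,621 seconds, which is what matches the ≈44-minute figure.

```cpp
#include <cassert>
#include <cstdint>

// 2GB chunk of 4KB minimum-size blobs, read randomly on a ~200 IOPS HDD.
constexpr uint64_t kChunkBytes = 2ULL * 1024 * 1024 * 1024;  // 2 GB
constexpr uint64_t kBlobBytes  = 4ULL * 1024;                // 4 KB min blob
constexpr uint64_t kHddIops    = 200;                        // 7200 RPM HDD

constexpr uint64_t kTotalReads = kChunkBytes / kBlobBytes;   // 524,288 reads
constexpr uint64_t kSeconds    = kTotalReads / kHddIops;     // ~2,621 s
constexpr uint64_t kMinutes    = kSeconds / 60;              // ~43–44 minutes
```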
I don't want to argue with GPT, as the btree access pattern is far different from pure memory access, and there is locking involved. You can do a POC to prove how long it takes to traverse a 128-million-record btree; we can discuss based on that number.
- Yes, it is super slow, but there is no CPU/MEM/IO pressure.
- We check shard consistency before going into the blob consistency check; if there is a shard missing, we know it there.
- I am fine with using vchunk-based scrubbing; however, it fully depends on how we break one task (a full vchunk) into smaller batches on the follower side. We cannot submit 256K IOs to the disk at once, even if they are sequential: letting the disk read at 200MB/s for 10s will add up to 10s of latency to other requests, especially writes. A reasonable batch might be 4MB/8MB (20–40ms added) with a proper sleep in between.
So the bounds we are solving for are:
- Maximum batch size (regardless of whether it is shard or vchunk based): 256K blobs (8K per blob).
- Maximum shard count: literally infinite after we remove shard_header from disk; a reasonable upper limit is 256K.

Combining both, we need to 1) split a huge task (shard/vchunk) into small units to avoid impacting foreground traffic; 2) similar to 1), keep a reasonable execution time per task to allow timeout-based retry; 3) effectively consolidate K shards into a single round trip, for fewer messages.
Tools we have:
- The flatbuffer is efficient for default values; we can have {shard_id, blobid_low, blobid_high} as a task unit.
- We can carry a list of task units in a single request.
- Potentially, if we let the leader scrub first, it knows the size of each blob, so it can control the size per req to the best extent.
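A hedged sketch of such a task unit and the batch-splitting it enables (the `TaskUnit`/`split` names and exclusive-high convention are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical task unit: a blob id range within one shard. A request would
// carry a list of these, consolidating K shards in a single round trip while
// keeping each unit a small, retryable batch.
struct TaskUnit {
    uint64_t shard_id;
    uint64_t blob_lo;   // inclusive
    uint64_t blob_hi;   // exclusive
};

// Split one large unit into batches of at most `batch` blobs each, so a huge
// task never blocks foreground traffic for long.
std::vector<TaskUnit> split(const TaskUnit& u, uint64_t batch) {
    std::vector<TaskUnit> out;
    for (uint64_t lo = u.blob_lo; lo < u.blob_hi; lo += batch) {
        uint64_t hi = (lo + batch < u.blob_hi) ? lo + batch : u.blob_hi;
        out.push_back({u.shard_id, lo, hi});
    }
    return out;
}
```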
```cpp
using BlobHashArray = std::array< uint8_t, blob_max_hash_len >;
using chunk_id_t = homestore::chunk_num_t;
// TODO: persist this into metablk.
inline static atomic_uint64_t scrub_task_id{1};
```

It is subject to design, but a UUID usually prevents collision.
```cpp
LOGINFOMOD(scrubmgr, "submit a scrub task for pg={}, deep_scrub:{}", pg_id, is_deep);

// Check if a scrub task is already running for this PG
// Note: There's still a small race window between this check and task execution in handle_pg_scrub_task,
```
There is no expectation that the Q will be empty most of the time, i.e. no expectation that a task will turn into the running state soon after it is added to the Q. Quite possibly, if there are some deep-scrub tasks, they will block the Q for a long time.
With this setting, it is NOT a small racing window: duplicate tasks for the same PG can be added to the Q repeatedly until the first task gets executed.
What is worse, because the Q is ordered by task_id instead of last_scrubbed_timestamp, the repeated tasks, if any are not yet popped, will have the highest priority.
For example:

```
T0:   PG1 deep scrub, PG2 deep scrub
T1:   PG3 added (tid:3), PG4 added (tid:4)
T2:   PG3 added (tid:5), PG4 added (tid:6)
...
T10:  PG5 added (tid:21)
...
T100: PG3 picked (tid=3), deep-scrubbing
T101: PG4 picked (tid=4), deep-scrubbing
...
T201: PG3 picked again (tid=5)
```

Potentially an OOM risk as well.
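One way to picture the de-duplication being asked for here, as a minimal sketch (the `PendingSet` name is hypothetical; the actual fix may rely on PG state or queue inspection instead):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical sketch: track which PGs already have a queued or running scrub
// task, so repeated submissions for the same PG are rejected until the first
// task completes. This bounds the queue at one entry per PG.
class PendingSet {
    std::unordered_set<uint32_t> pending_;
public:
    // Returns true if the task was accepted, false if one is already pending.
    bool try_submit(uint32_t pg_id) { return pending_.insert(pg_id).second; }
    // Called when the scrub task for this PG finishes (or is cancelled).
    void on_task_done(uint32_t pg_id) { pending_.erase(pg_id); }
};
```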
From line 557 to line 569:

```cpp
// Check if pg_state is HEALTHY (state must be 0)
if (!force) {
    const auto current_state = hs_pg->pg_state_.get();
    if (current_state != 0) {
        LOGWARNMOD(scrubmgr, "pg={} is not in HEALTHY state (current_state={}), cannot submit scrub task!", pg_id,
                   current_state);
        return folly::makeFuture(std::shared_ptr< ScrubManager::ShallowScrubReport >(nullptr));
    }
    // Set SCRUBBING state
    hs_pg->pg_state_.set_state(PGStateMask::SCRUBBING);
    LOGINFOMOD(scrubmgr, "set SCRUBBING state for pg={}", pg_id);
}
```

Before pushing a scrub task onto the task queue for a specific PG, I check the state of that PG. If the PG's state is not HEALTHY-only (i.e., it has a state such as BASELINE_RESYNC or SCRUBBING), I skip scrubbing it.
If the state is HEALTHY-only, I add the SCRUBBING state to the PG.
Then, if we try to add a duplicate scrub task for this PG, the state check will fail. This guarantees there is at most one scrub task per PG, either pending (in the task queue) or being executed. The SCRUBBING state is removed after the scrub task completes.
I don't implement automatic scheduling for PG scrub in phase 1; at the moment, all PG scrubs are triggered manually via the HTTP API.
In phase 2, I will add scheduling logic for automatic scrubbing and change the compare function (operator<).
Please do not set the PG state to SCRUBBING if it is not actually in the middle of scrubbing. It is misleading.
Sure, I will add a new state, pending_scrubbing, later.
The state will be populated to CM and then to the dashboard. The problem is that we need de-dup to ensure there are no duplicates in the Q; the implementation should avoid using state as much as possible, to limit the change within HO.
If a PG has one of the following states, we should not scrub it: DISK_DOWN, BASELINE_RESYNC, RESYNCING, REPAIR. So PG states are useful for scrubbing.
Coming to the SCRUBBING state, I suggest using it like this for now, even if it is a little confusing. We can later have a separate PR to add a separate state describing the pending-scrubbing state, with the CM change if necessary.
Feel free to READ any PG state, but we need a strong reason to WRITE a PG state, as it is externally visible.
I would prefer we change it now, by moving the setting of SCRUBBING to the start of the scrubbing task. This ensures external correctness. Then, with the assumption that we don't have automatic PG scrub scheduling, the racing window is small enough because:
- the Q is expected to be short;
- only a human can trigger a scrub via the API.
With that reasoning, we can add a FIXME and address the de-dup question in the next PR.
```cpp
// TODO:: add more logic to decide the priority between two tasks after we introduce more logic for
// automatic schedule, the following are some criteria we can consider:
/*
2. Time Since Last Scrub
```
Hope this can be the default policy
sure, we can add more logic to operator< to achieve this goal.
```cpp
const auto& blob_id = *r_cast< blob_id_t const* >(key.cbytes());
LOGD("Received del_blob pre-commit for pg={}, shard=0x{:x}, blob_id={}, lsn={}", pg_id, shard_id, blob_id, lsn);

if (scrub_mgr_) { scrub_mgr_->add_pg_deleted_blob(pg_id, {shard_id, blob_id}, lsn); }
```
- All deleted blobs will accumulate in memory; the memory pressure is concerning.
- I am not getting the point of paying extra overhead on every delete (which is executed in the single-threaded state machine) to trade off a low-chance re-verification when a mismatch happens.
- It is possible to reconcile on each batch (blob id ranges) instead of reconciling at the end of the scrub task; the memory concern can be mitigated, however concern #2 is still there.
- I really don't want to think about possible rollback/BR scenarios.

Suggestion:
The racing only happens when the deleted blobID collides with the current scrubbing blob range, which is a relatively low chance. We can reconcile by spot-checking this blobID across the 3 replicas, either at the end of each batch, or at the end of the whole task (preferred, as the delay helps with consistency).
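A minimal sketch of the spot-check being suggested (the `SpotCheck` name and half-open range convention are hypothetical): during the task, remember only the deleted blob ids that collide with the range currently being scrubbed; at the end of the task, re-verify just those ids across the replicas.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical sketch: only deletions that collide with the active scrub
// range need re-verification, so the tracked set stays small compared to
// recording every deleted blob.
class SpotCheck {
    std::set<uint64_t> suspects_;
public:
    // Called on each blob deletion; track only ids within the active range.
    void on_delete(uint64_t blob_id, uint64_t range_start, uint64_t range_end) {
        if (blob_id >= range_start && blob_id < range_end) suspects_.insert(blob_id);
    }
    // Ids to re-verify across replicas at end of batch / end of task.
    const std::set<uint64_t>& suspects() const { return suspects_; }
};
```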
1. The extra overhead is very small and negligible; it is a memory-only op, just adding a blob_id into a map.
2. If rollback/BR happens, the scrub task will be cancelled and the task context deleted, so we don't need any rollback for this. After the scrub task is completed, on_blob_delete_pre_commit will do nothing.
3. The only possibly concerning part is memory pressure. We can have a strategy: if too many blobs (exceeding a pre-set limit) are deleted during PG scrubbing, we cancel this scrub task and redo it; then the occupied memory is freed when the scrub task context is deleted.
4. I also considered reconciling at the end of the whole task. However, since the commit lsn differs between replicas, this will not be that accurate: the delete_blob log has probably been committed on some replicas, but not on others.

Let's think about this again.
If a deletion happens at LSN X, and you check at LSN Y (after the whole task):
- If all members have deleted it, then we are good; you won't care whether at some LSN X', with X < X' < Y, there was an inconsistency.
- If a member still has the blob, then it is problematic for any X' < Y.

In both cases, checking at the end is correct.
> If there is a member still have the blob, then it is problematic for any X' < Y.

I do not follow this. Do you mean that when checking at LSN Y, all the members should wait until LSN Y is committed, and then see if they have the blob locally?
No, I just mean that if a blob still exists at the end of the task, it existed at any time during the task... and sleep/retry will solve 99.9999% of problems.
ok, got it, I will make this change
```cpp
uint64_t last_deep_scrub_timestamp;
uint64_t last_shallow_scrub_timestamp;
```

Can this be added into the PG superblk?
This needs a version upgrade of the PG meta blk, since it is backward incompatible. We can use a separate PR to do this.
Do you want to land that PR prior to the scrubber, or after the scrubber lands?
```cpp
// pg scrub superblk
#pragma pack(1)
struct pg_scrub_superblk {
```

I think the PG superblk is more for recovery of in-progress scrubbing, mostly the deep scrub. It is fine if we forget other stages (like shard scrubbing), but once in the blob scrubbing stage, progress should be persisted (optionally every N batches) to allow restoration.
I suggest not recording the progress of a scrub task; just like ceph, a new scrub task will be scheduled.
In HomeObject, a task is cancelled if any error happens during scrubbing, for example a leader change.
I am not sure ceph is a good example, as we don't have a success story of doing deep scrub once a month on 4TB drives, and now we are running on 20TB drives...
Anyway, fine to postpone it until necessary.
Another general comment: can you turn your design doc into an ADR and push it with the code? The ADR can then be updated as the code is implemented.
This PR implements the framework and basic logic of the scrubber, including:
1. thread model
2. scrubber RPC
3. local scrub: deep and shallow scrub for PG, shard and blob

Later PRs will:
1. add an HTTP interface to trigger PG scrub
2. add more UTs to cover more cases
3. do optimization