[scrubber] phase1: add scrub manager#395
Codecov Report ❌

```
@@            Coverage Diff             @@
##             main     #395      +/-   ##
==========================================
- Coverage   63.15%   55.37%    -7.79%
==========================================
  Files          32       39        +7
  Lines        1900     6852     +4952
  Branches      204      903      +699
==========================================
+ Hits         1200     3794     +2594
- Misses        600     2638     +2038
- Partials     100       420      +320
```
Add comprehensive scrub infrastructure to detect data corruption and inconsistencies across replicas in HomeObject. This is phase 1 of the scrubber implementation.

- Implements deep and shallow scrubbing for PG metadata, shards, and blobs
- Supports periodic and manual scrub triggering modes
- Uses a priority queue (MPMCPriorityQueue) for scrub task scheduling
- Persists scrub metadata using superblocks to track last scrub times
- Coordinates scrub operations across all replicas in a PG

Scrub modes:
1. **Deep Scrub**: full data integrity verification
   - PG metadata validation
   - Shard existence and consistency checks
   - Blob hash verification (reads data and computes checksums)
   - Detects corrupted, missing, and inconsistent data across replicas
2. **Shallow Scrub**: lightweight metadata-only verification
   - Shard existence checks
   - Blob index validation (no data reads)
   - Faster execution for routine checks

RPC protocol:
- FlatBuffer-based serialization for scrub requests and responses
- Leader sends scrub requests to all replicas
- Followers return scrub maps with their local state
- Retry logic with configurable timeouts for reliability

Scrub reports:
- **ShallowScrubReport**: tracks missing shards and blobs per peer
- **DeepScrubReport**: extends the shallow report with:
  - Corrupted blobs/shards with error details
  - Inconsistent blobs (different hashes across replicas)
  - Corrupted PG metadata

Batching:
- Scrubs data in configurable ranges to avoid timeouts
- Shard range: 2M shards per request
- Blob range: based on HDD IOPS for deep scrub, 2M for shallow
- Early cancellation support for graceful shutdown

Tests:
1. **DeepScrubTest**: verifies detection of:
   - Missing blobs on followers
   - Missing shards on followers
   - Corrupted blob data (IO errors)
   - Inconsistent blob hashes across replicas
2. **MPMCPriorityQueue tests**: lock-free queue validation
   - Concurrent push/pop operations
   - Priority ordering verification
   - Thread safety under contention
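As a rough illustration of the scheduling model described above, here is a minimal sketch of a priority-ordered scrub task queue. The real code uses a lock-free MPMCPriorityQueue; the `ScrubTask`/`next_pg` names and the `std::priority_queue` stand-in below are purely illustrative, ordered by task id as the phase-1 queue is.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical sketch (not the PR's actual API): one pending scrub task per
// PG, popped in priority order by the scheduler thread.
struct ScrubTask {
    uint32_t pg_id;
    bool     is_deep;   // deep scrub reads blob data and verifies hashes
    uint64_t task_id;   // monotonically increasing submission id
};

// Smaller task_id first; std::priority_queue pops the "largest" element,
// so the comparison is inverted.
struct TaskLess {
    bool operator()(const ScrubTask& a, const ScrubTask& b) const {
        return a.task_id > b.task_id;
    }
};

using ScrubQueue = std::priority_queue<ScrubTask, std::vector<ScrubTask>, TaskLess>;

// Pop the next PG to scrub.
uint32_t next_pg(ScrubQueue& q) {
    uint32_t pg = q.top().pg_id;
    q.pop();
    return pg;
}
```

Phase 2 is expected to replace the task-id ordering with a richer comparator (e.g. time since last scrub), per the review discussion below.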
xiaoxichen
left a comment
The FBS part:
FlatBuffers handles default and skipped non-present fields, avoiding transmitting those fields on the wire; on read, either the default or null is returned. With this, it is more natural to use a flat structure.
For example
```
table BlobScrubResult {
    key: BlobKey;
    code: ResultCode = 0;
    hash: [ubyte] = null;
}
```
It can be used for both shallow and deep, with negligible overhead.
With this pattern, the FBS can be greatly simplified.
We should have only one request type across the board:
```
ShardScrubReq {
    scrub_info: ScrubInfo;
    start: uint64;
    end: uint64;
    isDeep: bool;
}
```
and probably also one response type (if not, at least one response type per class (blob/shard/pg)).
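A sketch of what a single unified response could look like under this suggestion (the table and field names here are illustrative, not from the PR):

```
// One result type serves both shallow and deep scrub: `hash` stays null
// (and off the wire) for shallow scrub, and is populated only for deep scrub.
table BlobScrubResult {
    key: BlobKey;
    code: ResultCode = 0;
    hash: [ubyte] = null;
}

table BlobScrubResp {
    scrub_info: ScrubInfo;
    results: [BlobScrubResult];
}
```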
```
table BlobScrubReq {
    scrub_info: ScrubInfo;
    start: uint64;
```

uint64 is used for the Req; however, {ShardID, BlobID} is returned in the Response. Can you share more about how this works? It seems like they need to match each other.
When scrubbing blobs in a PG:
1. I get the last blob id, so I know the entire blob range in this PG is [0, last_blob_id) at the moment. I don't care about the shard id here.
2. When handling a blob scrub req for a specific blob range [start, end], I only compare the blob_id in the blob_route_key ({ShardID, BlobID}). If the blob_id is in the range [start, end], this is a blob in the blob range of this scrub req, and the whole blob_route_key is returned, which includes the shard_id.
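A minimal sketch of that comparison (types and names are hypothetical; the real code walks the index btree rather than a vector):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the filtering described above: the index key is
// {shard_id, blob_id}; the scrub request only carries a blob_id range, so the
// handler keeps any key whose blob_id falls in [start, end), regardless of
// which shard it belongs to, and returns the full key.
struct BlobRouteKey {
    uint64_t shard_id;
    uint64_t blob_id;
};

std::vector<BlobRouteKey> filter_range(const std::vector<BlobRouteKey>& index,
                                       uint64_t start, uint64_t end) {
    std::vector<BlobRouteKey> out;
    for (const auto& k : index) {
        if (k.blob_id >= start && k.blob_id < end) out.push_back(k);
    }
    return out;
}
```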
But within the index, the btree is ordered by <shard_id, blob_id>, so there is no natural ordering based on <blob_id>, which means you would need to go through the whole btree for each request.
If this is the case, performance-wise it is not acceptable.
I considered the performance impact of this behavior when writing this code. IMO:
1. Since the btree resides on NVMe SSD, which has extremely low random IO latency and high IOPS, walking through the whole index btree will not take too much time.
2. Scrubbing is not a latency-sensitive op; a little higher latency is acceptable compared to the latency of the RPC call and the IO on HDD. So, on my end, the latency of walking through the whole btree is acceptable.
3. Implementation-wise, using only blob_id to specify the blob range makes the whole logic very simple, straightforward and accurate. Consider how to specify a blob range accurately. For example, suppose the blob range size is 1000 for a scrub req. If we use [{shard_id_1, blob_id_1}, {shard_id_n, blob_id_n}] to specify a blob range, then when the leader creates the blob scrub req, it should:
   1. Exclude all the open shards between shard_id_1 and shard_id_n. If some shard state is wrong (should be sealed but is open; shard meta blk bitrot can lead to this), then we will wrongly skip scrubbing all the blobs of that shard.
   2. Exclude all the sealed shards whose sealed_lsn is larger than the current commit lsn.
   3. Add blobs one by one until the total count reaches the range size. This step will also be very time-consuming, since we need to get the blob count of each shard, which involves a lot of single queries.
   4. The worst case: if there is a shard_id_0 that is missing on the leader, all the blobs in shard_id_0 will never be scrubbed, since the leader cannot see this shard. In this case, a missing shard on the leader can lead to the loss of blob scrubbing for that shard.
And there are still many other cases that would make the process of generating the blob range very complicated.
In your previous calculation, worst case, if a PG is filled up by 8KB blobs (1TB / 8KB = 128 million records), walking through the btree takes ~100s even assuming everything is in memory (a typical CPU core can do 1M O(1) ops per second).
Then, if we think scrubbing is not latency sensitive, and acknowledge the complexity of multiple ranges, can we limit each round trip to within one shard? Yes, it will definitely involve more round-trips, but in 1s we can do 30 rounds without any CPU pressure.
> typical CPU core can do 1M O(1) ops per second

I doubt this; a typical CPU core should be much faster than that, in my mind. Following is the answer from GPT:

For a "typical" modern CPU core, a plain O(1) uint64 comparison (a < b, a == b) can be done on the order of hundreds of millions to a few billions per second, not just 1 million. A practical rule of thumb (single core, ~3 GHz):
- Best case (register-to-register compare, tight loop, branchless): ~1–3 × 10⁹ comparisons/sec/core (billions/sec)
- Common case (streaming through arrays, mostly L1/L2 hits): ~3 × 10⁸ to 1 × 10⁹ comparisons/sec/core
- Worst case (random memory access, frequent cache misses): can drop to ~1 × 10⁷ to 1 × 10⁸ comparisons/sec/core (tens of millions/sec), because you're waiting on memory, not the compare instruction.

So the compare itself is extremely cheap; performance is usually dominated by memory access, branching/predictability, and loop overhead.
> we limit each round trip within one shard

1. In the worst case, if there is only one blob in each shard, we will have a huge number of round-trips for 128 million blobs. So if there are too many small shards, we will have to schedule too many round trips.
2. If one shard is lost (shard meta blk is lost) on the leader, all the blobs in that shard will not be scrubbed.

Actually, this is why I proposed the idea of scrubbing the entire PG vchunk by vchunk in my earlier design. That would solve all the problems we meet here: a vchunk limits the number of shards and blobs to an acceptable batch size in a single scrub req, and the number of round trips for a PG scrub is fixed, since the number of vchunks in a PG is fixed.
The only concern with vchunk scrubbing is random IO on HDD for each single blob:
2GB data in a chunk, 4K (min blob size) random reads, 7200 RPM HDD ~200 IOPS

```
// Total read operations: 2GB / 4KB = 524,288 reads
// Estimated time: 524,288 / 200 = 2,621 seconds ≈ 44 minutes
```

If we sort all blobs in the vchunk by PBA to mimic a sequential read, it seems this would significantly reduce the time consumed? Do we need to change back to vchunk-based scrubbing?
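For what it's worth, re-running the arithmetic above as a quick sanity check: 2GB / 4KB is 524,288 reads, and at 200 IOPS that is ~2,621 seconds, which is what matches the ≈44-minute figure.

```cpp
#include <cassert>
#include <cstdint>

// 2GB chunk of 4KB minimum-size blobs, read randomly on a ~200 IOPS HDD.
constexpr uint64_t kChunkBytes = 2ULL * 1024 * 1024 * 1024;  // 2 GB
constexpr uint64_t kBlobBytes  = 4ULL * 1024;                // 4 KB min blob
constexpr uint64_t kHddIops    = 200;                        // 7200 RPM HDD

constexpr uint64_t kTotalReads = kChunkBytes / kBlobBytes;   // 524,288 reads
constexpr uint64_t kSeconds    = kTotalReads / kHddIops;     // ~2,621 s
constexpr uint64_t kMinutes    = kSeconds / 60;              // ~43–44 minutes
```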
I don't want to argue with GPT, as the btree access pattern is far different from pure memory access, and there is locking involved. You can do a POC to prove how long it takes to traverse a 128-million-record btree; we can discuss based on that number.
- Yes, it is super slow, but there is no CPU/MEM/IO pressure.
- We check shard consistency before going into the blob consistency check; if there is a shard missing, we know it there.
- I am fine with using vchunk-based scrubbing; however, it fully depends on how we break one task (a full vchunk) into smaller batches on the follower side. We cannot submit 256K IOs to the disk at once, even if they are sequential: letting the disk read at 200MB/s for 10s will add up to 10s of latency to other requests, especially writes. A reasonable batch might be 4MB/8MB (20–40ms added) with a proper sleep in between.
So the bounds we are solving for are:
- Maximum batch size (regardless of whether it is shard or vchunk based): 256K blobs (8K per blob).
- Maximum shard count: literally infinite after we remove shard_header from disk; a reasonable upper limit is 256K.

Combining both, we need to 1) split a huge task (shard/vchunk) into small units to avoid impacting foreground traffic; 2) similar to 1), keep a reasonable execution time per task to allow timeout-based retry; 3) effectively consolidate K shards into a single round trip, for fewer messages.
Tools we have:
- The flatbuffer is efficient for default values; we can have {shard_id, blobid_low, blobid_high} as a task unit.
- We can carry a list of task units in a single request.
- Potentially, if we let the leader scrub first, it knows the size of each blob, so it can control the size per req to the best extent.
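A hedged sketch of such a task unit and the batch-splitting it enables (the `TaskUnit`/`split` names and exclusive-high convention are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical task unit: a blob id range within one shard. A request would
// carry a list of these, consolidating K shards in a single round trip while
// keeping each unit a small, retryable batch.
struct TaskUnit {
    uint64_t shard_id;
    uint64_t blob_lo;   // inclusive
    uint64_t blob_hi;   // exclusive
};

// Split one large unit into batches of at most `batch` blobs each, so a huge
// task never blocks foreground traffic for long.
std::vector<TaskUnit> split(const TaskUnit& u, uint64_t batch) {
    std::vector<TaskUnit> out;
    for (uint64_t lo = u.blob_lo; lo < u.blob_hi; lo += batch) {
        uint64_t hi = (lo + batch < u.blob_hi) ? lo + batch : u.blob_hi;
        out.push_back({u.shard_id, lo, hi});
    }
    return out;
}
```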
```cpp
using BlobHashArray = std::array< uint8_t, blob_max_hash_len >;
using chunk_id_t = homestore::chunk_num_t;
// TODO: persist this into metablk.
inline static atomic_uint64_t scrub_task_id{1};
```

It is subject to design, but a UUID usually prevents collision.
```cpp
LOGINFOMOD(scrubmgr, "submit a scrub task for pg={}, deep_scrub:{}", pg_id, is_deep);

// Check if a scrub task is already running for this PG
// Note: There's still a small race window between this check and task execution in handle_pg_scrub_task,
```
There is no expectation that the Q will be empty most of the time, i.e. no expectation that a task will turn into the running state soon after it is added to the Q. Quite possibly, if there are some deep-scrub tasks, they will block the Q for a long time.
With this setting, it is NOT a small racing window: duplicate tasks for the same PG can be added to the Q repeatedly until the first task gets executed.
What is worse, because the Q is ordered by task_id instead of last_scrubbed_timestamp, the repeated tasks, if any are not yet popped, will have the highest priority.
For example:

```
T0:   PG1 deep scrub, PG2 deep scrub
T1:   PG3 added (tid:3), PG4 added (tid:4)
T2:   PG3 added (tid:5), PG4 added (tid:6)
...
T10:  PG5 added (tid:21)
...
T100: PG3 picked (tid=3), deep-scrubbing
T101: PG4 picked (tid=4), deep-scrubbing
...
T201: PG3 picked again (tid=5)
```

Potentially an OOM risk as well.
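One way to picture the de-duplication being asked for here, as a minimal sketch (the `PendingSet` name is hypothetical; the actual fix may rely on PG state or queue inspection instead):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical sketch: track which PGs already have a queued or running scrub
// task, so repeated submissions for the same PG are rejected until the first
// task completes. This bounds the queue at one entry per PG.
class PendingSet {
    std::unordered_set<uint32_t> pending_;
public:
    // Returns true if the task was accepted, false if one is already pending.
    bool try_submit(uint32_t pg_id) { return pending_.insert(pg_id).second; }
    // Called when the scrub task for this PG finishes (or is cancelled).
    void on_task_done(uint32_t pg_id) { pending_.erase(pg_id); }
};
```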
From line 557 to line 569:

```cpp
// Check if pg_state is HEALTHY (state must be 0)
if (!force) {
    const auto current_state = hs_pg->pg_state_.get();
    if (current_state != 0) {
        LOGWARNMOD(scrubmgr, "pg={} is not in HEALTHY state (current_state={}), cannot submit scrub task!", pg_id,
                   current_state);
        return folly::makeFuture(std::shared_ptr< ScrubManager::ShallowScrubReport >(nullptr));
    }
    // Set SCRUBBING state
    hs_pg->pg_state_.set_state(PGStateMask::SCRUBBING);
    LOGINFOMOD(scrubmgr, "set SCRUBBING state for pg={}", pg_id);
}
```

Before pushing a scrub task onto the task queue for a specific PG, I check the state of that PG. If the PG's state is not HEALTHY-only (i.e., it has a state such as BASELINE_RESYNC or SCRUBBING), I skip scrubbing it.
If the state is HEALTHY-only, I add the SCRUBBING state to the PG.
Then, if we try to add a duplicate scrub task for this PG, the state check will fail. This guarantees there is at most one scrub task per PG, either pending (in the task queue) or being executed. The SCRUBBING state is removed after the scrub task completes.
I don't implement automatic scheduling for PG scrub in phase 1; at the moment, all PG scrubs are triggered manually via the HTTP API.
In phase 2, I will add scheduling logic for automatic scrubbing and change the compare function (operator<).
Please do not set the PG state to SCRUBBING if it is not actually in the middle of scrubbing. It is misleading.
Sure, I will add a new state, pending_scrubbing, later.
The state will be populated to CM and then to the dashboard. The problem is that we need de-dup to ensure there are no duplicates in the Q; the implementation should avoid using state as much as possible, to limit the change within HO.
If a PG has one of the following states, we should not scrub it: DISK_DOWN, BASELINE_RESYNC, RESYNCING, REPAIR. So PG states are useful for scrubbing.
Coming to the SCRUBBING state, I suggest using it like this for now, even if it is a little confusing. We can later have a separate PR to add a separate state describing the pending-scrubbing state, with the CM change if necessary.
Feel free to READ any PG state, but we need a strong reason to WRITE a PG state, as it is externally visible.
I would prefer we change it now, by moving the setting of SCRUBBING to the start of the scrubbing task. This ensures external correctness. Then, with the assumption that we don't have automatic PG scrub scheduling, the racing window is small enough because:
- the Q is expected to be short;
- only a human can trigger a scrub via the API.
With that reasoning, we can add a FIXME and address the de-dup question in the next PR.
```cpp
// TODO:: add more logic to decide the priority between two tasks after we introduce more logic for
// automatic schedule, the following are some criteria we can consider:
/*
2. Time Since Last Scrub
```
Hope this can be the default policy
sure, we can add more logic to operator< to achieve this goal.
```cpp
const auto& blob_id = *r_cast< blob_id_t const* >(key.cbytes());
LOGD("Received del_blob pre-commit for pg={}, shard=0x{:x}, blob_id={}, lsn={}", pg_id, shard_id, blob_id, lsn);

if (scrub_mgr_) { scrub_mgr_->add_pg_deleted_blob(pg_id, {shard_id, blob_id}, lsn); }
```
- All deleted blobs will accumulate in memory; the memory pressure is concerning.
- I am not getting the point of paying extra overhead on every delete (which is executed in the single-threaded state machine) to trade off a low-chance re-verification when a mismatch happens.
- It is possible to reconcile on each batch (blob id ranges) instead of reconciling at the end of the scrub task; the memory concern can be mitigated, however concern #2 is still there.
- I really don't want to think about possible rollback/BR scenarios.

Suggestion:
The racing only happens when the deleted blobID collides with the current scrubbing blob range, which is a relatively low chance. We can reconcile by spot-checking this blobID across the 3 replicas, either at the end of each batch, or at the end of the whole task (preferred, as the delay helps with consistency).
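A minimal sketch of the spot-check being suggested (the `SpotCheck` name and half-open range convention are hypothetical): during the task, remember only the deleted blob ids that collide with the range currently being scrubbed; at the end of the task, re-verify just those ids across the replicas.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical sketch: only deletions that collide with the active scrub
// range need re-verification, so the tracked set stays small compared to
// recording every deleted blob.
class SpotCheck {
    std::set<uint64_t> suspects_;
public:
    // Called on each blob deletion; track only ids within the active range.
    void on_delete(uint64_t blob_id, uint64_t range_start, uint64_t range_end) {
        if (blob_id >= range_start && blob_id < range_end) suspects_.insert(blob_id);
    }
    // Ids to re-verify across replicas at end of batch / end of task.
    const std::set<uint64_t>& suspects() const { return suspects_; }
};
```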
1. The extra overhead is very small and negligible; it is a memory-only op, just adding a blob_id into a map.
2. If rollback/BR happens, the scrub task will be cancelled and the task context deleted, so we don't need any rollback for this. After the scrub task is completed, on_blob_delete_pre_commit will do nothing.
3. The only possibly concerning part is memory pressure. We can have a strategy: if too many blobs (exceeding a pre-set limit) are deleted during PG scrubbing, we cancel this scrub task and redo it; then the occupied memory is freed when the scrub task context is deleted.
4. I also considered reconciling at the end of the whole task. However, since the commit lsn differs between replicas, this will not be that accurate: the delete_blob log has probably been committed on some replicas, but not on others.

Let's think about this again.
If a deletion happens at LSN X, and you check at LSN Y (after the whole task):
- If all members have deleted it, then we are good; you won't care whether at some LSN X', with X < X' < Y, there was an inconsistency.
- If a member still has the blob, then it is problematic for any X' < Y.

In both cases, checking at the end is correct.
> If there is a member still have the blob, then it is problematic for any X' < Y.

I do not follow this. Do you mean that when checking at LSN Y, all the members should wait until LSN Y is committed, and then see if they have the blob locally?
No, I just mean that if a blob still exists at the end of the task, it existed at any time during the task... and sleep/retry will solve 99.9999% of problems.
ok, got it, I will make this change
```cpp
uint64_t last_deep_scrub_timestamp;
uint64_t last_shallow_scrub_timestamp;
```

Can this be added into the PG superblk?
This needs a version upgrade of the PG meta blk, since it is backward incompatible. We can use a separate PR to do this.
Do you want to land that PR prior to the scrubber, or after the scrubber lands?
```cpp
// pg scrub superblk
#pragma pack(1)
struct pg_scrub_superblk {
```

I think the PG superblk is more for recovery of in-progress scrubbing, mostly the deep scrub. It is fine if we forget other stages (like shard scrubbing), but once in the blob scrubbing stage, progress should be persisted (optionally every N batches) to allow restoration.
I suggest not recording the progress of a scrub task; just like ceph, a new scrub task will be scheduled.
In HomeObject, a task is cancelled if any error happens during scrubbing, for example a leader change.
I am not sure ceph is a good example, as we don't have a success story of doing deep scrub once a month on 4TB drives, and now we are running on 20TB drives...
Anyway, fine to postpone it until necessary.
Another general comment: can you turn your design doc into an ADR and push it with the code? The ADR can then be updated as the code is implemented.
This PR implements the framework and basic logic of the scrubber, including:
1. thread model
2. scrubber RPC
3. local scrub: deep and shallow scrub for PG, shard and blob

Later PRs will:
1. add an HTTP interface to trigger PG scrub
2. add more UTs to cover more cases
3. do optimization