Parallelize backup verification #13292
Conversation
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM, thanks! Probably worth a "behavior change" release note (unreleased_history/add.sh)
@mszeszko-meta merged this pull request in 0e469c7.
Summary
Today, backup verification is serial, which can pose a challenge in rare, high-urgency recovery scenarios where we want to quickly assess whether a candidate backup is free of corruption and eligible for restore. Timeliness will become increasingly important with disaggregated storage.
Semantics
Given the very simple thread pool implementation in `backup_engine` today, we do not really have control over the initialized threads and consequently have no way to unschedule or cancel in-progress tasks. As a result, `VerifyBackup` won't bail out on the very first mismatch (as was the case for the serial implementation); instead it will iterate over all the files, logging success / degree of failure for each. We could, in theory, skip the `.wait()` on the remaining `std::future<WorkItem>`s (upon a previously detected failure) and thereby lower the observed API latency, but that could cause confusion down the road, as verification threads would still be occupied with in-flight / scheduled work and would not be reclaimed by the pool for a while. It's a tradeoff, and we choose the solution with clear and intuitive semantics.
Test plan
Kudos to @pdillinger, who pointed out that we should already have appropriate fuzzing for the `max_background_operations` and `verify_checksum=true` parameters in scope of `::VerifyBackup` calls in the existing backup/restore stress test collateral. [1]
[1] rocksdb/db_stress_tool/db_stress_test_base.cc, line 2219 at 15a5532
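For context, a sketch (not taken from the PR or the stress test) of how a user would exercise the now-parallel path through the public API: `BackupEngineOptions::max_background_operations` sizes the thread pool, and `verify_with_checksum=true` requests full checksum verification. The backup directory path and backup ID are assumptions for illustration:

```cpp
#include <cassert>

#include "rocksdb/env.h"
#include "rocksdb/utilities/backup_engine.h"

using namespace ROCKSDB_NAMESPACE;

int main() {
  // Assumed backup location for this example.
  BackupEngineOptions options("/tmp/rocksdb_backups");
  options.max_background_operations = 8;  // sizes the verification thread pool

  BackupEngine* engine = nullptr;
  IOStatus s = BackupEngine::Open(Env::Default(), options, &engine);
  assert(s.ok());

  // With this PR, per-file checks fan out across the pool instead of
  // running serially. Per the Semantics section, all files are checked
  // before the call returns, even if an early mismatch is found.
  s = engine->VerifyBackup(/*backup_id=*/1, /*verify_with_checksum=*/true);

  delete engine;
  return s.ok() ? 0 : 1;
}
```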