Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize backup verification #13292

Conversation

mszeszko-meta
Copy link
Contributor

@mszeszko-meta mszeszko-meta commented Jan 11, 2025

Summary

Today, backup verification is serial, which could pose a challenge in rare, high urgency recovery scenarios where we want to timely assess whether candidate backup is not corrupted and eligible for the restore. The timely part will become increasingly more important in case of disaggregated storage.

Semantics

Given the very simple thread pool implementation in backup_engine today, we do not really have a control over initialized threads and consequently do not have an option to unschedule / cancel in-progress tasks. As a result, VerifyBackup won't bail out on a very first mismatch (as it was the case for serial implementation) and instead will iterate over all the files logging success / degree_of_failure for each. We could, in theory, not .wait() on remaining std::future<WorkItem>s (upon previously detected failure) and therefore decrease the observed API latency, but that could cause more confusion down the road as verification threads would still be occupied with inflight/scheduled work and would not be reclaimed by the pool for a while. It's a tradeoff where we choose a solution with clear and intuitive semantics.

Test plan

Kudos to @pdillinger who pointed out that we should already have appropriate fuzzing for max_background_operations and verify_checksum=true parameters in scope of ::VerifyBackup calls in existing backup restore stress test collateral.

[1]

if (s.ok() && thread->rand.OneIn(2)) {
.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Probably worth a "behavior change" release note (unreleased_history/add.sh)

@mszeszko-meta mszeszko-meta force-pushed the parallelize_backup_verification branch from 3e226f5 to 26412ee Compare January 21, 2025 18:24
@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mszeszko-meta merged this pull request in 0e469c7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants