Parallelize backup verification #13292
Conversation
@mszeszko-meta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM, thanks! Probably worth a "behavior change" release note (unreleased_history/add.sh)
@mszeszko-meta merged this pull request in 0e469c7.
Summary
Today, backup verification is serial, which can pose a challenge in rare, high-urgency recovery scenarios where we want to quickly assess whether a candidate backup is free of corruption and eligible for restore. Timeliness will become increasingly important with disaggregated storage.
Semantics
Given the very simple thread pool implementation in `backup_engine` today, we do not really have control over the initialized threads and consequently have no way to unschedule or cancel in-progress tasks. As a result, `VerifyBackup` won't bail out on the very first mismatch (as was the case for the serial implementation); instead it will iterate over all the files, logging success / degree of failure for each. We could, in theory, skip the `.wait()` on the remaining `std::future<WorkItem>`s (upon a previously detected failure) and thereby lower the observed API latency, but that could cause confusion down the road, as verification threads would still be occupied with in-flight / scheduled work and would not be reclaimed by the pool for a while. It's a tradeoff, and we choose the solution with clear and intuitive semantics.
Test plan
Kudos to @pdillinger, who pointed out that we should already have appropriate fuzzing for the `max_background_operations` and `verify_checksum=true` parameters in scope of `::VerifyBackup` calls in the existing backup/restore stress test collateral. [1]
[1] rocksdb/db_stress_tool/db_stress_test_base.cc, line 2219 at 15a5532
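For context, a sketch (not taken from the PR or the stress test) of how a user would exercise the now-parallel path through the public API: `BackupEngineOptions::max_background_operations` sizes the thread pool, and `verify_with_checksum=true` requests full checksum verification. The backup directory path and backup ID are assumptions for illustration:

```cpp
#include <cassert>

#include "rocksdb/env.h"
#include "rocksdb/utilities/backup_engine.h"

using namespace ROCKSDB_NAMESPACE;

int main() {
  // Assumed backup location for this example.
  BackupEngineOptions options("/tmp/rocksdb_backups");
  options.max_background_operations = 8;  // sizes the verification thread pool

  BackupEngine* engine = nullptr;
  IOStatus s = BackupEngine::Open(Env::Default(), options, &engine);
  assert(s.ok());

  // With this PR, per-file checks fan out across the pool instead of
  // running serially. Per the Semantics section, all files are checked
  // before the call returns, even if an early mismatch is found.
  s = engine->VerifyBackup(/*backup_id=*/1, /*verify_with_checksum=*/true);

  delete engine;
  return s.ok() ? 0 : 1;
}
```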