Experiment: verify secondaries in crash test by recreating expected state as of sequence number #13266

archang19 · 2025-01-03T21:24:35Z

Here I experiment with one approach to verify secondaries in the crash tests. I take the most recent state file, replay the trace file up to a certain sequence number, and create an ExpectedState representing the target state as of a particular point in time. This ExpectedState gets used to run comparisons against the secondary.
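To make the replay idea concrete, here is a minimal Python sketch of it. It is a toy model, not the db_stress implementation: `TraceRecord` and `replay_to_seqno` are hypothetical names, the state is a plain dictionary rather than an `ExpectedState`, and it assumes that every traced write consumes exactly one sequence number and that the trace is ordered by sequence number.

```python
from typing import Dict, Iterable, NamedTuple, Optional


class TraceRecord(NamedTuple):
    """Hypothetical trace entry: one write plus the sequence number it received."""
    seqno: int
    op: str                # "put" or "delete"
    key: str
    value: Optional[str]   # None for deletes


def replay_to_seqno(
    snapshot: Dict[str, str],
    snapshot_seqno: int,
    trace: Iterable[TraceRecord],
    target_seqno: int,
) -> Dict[str, str]:
    """Rebuild the expected key/value state as of target_seqno.

    Assumes (for illustration only) that `trace` is sorted by seqno and that
    `snapshot` reflects every write with seqno <= snapshot_seqno.
    """
    if target_seqno < snapshot_seqno:
        raise ValueError("target sequence number predates the snapshot")

    state = dict(snapshot)  # copy so the saved snapshot is left untouched
    for rec in trace:
        if rec.seqno <= snapshot_seqno:
            continue  # already reflected in the snapshot
        if rec.seqno > target_seqno:
            break  # everything after this is newer than the target
        if rec.op == "put":
            state[rec.key] = rec.value
        elif rec.op == "delete":
            state.pop(rec.key, None)
        else:
            raise ValueError(f"unknown trace op: {rec.op}")
    return state
```

In this model, the state returned for the secondary's current sequence number is what its reads would be compared against. Any drift between sequence numbers and trace records would make the rebuilt state wrong for every later key.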


Summary:

TLDR: This PR enables secondary DB verification inside the "simple" crash tests (`NonBatchedOpsStressTest`). Essentially, we want to be able to verify that the secondary is a valid "prefix" of the primary. This PR allows us to do this by piggybacking on the existing verification of the primary through `Get()` requests.

I originally proposed replaying the trace file to recreate the `ExpectedState` as of a specific sequence number. This could be used to run verifications against the secondary database. I did some experimenting in #13266 and got a "mostly working" implementation of this approach. I could sometimes get through entire key space verifications, but eventually one of the keys would fail verification. I have not figured out the root cause yet, but I assume that something broke the alignment between sequence numbers and trace records.

The approach in this PR is considerably simpler. We can just check that the secondary database's value is in the correct "range," which we already have functionality for checking. Compared to the approach in #13266, this approach is _much, much simpler_ since we do not have to go through the whole headache of replaying the trace and creating an entire new `ExpectedState`. (Look at #13266 to see how much of a mess that creates.) I think this approach is better than my original approach in almost all aspects: it's faster, uses less space, and has less room for implementation errors.

Other nice aspects of this approach:

1. We don't need to block the primary. (Another approach you could imagine would be to block writes to the primary, have the secondary catch up, do the whole verification, and then re-enable writes to the primary.)
2. We don't need to block the secondary or do any special coordination (locks, sync points, etc). (If we insist on one "golden" expected value to be read from the secondary, then we need to make sure that another thread does not call `TryCatchUpWithPrimary` while we are trying to perform a `Get()`.)
3. More "realistic" usage of the secondary. For instance, in production, writes to the primary and catch-up on the secondary would continue while we try to read from the secondary.

The main drawback, of course, is that we verify against a range of expected values rather than one particular expected value. However, I think this is acceptable and "good enough," especially with all of the other aforementioned benefits.

Historical context:

There is some very old code that attempted to verify secondaries, but it is not enabled. This code has not been touched or executed in an extremely long time, and the crash tests started failing when I tried enabling it, most likely because the code is not compatible with certain other crash test options. This code is for the "continuous verification" and involves long iterator scans over the secondary database. Some of the code involved the cross CF consistency test type. I don't think the old checks are what we really want for our purposes of verifying the secondary functionality.

Since I don't think we will get much value out of this old "continuous verification" code, I integrated my secondary verification with the "regular" database verification. This also makes the rollout simpler on my end, since I can control whether my secondary verifications are enabled through one `test_secondary` configuration. To make sure the old code does not execute for our recurring crash test runs, I had to enforce that `continuous_verification_interval` is 0 whenever `test_secondary` is set.

Monitoring:

I will want to monitor the Sandcastle "simple" runs for failures where `test_secondary` is set. All of my error messages are prefixed with "Secondary" so it should be easy to tell if this PR causes any crash test issues.

Future work:

1. Extend this to followers. I think the same verification method should work, so most of the code from this PR should be reusable.
2. Add additional checks to make sure the sequence number of the follower/secondary is actually increasing. For instance, if the primary's sequence number has advanced, and in that period the secondary's has not (even after calling `TryCatchUpWithPrimary`), then we know there is a problem.
3. Potentially check things other than `Get()` for the secondary (e.g. iterators). I think the focus here should be testing replication-specific logic, and since we will already have separate unit tests, we do not need to repeat all of the tests against both the primary and the secondary.

Pull Request resolved: #13281

Test Plan:

The primary crash test commands I ran were:

```
python3 tools/db_crashtest.py --simple blackbox --test_secondary=1
python3 tools/db_crashtest.py --simple whitebox --test_secondary=1
```

As a sanity check, I added an `assert(false)` right after my secondary verification code to make sure that my code was actually being run.

Reviewed By: anand1976

Differential Revision: D67953821

Pulled By: archang19

fbshipit-source-id: 0bd853580ea53566be41639f5499eb9b5e0e9376
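The "range" check described in the summary can be pictured with a small Python sketch. It is a simplified model and not the db_stress code: `secondary_value_in_range` is a hypothetical helper, and the per-key write history and the two indices are assumptions made only for illustration. The idea is that a lagging secondary may return any value the key held between the newest write it is known to have applied and the newest write on the primary.

```python
from typing import List, Optional


def secondary_value_in_range(
    history: List[Optional[str]],
    replicated_idx: int,
    primary_idx: int,
    observed: Optional[str],
) -> bool:
    """Return True if `observed` is a value the key held within the window.

    history        -- hypothetical ordered list of values written to one key
                      (None stands for "deleted / not present")
    replicated_idx -- index of the newest write the secondary is known to have
    primary_idx    -- index of the newest write applied on the primary
    observed       -- what Get() on the secondary returned (None if not found)
    """
    if not 0 <= replicated_idx <= primary_idx < len(history):
        raise ValueError("indices must satisfy 0 <= replicated <= primary < len")
    # Any value inside the window is acceptable; anything older or newer is not.
    return observed in history[replicated_idx : primary_idx + 1]
```

Because the check accepts a window rather than one "golden" value, the reader never has to stop the primary or prevent a concurrent `TryCatchUpWithPrimary`. The trade-off is the one noted in the summary: a stale value that still falls inside the window goes undetected.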


facebook-github-bot added the CLA Signed label Jan 3, 2025

archang19 force-pushed the stress-test-follower branch 3 times, most recently from e9fdd31 to bd06353 Compare January 6, 2025 23:52

archang19 added 8 commits January 7, 2025 09:34

Experimenting with existing db secondary tests

ee658fe

Update Restore to use DBType

f263587

Refactor out the replay code

1cd5b43

Add a GetExpectedState method

95bdda9

Try adding expected state replay

61dc65b

Add calls to GetExpectedState

373448e

logging updates

6295d88

Create new FileSnapshotExpectedState to use instead of AnonExpectedState

d84b10b

archang19 force-pushed the stress-test-follower branch from bd06353 to 3316533 Compare January 7, 2025 17:34

Verification for secondary test fails but at least it is checking something

de7f11c

archang19 force-pushed the stress-test-follower branch from 3316533 to de7f11c Compare January 7, 2025 18:16

archang19 added 6 commits January 7, 2025 11:28

With iterator instead of a bunch of gets

71ddd60

Different file name

26b99d5

With mutex

9b300c2

clean up old verification files

e9ff3cf

Test with copying trace file. Still hit failures

08f62cd

Record last timestamp found in trace

0e25c6b

archang19 mentioned this pull request Jan 8, 2025

Verify values in secondary database against expected state #13281

Closed

archang19 changed the title ~~[Experimental] See how we can test followers and secondaries~~ Experiment: verify secondaries in crash test by recreating expected state as of sequence number Jan 8, 2025
