-
Notifications
You must be signed in to change notification settings - Fork 100
fix(TTL expired tombstones): Test TTL-expired tombstones according to new sstable dump format #10180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix(TTL expired tombstones): Test TTL-expired tombstones according to new sstable dump format #10180
Conversation
ffd19c7
to
a684d95
Compare
A failure example can now look like:
|
The test found an un-expected TTLed tombstone and failed. TTL-expired tombstones remained in SSTables beyond the expected GC time. Given that the garbage collection window should be max(gc_grace_seconds, propagation_delay_in_seconds) = 4 minutes, tombstones should have been removed by 09:18:32, but were still present when the test checked recently(5 minutes)-created-sstables at 09:23:50. propagation delay = gc-grace-seconds = 240 (4 minutes):
TTL expired tombstone deletion time is: '2025-02-24 09:11:23z' repair time for node-1 is
Looked for sstables of the last 5 minutes (09:18:51 and on):
Failure event - a recently created sstable contains an "old" (pre-repair) tombstone ('2025-02-24 09:11:23z') :
@tchaikov , @asias , can you please advise? could it be a test issue incorrectly expecting sstable to clear tombstones? or a scylla issue? |
a684d95
to
167db25
Compare
A test also failed with other gc-mode like "immediate", so something with ttl-expired-tombstones is not clear here:
|
It is unclear to me what this test expects. Note that regardless of the mode, tombstones don't just disappear after they become purgeable, they need a compaction to run to purge them. Even then, compaction can fail to purge tombstones because of overlap checks (be that false-positive or true-positive). Overlap checks look at memtables, commitlog and other sstables. So for best results: use |
@denesb , AFAIK, this test is based on tombstones-gc functionality, as also tested in Dtest like in https://github.com/scylladb/scylla-dtest/blame/4fdd2393c3f271d5716906f74b577686a73bef2e/compaction_test.py#L283. |
(col_data["deletion_time"] | ||
for row in partition.get("clustering_elements", []) | ||
for col_data in row.get("columns", {}).values() | ||
if "deletion_time" in col_data), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will match both live cells which have TTL as well as expired ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@denesb , following your and Kefu's suggestions i concluded "deletion_time" only present for ttl-expired tombstones. Indeed, when tested, it looks like non-expired cells has no "deletion_time", rather it has "expiry" instead. example for live cell:
{"sstables":{"/var/lib/scylla/data/scylla_bench/test-883a9260eea111efabef9647978d48d9/me-3gnx_0pqw_2u8k02uhw2g687mhsg-big-Data.db":[{"key":{"token":"16290705325071","raw":"00080000000b000008f9","value":"47244642553"},"clustering_elements":[{"type"
:"clustering-row","key":{"raw":"0008e7da6eeebcaeb407","value":"-1739956334378437625"},"marker":{"timestamp":1739956351291315,"ttl":"240s","expiry":"2025-02-19 09:16:31z"},"columns":{"v":{"is_live":true,"type":"regular","timestamp":1739956351291315
,"ttl":"240s","expiry":"2025-02-19 09:16:31z","value":"00000000000000000000"}}},
Example for a tombstone non-live (ttl-expired) cell:
{"sstables":{"/var/lib/scylla/data/scylla_bench/test-883a9260eea111efabef9647978d48d9/me-3gnx_0sy7_1xnps214se2r0gj72o-big-Data.db":[{"key":{"token":"-9223362767843183216","raw":"0008000000180005c8c7","value":"103079594183"},"clustering_elements":[
{"type":"clustering-row","key":{"raw":"0008e7da6dca50dad56f","value":"-1739957590317935249"},"marker":{"timestamp":1739960371561769},"columns":{"v":{"is_live":false,"type":"regular","timestamp":1739960371561769,"deletion_time":"2025-02-19 10:19:31
z"}}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned already, it is possible that TTL'ed cell is expired already but in the sstable, it still appears as a live cell with TTL (already expired TTL). When the sstable is rewritten, this is updated, but there isn't a deadline for this, ScyllaDB can do this so late that the TTL'd cells are dropped immediately as they are already garbage collectible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of you response below, I think this is not the reason the test is failing. If anything, this mistake just make the test pass because it won't notice expired cells, still left in sstables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @denesb . So if i understand correctly, even though the test waits propagation delay and TTL expiration, it is still no certain that the major compaction will generate all new sstables without the tombstones.
@pehala , i think if this is the case, then this test is less relevant or quite useless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wait max(propagation-delay, TTL), then do flush+major, you should see no tombstones at all, nor any cells which look live but are expired actually (TTL is up).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test passed ok after reordering some test steps.
@@ -130,5 +129,5 @@ def test_switch_tombstone_gc_modes(self): | |||
self.log.debug('Starting sstabledump to verify correctness of tombstones for %s sstables', | |||
len(sstables)) | |||
for sstable in sstables: | |||
tombstone_deletion_info = sstable_utils.get_compacted_tombstone_deletion_info(sstable=sstable) | |||
tombstone_deletion_info = sstable_utils.get_ttl_expired_tombstone_deletion_info(sstable=sstable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this method intended to return? Currently it returns any partition which has at least one cell with either a tombstone or a TTL'd cell (live or expired). Is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@denesb ,No, it is expected to return only a partition that has a tombstone (in one of it's rows).
As described in above comment of #10180 (comment).
So the test expect no tombstones exist at all at it's end, but a partition with a tombstone row cell is found, failing this validation.
The test scenario is:
running s-b load like:
scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000 -clustering-row-size=200 -concurrency=10 -rows-per-request=1 -start-timestamp=1740475139106848950 -connection-count 10 -max-rate 30000 -timeout=120s -retry-number=30 -retry-interval=80ms,1s -duration=80m
Altering the s-b table during test steps like:
< t:2025-02-25 09:18:38,953 f:common.py l:1318 c:utils p:DEBUG > Executing CQL 'ALTER TABLE scylla_bench.test with gc_grace_seconds = 240 and default_time_to_live = 240 and compaction = {'class':'TimeWindowCompactionStrategy', 'compaction_window_size': 5, 'compaction_window_unit': 'MINUTES'} and tombstone_gc = {'mode': 'disabled', 'propagation_delay_in_seconds':'240'};' ...
< t:2025-02-25 09:18:38,954 f:common.py l:1326 c:utils p:DEBUG > Executing CQL 'ALTER TABLE scylla_bench.test with gc_grace_seconds = 240 and default_time_to_live = 240 and compaction = {'class':'TimeWindowCompactionStrategy', 'compaction_window_size': 5, 'compaction_window_unit': 'MINUTES'} and tombstone_gc = {'mode': 'disabled', 'propagation_delay_in_seconds':'240'};' ...
< t:2025-02-25 09:44:54,739 f:common.py l:1318 c:utils p:DEBUG > Executing CQL 'ALTER TABLE scylla_bench.test with gc_grace_seconds = 864000 and tombstone_gc = {'mode': 'repair', 'propagation_delay_in_seconds':'240'};' ...
< t:2025-02-25 09:44:54,740 f:common.py l:1326 c:utils p:DEBUG > Executing CQL 'ALTER TABLE scylla_bench.test with gc_grace_seconds = 864000 and tombstone_gc = {'mode': 'repair', 'propagation_delay_in_seconds':'240'};' ...
< t:2025-02-25 09:52:00,653 f:common.py l:1318 c:utils p:DEBUG > Executing CQL 'ALTER TABLE scylla_bench.test with tombstone_gc = {'mode': 'immediate', 'propagation_delay_in_seconds':'300'};' ...
< t:2025-02-25 09:52:00,653 f:common.py l:1326 c:utils p:DEBUG > Executing CQL 'ALTER TABLE scylla_bench.test with tombstone_gc = {'mode': 'immediate', 'propagation_delay_in_seconds':'300'};' ...
So, after modifying to 'mode': 'immediate'
, the test waits for load to finish, runs a flush and major compaction and expects no tombstones at all. But it did find some tombstones of ttl-expired partition row cell. (s-b writes a simple key value column pattern).
As for s-b write pattern, i guess it doesn't update a partial partition or row, only full partition content. Is there a chance that an expired cell doesn't become a tombstone because of other non-expired cells in the same partition?
i think, according to s-b pattern, all partition rows are expected to expire at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, after modifying to
'mode': 'immediate'
, the test waits for load to finish, runs a flush and major compaction and expects no tombstones at all. But it did find some tombstones of ttl-expired partition row cell. (s-b writes a simple key value column pattern).
I see, I think I finally understand what the test is doing.
As for s-b write pattern, i guess it doesn't update a partial partition or row, only full partition content. Is there a chance that an expired cell doesn't become a tombstone because of other non-expired cells in the same partition?
No, converting TTL'd cells to tombstone is independent of other content of the partition, with the exception of higher order tombstones (with later timestamp). When those are present, ScyllaDB will drop the cell (live or not) as it is shadowed by the higher order tombtone.
i think, according to s-b pattern, all partition rows are expected to expire at the same time.
Removing default_time_to_live
only applies to data written after this alter. Previous data will still have the earlier TTL applied to it. Compaction doesn't change this.
Also, even with 'mode': 'immediate'
, we still have propagation_delay
, tombstones will not be eligible for garbage-collection before this expires. So if you want all tombstones to be gone, you need to wait for the load to stop, then wait propagation_delay
seconds and only then do the flush+major. You also need to make sure that default_time_to_live
seconds have passed since the last alter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i'll recheck if there's a testing issue in:
'ALTER TABLE scylla_bench.test with gc_grace_seconds = 864000 and tombstone_gc = {'mode': 'repair', 'propagation_delay_in_seconds':'240'};'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test passed ok after reordering some test steps.
35a969b
to
b763b9f
Compare
The test passed ok after applying some limitations. Some log output looks like:
|
b763b9f
to
7690dd5
Compare
afd2ed4
to
6a5d404
Compare
@pehala , conflicts are resolved now (your approval was dismissed ) |
@scylladb/qa-maintainers ping for review |
@scylladb/qa-maintainers yet another ping for review/merge cc: @pehala |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
besides one question, LGTM (but I didn't verify test logic much)
@@ -1,6 +1,6 @@ | |||
test_duration: 200 | |||
|
|||
stress_cmd: ["scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000 -clustering-row-size=200 -concurrency=10 -rows-per-request=1 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 10 -max-rate 30000 -timeout=120s -retry-number=30 -retry-interval=80ms,1s -duration=80m"] | |||
stress_cmd: ["scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000 -clustering-row-size=200 -concurrency=10 -rows-per-request=1 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 10 -max-rate 30000 -timeout=120s -retry-number=30 -retry-interval=80ms,1s -duration=4m"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is 4m enough or just leftover from testing?
Btw. shouldn't it just insert all the intended data and end without overwrites?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, the test is modified due to former stability issues.
please read in title of last commit:
fix(test_switch_tombstone_gc_modes): run stress in cycles for each test step
Run stress per tombstone-gc-mode step of the test.
This way, stress writes are not mixed between different gc-modes and causes confusions.
|
) | ||
|
||
self.log.debug("Found %s tombstones in SSTable %s", num_tombstones, sstable) | ||
f"sudo awk -F 'deletion_time' '{{print NF-1}}' {self.REMOTE_SSTABLEDUMP_PATH}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use awk
(and why with sudo
), when we have a structured output (JSON
) that you can parse with jq
.
Also, this indiscriminate matching of deletion_time
might include things which have no TTL, like regular tombstones.
To match only TTL expired tombstones, you need to match only the following entities:
- row marker
- cells -- both regular and collection cells
And nothing else. Higher-order tombstones also have deletion_time
but they are not subject to TTL and thus would constitute false-positive matches here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so for avoiding confusions and false-positive, the method is renamed "count_tombstones". So currently no differentiation of tombstone type, which is sufficient for this test.
We could consider using jq for the next steps, but that would also introduce a stronger dependency on implementation details.
In any case we better use it first by adding ttl tests to dtest.
sstable_data = dump_data['sstables'].get(sstable, []) | ||
|
||
# Extract entries that contain a tombstone | ||
tombstones_deletion_info = [entry for entry in sstable_data if 'tombstone' in entry] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will match only the partition tombstone.
Is this intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, with this regard, SCT, so far, only handles partition tombstone.
Yet it is a good point and meant to be addressed in tombstone test plan and in tests.
It may first be covered in either Dtest or test.py.
I opened an SCT issue for that - #10595.
Retested and passed ok in: https://argus.scylladb.com/tests/scylla-cluster-tests/3fc207e3-5343-4eea-a438-673cfe439cd5 |
51ab889
to
7138f9e
Compare
@yarongilor please rebase |
…o new sstable dump format The new format of scylla "sstable dump-data" command dumps the data as is, unlike in the past versions. The test of longevity_tombstone_gc_test.py should count and parse the ttl-expired tombstones according to this format. Refs: scylladb#9980
…st step Run stress per tombstone-gc-mode step of the test. This way, stress writes are not mixed between different gc-modes and causes confusions.
7138f9e
to
f71e728
Compare
@soyacz , rebased. |
The new format of scylla "sstable dump-data" command dumps the data as is, unlike in the past versions. The test of longevity_tombstone_gc_test.py should count and parse the ttl-expired tombstones according to this format.
Refs: #9980
Testing
PR pre-checks (self review)
backport
labelsReminders
sdcm/sct_config.py
)unit-test/
folder)