improvement(nemesis): Rework nemesis discovery #10502
base: master
6ba1735 to 2251dae
In general the direction LGTM
I think we can throw away a few more things while we are at it (without too much work).
2251dae to df1154c
v2 - Dropped the backwards compatibility of this patch and simplified the logic:
The [...] The same about the [...] After it, the first and 4th commits could be combined into one, merging their nice descriptions. It will simplify reviewing the changes of this PR. Then, [...]
Can we consider this one (#10502) as the first in the chain and get it merged sooner rather than later?
I will extract those
We can make this one first; I can edit the rest to rely on this one.
v3:
The next version should be the final one, after the extracted PRs are incorporated and final comments are addressed.
cdbab07 to 3eb6b15
v4:
I also updated the cover letter and started a test run to verify correctness, as this patch now affects actual test cases. I will take this out of draft once the run is finished and passing.
sdcm/nemesis.py
Outdated
AllMonkey, MdcChaosMonkey,
DisruptiveMonkey, NonDisruptiveMonkey, GeminiNonDisruptiveChaosMonkey,
GeminiChaosMonkey, NetworkMonkey, SisyphusMonkey,
COMPLEX_NEMESIS = [NoOpMonkey, ChaosMonkey, ScyllaCloudLimitedChaosMonkey,
I wonder how it worked with NemesisSequence - this one should also be excluded. How does the code know that we should not include disrupt_run_unique_sequence?
It does not. It seems like disrupt_run_unique_sequence was always collected, see scylla-cluster-tests/data_dir/nemesis.yml, line 309 in 93e7488:
- disrupt_run_unique_sequence:
Indeed, I see it was run 41 times in 2025.1 testing. But I'm not sure if we should - this is mainly used in perf tests.
3eb6b15 to b82d4d3
b82d4d3 to 2b8b3e0
2b8b3e0 to cd2b90d
v5 - Unify disrupt method execution mechanisms:
This change could theoretically be extracted out of this PR, but it would mean only one of the execution paths would call the new code. I consider ComplexNemesis mostly outdated, so this should not be a big issue.
cd2b90d to 4d81b4a
7aa77d8 to 0e034cf
…lass
* Add NemesisRegistry class
  * It is responsible for discovering and filtering NemesisClasses and disrupt methods
  * Doesn't need to instantiate the Nemesis class
  * Reduced number of methods compared to previous:
    * get_disrupt_methods for filtering, takes in logical_phrase
    * gather_properties for exporting
* Change nemesis.yml structure
  * Now it is a pure dict, instead of a list of strings
* test_nemesis_sisyphus.py no longer needs Fake classes to generate the .yml files
* Speed up nemesis discovery by checking source code only for the disrupt method
* Change nemesis binding to be based on Class instead of an Instance
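A minimal sketch of the discovery idea described in this commit message, not the actual sdcm code: disrupt methods are collected at class level, so nothing is instantiated, and classes are filtered by a logical phrase over boolean flags. Every name below (`SketchRegistry`, `BaseNemesis`, the example subclasses, the flag names) is hypothetical, and using `eval` to evaluate the phrase is an assumption made to keep the example short.

```python
import inspect


class BaseNemesis:
    """Hypothetical stand-in for a base class that is expensive to construct."""
    disruptive = False
    networking = False

    def __init__(self, tester, cluster):
        # Discovery must work without ever reaching this constructor.
        raise RuntimeError("discovery should never instantiate a nemesis")


class StopStartNemesis(BaseNemesis):
    disruptive = True

    def disrupt_stop_start(self):
        pass


class BlockNetworkNemesis(BaseNemesis):
    disruptive = True
    networking = True

    def disrupt_block_network(self):
        pass


class NoOpNemesis(BaseNemesis):
    def disrupt_no_op(self):
        pass


class SketchRegistry:
    """Discovers subclasses and their disrupt_* methods without instantiation."""

    def __init__(self, base_class, flags):
        self.base_class = base_class
        self.flags = flags

    def _subclasses(self, cls=None):
        # Walk the subclass tree recursively, starting from the base class.
        cls = cls or self.base_class
        for sub in cls.__subclasses__():
            yield sub
            yield from self._subclasses(sub)

    def get_disrupt_methods(self, logical_phrase=None):
        """Return sorted disrupt method names of classes matching the phrase."""
        selected = set()
        for cls in self._subclasses():
            context = {flag: bool(getattr(cls, flag, False)) for flag in self.flags}
            # Assumption: eval over a restricted namespace; a real parser would be safer.
            if logical_phrase and not eval(logical_phrase, {"__builtins__": {}}, context):
                continue
            for name, _ in inspect.getmembers(cls, inspect.isfunction):
                if name.startswith("disrupt_"):
                    selected.add(name)
        return sorted(selected)


registry = SketchRegistry(BaseNemesis, flags=("disruptive", "networking"))
print(registry.get_disrupt_methods("disruptive and not networking"))
```

Because only class attributes are read, a test can filter nemeses without mocking Tester, Cluster or Node.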
LimitedChaosMonkey, GeminiNonDisruptiveChaosMonkey, GeminiChaosMonkey, NetworkMonkey, NonDisruptiveMonkey, DisruptiveMonkey, FreeTierSetMonkey, SlaNemesis
…_methods
Previously it worked because we already had a disrupt_add_remove_dc disrupt method. Change it so it actually tests @Nemesis.add_disrupt_method.
Currently, two separate mechanisms exist to call disrupt methods:
* build_list_of_disruptions, used by SisyphusMonkey
* call_random_disrupt_method, used by all complex monkeys
These two mechanisms differ in the yaml properties they support, in their usage of random_seed, and in how they discover disrupt methods: the former uses NemesisRegistry, the latter filters directly. This commit unifies both use cases under one code path.
Remove AllMonkey and ChaosMonkey, as they can be replaced by SisyphusMonkey.
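One way to picture the unification, as a rough sketch rather than the code in this PR: a single helper receives the already filtered disrupt method names plus a seed and returns the execution order, so every caller gets the same reproducible behavior. The helper name `build_disruption_order` and its parameters are hypothetical.

```python
import random


def build_disruption_order(method_names, seed=None):
    """Return a shuffled list of disrupt method names, reproducible for a given seed."""
    ordered = sorted(method_names)  # sort first so the input order does not matter
    rng = random.Random(seed)  # private generator, the global random state stays untouched
    rng.shuffle(ordered)
    return ordered


methods = ["disrupt_no_op", "disrupt_stop_start", "disrupt_block_network"]
first = build_disruption_order(methods, seed=42)
second = build_disruption_order(reversed(methods), seed=42)
print(first == second)
```

Sorting before shuffling means two callers that discover the same methods in a different order still run them in the same sequence for the same seed.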
3f3fa05 to dfeca9c
Problem statement
Currently, the Nemesis class has too many responsibilities, which causes problems. One of them is that it requires parameters to initialize and does a lot of logic in the __init__ function, and you need to provide those parameters even if you need only part of the class. The problem manifests in test_nemesis_sisyphus, which needs to mock Tester, Cluster and Node to filter nemeses, although none of the methods used require any of them; we need to provide them only because of the aforementioned problem.
Solution
Extract Nemesis discovery (i.e. gathering all disrupt methods and matching them with subclasses) to a separate class. While doing so, also reduce the number of nemesis discovery methods needed. Only one method for filtering is now present (NemesisRegistry.get_disrupt_methods), and its input is a logical phrase. To also allow extracting properties to nemesis.yaml / nemesis_classes.yaml, add gather_properties.
nemesis.yaml was also changed from a list containing strings to a full-on dict. The dict is sorted, so the desired output is essentially the same, but it requires less processing to write/read.
All monkeys which only filtered by flags (LimitedChaosMonkey, GeminiNonDisruptiveChaosMonkey, GeminiChaosMonkey, NetworkMonkey, NonDisruptiveMonkey, DisruptiveMonkey, FreeTierSetMonkey) were removed and replaced by nemesis_selector usage.
Changelog
* … NemesisRegistry class
* … disrupt method
* … test_nemesis_sisyphus from 13 sec to 3 sec, including python env initialization
* … test_nemesis_sisyphus to demonstrate improvements
* … nemesis.yaml structure into a dict
Testing
* … test-cases/longevity/longevity-sla-100gb-4h test, as that is one of the tests affected by the changes
PR pre-checks (self review)
* … backport labels
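To illustrate the property export mentioned in the Solution section, here is a small sketch of what a gather_properties-style function could produce: a sorted dict mapping each disrupt method to the flags of its class. The classes, the flag names and the function signature are assumptions, and `json` is used only to print the result without a third-party YAML dependency; the real export targets nemesis.yaml.

```python
import json


class StopStartNemesis:
    """Hypothetical nemesis class with class-level flags."""
    disruptive = True
    networking = False

    def disrupt_stop_start(self):
        pass


class BlockNetworkNemesis:
    """Hypothetical nemesis class with class-level flags."""
    disruptive = True
    networking = True

    def disrupt_block_network(self):
        pass


def gather_properties(classes, flags=("disruptive", "networking")):
    """Map each disrupt method name to the flag values of its class, sorted by name."""
    properties = {}
    for cls in classes:
        for name in vars(cls):  # only attributes defined directly on the class
            if name.startswith("disrupt_"):
                properties[name] = {flag: getattr(cls, flag) for flag in flags}
    return dict(sorted(properties.items()))


props = gather_properties([StopStartNemesis, BlockNetworkNemesis])
print(json.dumps(props, indent=2))
```

A plain dict like this can be dumped and loaded as-is, which is the kind of reduced processing the description refers to.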