feature(nemesis): Move code responsibility to nemesis classes #10935

pehala · 2025-05-25T08:44:59Z

Why?

Currently, the disrupt methods are carrier of the code, they need to be all in the same class (Nemesis) and the only reason that Monkey classes exist is to link method with flags to be able to do filtering. Code in the disrupt() methods of the Monkey classes is never actually executed it is just scanned to do the linking.

This is extremely confusing and not obvious when you look at it and also removes any possibility of code organization as all code needs to live in one huge class. This PR aims to simplify how current nemesis code is executed and allow moving the classes/code to submodules to improve code maintainability and readability.

How?

Make all NemesisMonkey inherit from a new class NemesisBaseClass, main purpose of the classes is to carry metadata to execute nemesis (flags, exclusive, etc). The actual code that is executed live in the __init__ so they are stateless and are instantiated on every nemesis invocation. __init__ also takes additional parameter runner, which is the original Nemesis class which still contains all the methods and all of the argus/es/log configuration, but is reused.

The original runners like SisyphusMonkey or CategoricalMonkey still derive from Nemesis class as those affect how the nemesis are executed and what is their order, but they do not care about the code. From configuration standpoint there is only one big thing, you cannot specify Nemesis class in nemesis-class-name parameter, but you can filter by it in the selector. This change also means we can make SisyphusMonkey default class and stop using that parameter for majority of the tests.

This makes is so that the change is minimal and for now the all nemesis code is unchanged, only the Nemesis classes (and target_node marks) are slightly altered changed, but allows for migrating functional code out of the main class to individual Monkeys and to create logical hierarchies (e.g. TopologicalNemesis) where shared code for bunch of Nemesis could live, without polluting the main class.

It builds on top of #10912 for a better battery of tests.

Current state of the PR

PR is functional and passed basic single nemesis testing, but is far from finished. Mostly serves as a proposal for I am looking for opinions on general idea and concepts

Fix CategoricalMonkey and other "runners"
Finalize naming of the classes
Do more testing
Polish code

Future

For follow-ups, I can see slow migration from the nemesis class to a module, nemesis by nemesis

Testing

Unit testing
Individual Nemesis testing:
- https://argus.scylladb.com/tests/scylla-cluster-tests/77a1127a-1068-431a-8992-2c65947dc104
Longevity testing

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

pehala · 2025-05-25T08:48:28Z

Important change here is also the fact, the names of the nemesis changed to the name of the class instead of the method, which might be a breaking change

soyacz

IMHO, looks like right direction.

One more thing to consider when thinking about moving logic under specific nemeses: some of them use logic/methods from another one. Need to think how to approach this - or allowing creating Nemesis instances in other nemesis, or extract disruption functions to standalone functions.

soyacz · 2025-05-27T12:48:15Z

sdcm/nemesis.py

+    exclusive: bool = False  # True, if the nemesis is exclusive, i.e. it should not run in parallel with other nemeses
+
+
+class Nemesis:


Maybe calling it NemesisRunner or NemesisThread?

I like both of these, with NemesisThread being a little bit more preferable. I did not rename it here to avoid adding more code changes, but I can if you feel like it would be better

soyacz · 2025-05-27T12:52:47Z

sdcm/nemesis.py

    disruptive = False
    config_changes = True
    supports_high_disk_utilization = True

-    def disrupt(self):
-        self.disrupt_hot_reloading_internode_certificate()
+    def __init__(self, runner):


it doesn't feel natural to run disruption in __init__.
I'd propose doing it in __call__ or having separate method for that.
in future we could use __init__ to call specific logic that will define if this one ought be skipped (this may vary during the test based on Scylla version, existing table or any other factor)

I thought about it and for now I ended with __init__ for two reasons:
a) Stateless - I feel like we want nemesis to be stateless for simplicity and to make sure multiple runs of single nemesis cannot affect each other
b) As close to drop in replacement compared to the current state

If we used a different method, we would have to instantiate first (when? every time we want to call it? only once?) and only then call __call__ or other method.

For skipping and similar logic I would lean towards marks or a class method which would be independent on the actual nemesis code

pehala · 2025-05-27T13:23:36Z

One more thing to consider when thinking about moving logic under specific nemeses: some of them use logic/methods from another one. Need to think how to approach this - or allowing creating Nemesis instances in other nemesis, or extract disruption functions to standalone functions.

You can use hierarchy to achieve that or use Mixins for doing that. Multi level hierarchies will not work with this PR for now, as even intermediate nemesis would be selected. I have plan for not selecting abstract classes which would allow this use case.

Important thing here is that this design does not force anything specific, so we can decide on how to tackle organization and inheritance later

Currently, the main carrier of the code are the disrupt methods, with nemesis classes only being scanner by regex and are used to match disrupt methods with flags. This commit moves all the execution part to the classes themselves.

Updates to use the new method of selecting specific Nemesis

* Change it to be compatible with nemesis rework * Improve it to use nemesis filtering and seed.

pehala · 2025-05-27T14:56:38Z

v2:

Rebased
Fix CategoricalMonkey
- Also updated it with usage of nemesis_seed and selectors
Fix rest of the tests

pehala requested review from soyacz and timtimb0t May 25, 2025 08:44

pehala added the backport/none Backport is not required label May 25, 2025

github-actions bot assigned pehala May 25, 2025

pehala force-pushed the nemesis_rework_v2 branch 4 times, most recently from 0a0279c to 646d270 Compare May 26, 2025 09:48

soyacz reviewed May 27, 2025

View reviewed changes

pehala force-pushed the nemesis_rework_v2 branch from 646d270 to d60fedf Compare May 27, 2025 14:15

pehala added 5 commits May 27, 2025 16:27

chore(nemesis): Update tests to reflect class change

ea53d28

chore(nemesis): Update nemesis config template

fa698c2

Updates to use the new method of selecting specific Nemesis

chore(nemesis): Regenerate all pipelines and nemesis.yml

734dfbe

feature(nemesis): Update CategoricalMonkey for new format

8fd2f96

* Change it to be compatible with nemesis rework * Improve it to use nemesis filtering and seed.

pehala force-pushed the nemesis_rework_v2 branch from d60fedf to 8fd2f96 Compare May 27, 2025 14:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feature(nemesis): Move code responsibility to nemesis classes #10935

feature(nemesis): Move code responsibility to nemesis classes #10935

Uh oh!

pehala commented May 25, 2025 •

edited

Loading

Uh oh!

pehala commented May 25, 2025

Uh oh!

soyacz left a comment

Uh oh!

soyacz May 27, 2025

Uh oh!

pehala May 27, 2025 •

edited

Loading

Uh oh!

soyacz May 27, 2025

Uh oh!

pehala May 27, 2025

Uh oh!

pehala commented May 27, 2025

Uh oh!

pehala commented May 27, 2025

Uh oh!

Uh oh!

		exclusive: bool = False # True, if the nemesis is exclusive, i.e. it should not run in parallel with other nemeses


		class Nemesis:

feature(nemesis): Move code responsibility to nemesis classes #10935

Are you sure you want to change the base?

feature(nemesis): Move code responsibility to nemesis classes #10935

Uh oh!

Conversation

pehala commented May 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why?

How?

Current state of the PR

Future

Testing

PR pre-checks (self review)

Uh oh!

pehala commented May 25, 2025

Uh oh!

soyacz left a comment

Choose a reason for hiding this comment

Uh oh!

soyacz May 27, 2025

Choose a reason for hiding this comment

Uh oh!

pehala May 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

soyacz May 27, 2025

Choose a reason for hiding this comment

Uh oh!

pehala May 27, 2025

Choose a reason for hiding this comment

Uh oh!

pehala commented May 27, 2025

Uh oh!

pehala commented May 27, 2025

Uh oh!

Uh oh!

pehala commented May 25, 2025 •

edited

Loading

pehala May 27, 2025 •

edited

Loading