improvement(nemesis): Rework nemesis discovery #10502

Draft
wants to merge 4 commits into
base: master

pehala
Contributor

@pehala pehala commented Mar 25, 2025

Problem statement

Currently, the Nemesis class has too many responsibilities, which causes problems. One of them is that it requires parameters to initialize and does a lot of logic in the init function, so you need to provide those parameters even if you need only part of the class. This can be seen in test_nemesis_sisyphus, which has to mock Tester, Cluster and Node just to filter nemesis. None of the methods it uses requires any of these; they are provided only because of the aforementioned problem.

Solution

Extract Nemesis discovery (i.e. gathering all disrupt methods and matching them with subclasses) into a separate class. While doing so, also reduce the number of discovery methods needed. Only one filtering method now remains (NemesisRegistry.get_disrupt_methods), and its input is a logical phrase. To also allow exporting properties to nemesis.yaml/nemesis_classes.yaml, add gather_properties.

nemesis.yaml was also changed from a list of strings to a full dict. The dict is sorted, so the output is essentially the same, but it requires less processing to write and read.

All Monkeys that only filtered by flags (LimitedChaosMonkey, GeminiNonDisruptiveChaosMonkey, GeminiChaosMonkey, NetworkMonkey, NonDisruptiveMonkey, DisruptiveMonkey, FreeTierSetMonkey) were removed and replaced by nemesis_selector usage.
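
To make the idea concrete, here is a minimal sketch of what a registry along these lines could look like. It is not the code from this PR: only the names NemesisRegistry, get_disrupt_methods and gather_properties come from the description above. FakeNemesis, its subclasses, the flag names, and the AST-based evaluation of the logical phrase are all assumptions made for illustration.

```python
import ast


def phrase_matches(logical_phrase: str, flags: dict) -> bool:
    """Evaluate a phrase such as 'disruptive and not networking' against flags."""
    if not logical_phrase.strip():
        return True  # assumption: an empty phrase selects everything

    def evaluate(node):
        if isinstance(node, ast.BoolOp):
            results = [evaluate(value) for value in node.values]
            return all(results) if isinstance(node.op, ast.And) else any(results)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not evaluate(node.operand)
        if isinstance(node, ast.Name):
            return bool(flags.get(node.id, False))  # unknown flags count as False
        raise ValueError(f"unsupported element in phrase: {ast.dump(node)}")

    return evaluate(ast.parse(logical_phrase.strip(), mode="eval").body)


class NemesisRegistry:
    """Hypothetical registry: needs only a base class, no runtime objects."""

    def __init__(self, base_class: type, flag_names: tuple):
        self.base_class = base_class
        self.flag_names = flag_names

    def _subclasses(self) -> list:
        # Walk the subclass tree and return classes sorted by name.
        found, pending = [], list(self.base_class.__subclasses__())
        while pending:
            cls = pending.pop()
            if cls not in found:
                found.append(cls)
                pending.extend(cls.__subclasses__())
        return sorted(found, key=lambda cls: cls.__name__)

    def gather_properties(self) -> dict:
        # Sorted dict of class name -> flags, ready to be dumped to a file.
        return {
            cls.__name__: {flag: bool(getattr(cls, flag, False)) for flag in self.flag_names}
            for cls in self._subclasses()
        }

    def get_disrupt_methods(self, logical_phrase: str) -> list:
        properties = self.gather_properties()
        return [
            cls.disrupt
            for cls in self._subclasses()
            if phrase_matches(logical_phrase, properties[cls.__name__])
        ]


class FakeNemesis:  # hypothetical stand-in for the real Nemesis base class
    disruptive = False
    networking = False

    def disrupt(self):
        raise NotImplementedError


class RestartNode(FakeNemesis):
    disruptive = True

    def disrupt(self):
        return "restart"


class BlockNetwork(FakeNemesis):
    disruptive = True
    networking = True

    def disrupt(self):
        return "block"


class NoOp(FakeNemesis):
    def disrupt(self):
        return "noop"


registry = NemesisRegistry(FakeNemesis, ("disruptive", "networking"))
selected = registry.get_disrupt_methods("disruptive and not networking")
print([method.__qualname__ for method in selected])
```

Note that the registry is constructed without any Tester, Cluster or Node object, which is the property the problem statement asks for.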

Changelog

  • Add NemesisRegistry class
    • Responsible for gathering and filtering available disrupt methods
    • Semi-generic: theoretically it could be used on any class that has a disrupt method, not only Nemesis
      • Currently the only requirement is that it has a disrupt method
    • Requires no runtime information to initialize
  • Limit source code searching to the disrupt method
    • This improvement reduces the run time of test_nemesis_sisyphus from 13 sec to 3 sec, including Python env initialization
  • Rewrite test_nemesis_sisyphus to demonstrate improvements
  • Remove Monkeys that use filtering by flags
  • Change nemesis.yaml structure into a dict

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

@pehala pehala requested a review from fruch March 25, 2025 14:35
@pehala pehala force-pushed the extract_nemesis_discovery branch 2 times, most recently from 6ba1735 to 2251dae Compare March 25, 2025 15:06
@fruch fruch requested a review from roydahan April 1, 2025 08:29
Contributor

@fruch fruch left a comment


In general the direction LGTM

I think we can throw away a few more things while we are at it (without too much work).

@fruch fruch requested a review from a team April 1, 2025 08:46
@pehala pehala force-pushed the extract_nemesis_discovery branch from 2251dae to df1154c Compare April 1, 2025 15:11
@pehala
Contributor Author

pehala commented Apr 1, 2025

v2 - Dropped the backwards compatibility of this patch and simplified the logic:

  • Unify filtering of nemesis
    • Should remove a lot of unnecessary logic; the only filtering method is now get_disrupt_methods(logical_phrase)
  • Remove DEPRECATED_NEMESIS
  • Change structure of nemesis.yml
    • It is now a dict instead of a list of strings. I think this is easier to parse
  • Make ComplexNemesis classes that only filtered based on flags inherit from SisyphusMonkey (needs testing)
    • Reduces code and unifies logic with the SisyphusMonkey
  • Remove kubernetes flag from NemesisRegistry, move the logic into build_list_of_disruptions

@vponomaryov
Contributor

The improvement(nemesis): Remove deprecated Nemesis commit can and should be moved out to a separate PR and be merged.

The same about the fix(nemesis.yml): Update nemesis.yml and nemesis_classes.yml.

After that, the first and 4th commits could be combined into one, merging their nice descriptions.

It will simplify reviewing the changes of this PR.

Then,
I am already lost in the following nemesis-related PRs:

Can we consider this one (#10502) as a first in the chain and get it merged sooner than later?
Idea is great, let's not waste time...

@pehala
Contributor Author

pehala commented Apr 2, 2025

The improvement(nemesis): Remove deprecated Nemesis commit can and should be moved out to a separate PR and be merged.
The same about the fix(nemesis.yml): Update nemesis.yml and nemesis_classes.yml.

I will extract those

The same about the fix(nemesis.yml): Update nemesis.yml and nemesis_classes.yml.
Can we consider this one (#10502) as a first in the chain and get it merged sooner than later?
Idea is great, let's not waste time...

We can make this one first, I can edit the rest to rely on this one

EDIT: Created #10575 and #10574

@pehala
Contributor Author

pehala commented Apr 2, 2025

v3:

  • Removed Nemesis classes that filter only by flags
    • LimitedChaosMonkey, GeminiNonDisruptiveChaosMonkey, GeminiChaosMonkey, NetworkMonkey,
      NonDisruptiveMonkey, DisruptiveMonkey, FreeTierSetMonkey

The next version should be the final one, after the extracted PRs are incorporated and the final comments are addressed.

@pehala pehala force-pushed the extract_nemesis_discovery branch 3 times, most recently from cdbab07 to 3eb6b15 Compare April 3, 2025 07:23
@pehala
Contributor Author

pehala commented Apr 3, 2025

v4:

  • Rebased on top of master and squashed commits
  • Add docstrings

I also updated the cover letter and started a test run to verify correctness, as this patch now affects actual test cases. I will take this out of draft once the run has finished and passed.

sdcm/nemesis.py Outdated
AllMonkey, MdcChaosMonkey,
DisruptiveMonkey, NonDisruptiveMonkey, GeminiNonDisruptiveChaosMonkey,
GeminiChaosMonkey, NetworkMonkey, SisyphusMonkey,
COMPLEX_NEMESIS = [NoOpMonkey, ChaosMonkey, ScyllaCloudLimitedChaosMonkey,
Contributor


I wonder how it worked with NemesisSequence - this one should also be excluded. How does the code know that we should not include disrupt_run_unique_sequence?

Contributor Author


It does not. It seems disrupt_run_unique_sequence was always collected, see:

- disrupt_run_unique_sequence:

Contributor


Indeed, I see it was run 41 times in 2025.1 testing. But I'm not sure we should include it - this is mainly used in perf tests.

@pehala pehala force-pushed the extract_nemesis_discovery branch from 2b8b3e0 to cd2b90d Compare April 4, 2025 13:22
@pehala
Contributor Author

pehala commented Apr 4, 2025

v5 - Unify disrupt method execution mechanisms:
Currently, two separate mechanisms exist to call disrupt methods:

  • build_list_of_disruptions, used by SisyphusMonkey
  • call_random_disrupt_method, used by all complex monkeys

These two mechanisms differ in the YAML properties they support, in their usage of random_seed, and in how they discover disrupt methods. The former uses NemesisRegistry, the latter filters the methods directly.

  • Change how build_list_of_disruptions works
    • Rename it to build_disruptions_by_selector
    • Change it to return a list instead of working on self.disruptions_list
  • Add build_disruptions_by_name as a replacement to call_random_disrupt_method
  • Move nemesis_multiply_factor option to shuffle_list_of_disruptions instead
    • So it can be used in build_disruptions_by_name
  • Remove factor scaling based on test duration
    • It seemed inconsistent and in conflict with nemesis_multiply_factor
  • Complex Nemesis now use build_disruptions_by_name and call_next_nemesis() similar to SisyphusMonkey
  • Remove AllMonkey and ChaosMonkey, they were the same as SisyphusMonkey after the changes
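
The unified flow in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the function names are taken from the bullets, but their signatures, the CATALOG structure, the demo disrupt functions, and the use of a plain predicate in place of the logical-phrase selector are all hypothetical.

```python
import random
from typing import Callable, Optional


def disrupt_restart() -> str:
    return "restart"


def disrupt_block_network() -> str:
    return "block"


def disrupt_noop() -> str:
    return "noop"


# Hypothetical catalog: disrupt method name -> (callable, flags).
CATALOG = {
    "disrupt_restart": (disrupt_restart, {"disruptive": True}),
    "disrupt_block_network": (disrupt_block_network, {"disruptive": True, "networking": True}),
    "disrupt_noop": (disrupt_noop, {"disruptive": False}),
}


def build_disruptions_by_selector(selector: Callable[[dict], bool], catalog: dict) -> list:
    """Return (rather than store) the methods whose flags satisfy the selector."""
    return [catalog[name][0] for name in sorted(catalog) if selector(catalog[name][1])]


def build_disruptions_by_name(names: list, catalog: dict) -> list:
    """Return the methods for an explicit list of names, in the given order."""
    unknown = [name for name in names if name not in catalog]
    if unknown:
        raise ValueError(f"unknown disrupt methods: {unknown}")
    return [catalog[name][0] for name in names]


def shuffle_list_of_disruptions(
    disruptions: list, seed: Optional[int] = None, multiply_factor: int = 1
) -> list:
    """Repeat the list multiply_factor times, then shuffle it reproducibly."""
    if multiply_factor < 1:
        raise ValueError("multiply_factor must be at least 1")
    expanded = list(disruptions) * multiply_factor
    random.Random(seed).shuffle(expanded)  # seeded so runs can be replayed
    return expanded


by_name = build_disruptions_by_name(["disrupt_restart", "disrupt_noop"], CATALOG)
by_selector = build_disruptions_by_selector(lambda flags: flags.get("disruptive", False), CATALOG)

# Both builders feed the same shuffle step, so seed and factor apply to both.
plan = shuffle_list_of_disruptions(by_name, seed=7, multiply_factor=3)
print([method.__name__ for method in plan])
```

Because both builders return plain lists, the multiply factor and the seed are applied in one place regardless of how the methods were chosen.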

This change could theoretically be extracted out of this PR, but it would mean only one of the execution paths would call the new code. I consider ComplexNemesis mostly outdated, so this should not be a big issue.

@pehala pehala force-pushed the extract_nemesis_discovery branch from cd2b90d to 4d81b4a Compare April 4, 2025 13:24
@pehala pehala changed the title improvement(nemesis): Extract nemesis discovery improvement(nemesis): Rework nemesis discovery Apr 4, 2025
@pehala pehala force-pushed the extract_nemesis_discovery branch 2 times, most recently from 7aa77d8 to 0e034cf Compare April 4, 2025 18:46
pehala added 4 commits April 8, 2025 07:13
…lass

* Add NemesisRegistry class
  * It is responsible for discovering and filtering NemesisClasses and disrupt methods
  * Doesn't need to instantiate the Nemesis class
  * Reduced number of methods compared to the previous implementation:
    * get_disrupt_methods for filtering, takes in logical_phrase
    * gather_properties for exporting
* Change nemesis.yml structure
  * Now it is a pure dict, instead of list of strings
* test_nemesis_sisyphus.py no longer needs Fake classes to generate the .yml files
* Speed up nemesis discovery by checking source code only for the disrupt method
* Change nemesis binding to be based on Class instead of an Instance
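
One possible reading of "binding based on Class instead of an Instance" is sketched below: disrupt methods are discovered as plain functions on the class, and only bound to an instance at execution time. How the PR actually does this is not shown in this thread, so DemoNemesis, discover_disrupt_methods and the disrupt_ prefix check are assumptions for illustration only.

```python
import types


class DemoNemesis:  # hypothetical class standing in for a Nemesis subclass
    def __init__(self, target_node: str):
        self.target_node = target_node

    def disrupt_restart(self) -> str:
        return f"restarting {self.target_node}"

    def helper(self) -> None:  # not a disrupt method, must not be collected
        return None


def discover_disrupt_methods(cls: type) -> dict:
    """Collect disrupt_* functions defined directly on cls, without creating an instance."""
    return {
        name: attr
        for name, attr in vars(cls).items()
        if name.startswith("disrupt_") and callable(attr)
    }


# Discovery works on the class alone; no constructor arguments are required.
discovered = discover_disrupt_methods(DemoNemesis)

# Binding happens later, once a real instance exists.
runner = DemoNemesis("node-1")
bound = types.MethodType(discovered["disrupt_restart"], runner)
print(bound())
```

Note that vars(cls) only sees attributes defined on the class itself; inherited disrupt methods would need a walk over the class hierarchy.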
LimitedChaosMonkey, GeminiNonDisruptiveChaosMonkey, GeminiChaosMonkey, NetworkMonkey,
NonDisruptiveMonkey, DisruptiveMonkey, FreeTierSetMonkey, SlaNemesis
…_methods

Previously it worked because we already had a disrupt_add_remove_dc disrupt method.
Change it so it actually tests @Nemesis.add_disrupt_method
Currently, two separate mechanisms exist to call disrupt methods:
build_list_of_disruptions used by SisyphusMonkey
call_random_disrupt_method used by all complex monkeys
These two mechanisms differ in the YAML properties they support, in their usage of random_seed,
and in how they discover disrupt methods.
The former uses NemesisRegistry, the latter filters the methods directly.
This commit unifies both use cases under one code path

Remove AllMonkey and ChaosMonkey as they can be replaced by SisyphusMonkey
@pehala pehala force-pushed the extract_nemesis_discovery branch from 3f3fa05 to dfeca9c Compare April 8, 2025 05:14
4 participants