introduce new FeedbackActor #793
Conversation
Left some minor comments but overall LGTM
    except Exception as e:
        self.logger.error("Error processing client states: %s", e)

    def receiveMsg_StartFeedbackActor(self, msg, sender):
This message should only be handled once all the workers have updated the client mapping.
We need some kind of validation to confirm that all workers have updated the client mappings, send an ack back to the calling actor, and then have that actor start the FeedbackActor.
Good point, noting this - I'll add it in a follow-up revision.
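A minimal sketch of that ack-based handshake, assuming hypothetical message classes (UpdateClientMapping, ClientMappingAck, StartFeedbackActor) and a coordinating actor that knows how many workers it dispatched to; none of these names are taken from the PR itself:

    from thespian.actors import ActorTypeDispatcher

    class UpdateClientMapping:
        def __init__(self, mapping):
            self.mapping = mapping

    class ClientMappingAck:
        pass

    class StartFeedbackActor:
        pass

    class Coordinator(ActorTypeDispatcher):
        """Starts the FeedbackActor only after every worker has acked its mapping update."""

        def __init__(self):
            super().__init__()
            self.expected_acks = 0
            self.received_acks = 0
            self.feedback_actor = None

        def broadcast_mapping_update(self, workers, mapping, feedback_actor):
            # Remember how many acks are outstanding and where the FeedbackActor lives.
            self.expected_acks = len(workers)
            self.received_acks = 0
            self.feedback_actor = feedback_actor
            for worker in workers:
                self.send(worker, UpdateClientMapping(mapping))

        def receiveMsg_ClientMappingAck(self, msg, sender):
            self.received_acks += 1
            if self.received_acks == self.expected_acks:
                # Every worker has confirmed the new mapping; only now start the feedback loop.
                self.send(self.feedback_actor, StartFeedbackActor())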
This actor should also call handle_state, not receiveMsg_SharedClientStateMessage.
This message will start the wake-up loop, which in turn will begin calling handle_state.
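For context, a minimal sketch of that kind of wake-up loop in Thespian (the one-second interval and the handle_state body are assumptions, not taken from this PR):

    from datetime import timedelta
    from thespian.actors import ActorTypeDispatcher

    class StartFeedbackActor:
        pass

    class FeedbackActorSketch(ActorTypeDispatcher):
        WAKEUP_INTERVAL = timedelta(seconds=1)  # assumed polling interval

        def receiveMsg_StartFeedbackActor(self, msg, sender):
            # The start message only arms the timer; it does not call handle_state directly.
            self.wakeupAfter(self.WAKEUP_INTERVAL)

        def receiveMsg_WakeupMessage(self, msg, sender):
            # Each wakeup drives one pass of the state handling, then re-arms the timer.
            self.handle_state()
            self.wakeupAfter(self.WAKEUP_INTERVAL)

        def handle_state(self):
            # Placeholder: inspect queued errors and decide whether to scale up or down.
            pass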
        self.messageQueue.clear()
        self.sleep_start_time = time.perf_counter()

    def scale_up(self):
Bringing up 1 client at a time will be time consuming, especially when spinning up 1000s of clients.
How about we bring up clients in steps of n or multiples of n, and also keep a check on self.state: if it is set to scale down whenever an error message is received, we can skip the scale-up logic and do nothing? WDYT?
You're right, 1 client/sec is time consuming. For now I changed it to 5/sec, but I can increase it to 10 or more if you think it should ramp up faster.
I think the scale_up function should just take a number of clients as an argument and attempt to activate that many clients round-robin style until it hits the target. For now it's linear, but the logic for changing the number of clients in multiples or steps can be added in a separate function. What do you think of this:
We check the state every second and choose whether to scale up or down depending on the message queue and a couple of other factors (whether we errored recently, or scaled up too recently). For future scale-up methods, we can keep another class attribute n that we manipulate in a separate function if we want to scale up in different ways, e.g. exponentially or percentage-based, and then call scale_up to activate that many clients. Does that make sense?
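A rough sketch of that shape, assuming a client dictionary keyed by worker with an active/paused flag per client (the names clients_by_worker and this scale_up signature are illustrative, not the PR's actual code):

    class ScalingSketch:
        def __init__(self, clients_by_worker):
            # Assumed shape: {worker_id: {client_id: "active" | "paused"}}
            self.clients_by_worker = clients_by_worker
            self.n = 5  # clients activated per pass; a separate function could grow this value

        def scale_up(self, num_clients):
            # Activate up to num_clients paused clients, spreading them round-robin across workers.
            activated = 0
            made_progress = True
            while activated < num_clients and made_progress:
                made_progress = False
                for clients in self.clients_by_worker.values():
                    if activated >= num_clients:
                        break
                    # Wake at most one paused client per worker per pass.
                    for client_id, state in clients.items():
                        if state == "paused":
                            clients[client_id] = "active"
                            activated += 1
                            made_progress = True
                            break
            return activated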
During the development of this PR, we discovered a bug where the OSB benchmark would not complete because Workers were not reaching their designated joinpoints. This was caused by how clients handled the 'paused' state: instead of progressing through their schedules, they would enter a loop where they slept. As a result, many clients remained stuck, which prevented the test from completing. To fix this, we updated the logic so clients now continue executing as normal, but without sending requests while they are meant to be paused. This change allows the FeedbackActor to control the number of active clients sending requests to a cluster and throttle the load generation without interrupting the overall flow of the benchmark.
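Conceptually, the client loop changed from sleeping in place to skipping the request while still advancing through its schedule; a simplified illustration, not the PR's actual executor code:

    class ClientSketch:
        """Simplified illustration of the fix; not the real OSB client."""

        def __init__(self):
            self.is_paused = False

        def run(self, schedule):
            for step in schedule:
                # Previously a paused client slept here in a loop and never advanced,
                # so it could not reach its joinpoint.
                if self.is_paused:
                    # Fixed behaviour: skip issuing the request but keep stepping
                    # through the schedule so joinpoints are still reached.
                    continue
                self.send_request(step)

        def send_request(self, step):
            pass  # placeholder for sending a request to the target cluster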
@OVI3D0 It would be great if you could add some details around how clients report errors to the feedback actor using a blocking queue, and how this differs from the method you first implemented.
We also introduced the use of a shared queue for error reporting. We decided to go with a shared Queue between the FeedbackActor and individual clients because Thespian actors handle messaging in a single-threaded, synchronized fashion unless using an actor troupe (see docs). Because of the nature of load testing and the strong chance of hundreds or even thousands of clients failing simultaneously, the original approach would often cause significant lag due to the overwhelming volume of messages being sent synchronously. With shared multiprocessing queues and locks, individual clients can now simply enqueue failed-request metadata to the shared queue without sending any messages, and the FeedbackActor can freely snoop through this queue at regular intervals. This has proven to be far more scalable than the previous implementation in tests with thousands of active clients.
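A rough sketch of that flow, assuming a multiprocessing.Manager queue that is created up front and handed to every client (the function names here are illustrative):

    import multiprocessing
    import queue

    def make_shared_error_queue():
        # Created once and passed to every client when it is spawned.
        return multiprocessing.Manager().Queue()

    def report_failure(error_queue, client_id, request_meta):
        # Client side: enqueue failed-request metadata instead of sending an actor message.
        error_queue.put({"client_id": client_id, "meta": request_meta})

    def drain_failures(error_queue, max_items=1000):
        # FeedbackActor side: called at each regular interval to inspect recent failures
        # without blocking on an empty queue.
        items = []
        for _ in range(max_items):
            try:
                items.append(error_queue.get_nowait())
            except queue.Empty:
                break
        return items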
Description
Introduces a new FeedbackActor. Currently, it is not connected to any moving parts of the OSB system and cannot yet be invoked.
Includes a message-queue handling system, logic to handle error messages, scaling of clients up 1 client at a time and down 10% at a time, and a sleep after each scale-down to let the target cluster recover.
The scale up/down logic assumes the shared client dictionaries will look like:
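For illustration only, assuming one nested dictionary per worker that maps client IDs to an active/paused flag (this is an assumption, not the PR's actual layout):

    # Assumed illustration only; the actual structure is defined in the PR.
    shared_client_states = {
        "worker_0": {0: "active", 1: "active", 2: "paused"},
        "worker_1": {3: "active", 4: "paused", 5: "paused"},
    }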
Based on the RFC introduced recently
Issues Resolved
#790
Testing
make it
make test
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.