
introduce new FeedbackActor #793

Open · wants to merge 15 commits into main

Conversation

@OVI3D0 (Member) commented Mar 19, 2025

Description

Introduces a new FeedbackActor. Currently, it is not connected to any moving parts of the OSB system and cannot yet be invoked.

Includes a message-queue handling system, logic for handling error messages, scaling logic that brings clients up one at a time and down 10% at a time, and a post-scale-down sleep to let the target cluster recover.

The scale up/down logic assumes the shared client dictionaries will look like:

{
    worker-0: {client-0: <pause status>, client-1: <pause status>, ...},
    ...
    worker-n: {..., client-k-1: <pause status>, client-k: <pause status>}
}
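
For illustration, a concrete instance of that shape might look like the sketch below; the boolean pause flags are an assumption, since the PR does not pin down the `<pause status>` type:

```python
# Illustrative only: booleans for <pause status> are an assumption,
# not necessarily the type the PR uses.
shared_client_dict = {
    "worker-0": {"client-0": False, "client-1": True},
    "worker-1": {"client-2": False, "client-3": False},
}
```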

This design is based on the recently introduced RFC.

Issues Resolved

#790

Testing

  • New functionality includes testing

make it + make test


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@OVI3D0 OVI3D0 marked this pull request as ready for review March 19, 2025 20:37
Signed-off-by: Michael Oviedo <[email protected]>
@OVI3D0 OVI3D0 linked an issue Mar 20, 2025 that may be closed by this pull request
OVI3D0 added 2 commits March 21, 2025 20:37
…nt dictionaries and send them to the FeedbackActor

Signed-off-by: Michael Oviedo <[email protected]>
@IanHoang (Collaborator) left a comment

Left some minor comments but overall LGTM

@OVI3D0 force-pushed the feedback-actor branch 2 times, most recently from 83a605c to 9749d60 on March 25, 2025 21:12
        except Exception as e:
            self.logger.error("Error processing client states: %s", e)

    def receiveMsg_StartFeedbackActor(self, msg, sender):
Collaborator:

This message should only be sent once all the workers have updated the client mapping. We need some kind of validation to confirm that all workers have updated the client mappings, send an ack back to the calling actor, and then that actor can start the FeedbackActor.

@OVI3D0 (Member Author) replied Mar 26, 2025:

Good point, noting this. I'll add this in a follow-up revision.
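
A minimal sketch of the ack-based gating suggested above, assuming a Thespian ActorTypeDispatcher; the UpdatedClientMapping message and the coordinator actor are hypothetical illustrations, not part of this PR:

```python
from thespian.actors import ActorTypeDispatcher

class UpdatedClientMapping:
    """Hypothetical ack a worker sends once its client mapping is updated."""

class StartFeedbackActor:
    """Message the FeedbackActor starts on (named as in this PR)."""

class CoordinatorSketch(ActorTypeDispatcher):
    expected_workers = 0      # set by whoever configures this actor
    feedback_actor = None     # address of the FeedbackActor
    acked_workers = None

    def receiveMsg_UpdatedClientMapping(self, msg, sender):
        # Collect one ack per worker; only start the FeedbackActor once
        # every worker has confirmed its client mapping is updated.
        if self.acked_workers is None:
            self.acked_workers = set()
        self.acked_workers.add(sender)
        if len(self.acked_workers) == self.expected_workers:
            self.send(self.feedback_actor, StartFeedbackActor())
```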

        except Exception as e:
            self.logger.error("Error processing client states: %s", e)

    def receiveMsg_StartFeedbackActor(self, msg, sender):
Collaborator:

This actor should also call handle_state, not receiveMsg_SharedClientStateMessage.

@OVI3D0 (Member Author) replied:

This message starts the wake-up loop, which in turn begins calling handle_state.

        self.messageQueue.clear()
        self.sleep_start_time = time.perf_counter()

    def scale_up(self):
Collaborator:

Bringing up 1 client at a time will be time-consuming, especially when spinning up thousands of clients.
How about we bring up clients in steps of n (or multiples of n) and also keep a check on self.state: if it is set to scale down whenever an error message is received, we can simply skip the scale-up logic and do nothing? WDYT?

@OVI3D0 (Member Author) replied Mar 26, 2025:

You're right, 1 client/sec is time-consuming. For now, I changed it to 5/sec, but I can increase it to 10 or more if you think it should be faster to start.

I think the scale_up function should just take a number of clients as an argument and attempt to activate that many clients, round-robin style, until it hits the target. For now it's just linear, but logic for changing the number of clients in multiples or steps can be added in a separate function. What do you think of this:

We check the state every second and choose whether to scale up or down depending on the message queue and a couple of other factors (whether we errored recently, or scaled up too recently). For future scale-up methods, we can keep another class attribute n that we manipulate in a separate function if we want to scale up in different ways, e.g. exponentially or percentage-based, and then call scale_up to activate that many clients. Does that make sense?
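
A rough sketch of the round-robin activation described above, assuming the shared client dictionary shape from the PR description ({worker: {client: paused}}); the attribute names are illustrative rather than the PR's actual code:

```python
def scale_up(self, num_clients):
    """Unpause up to num_clients clients, cycling across workers round-robin."""
    activated = 0
    workers = list(self.shared_client_dict.keys())
    while activated < num_clients:
        progressed = False
        for worker in workers:
            # Unpause at most one client per worker per pass.
            for client, paused in self.shared_client_dict[worker].items():
                if paused:
                    self.shared_client_dict[worker][client] = False
                    activated += 1
                    progressed = True
                    break
            if activated >= num_clients:
                break
        if not progressed:
            # Nothing left to unpause; stop early.
            break
```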

@OVI3D0 (Member Author) commented Apr 3, 2025

During the development of this PR, we discovered a bug where the OSB benchmark would not complete because Workers were not reaching their designated joinpoints. This was caused by how clients handled the 'paused' state: instead of progressing through their schedules, they would enter a loop where they slept (asyncio.sleep(1)) repeatedly.

As a result, many clients remained stuck, which prevented the test from completing.

To fix this, we updated the logic so clients now continue executing as normal but skip sending requests while they are meant to be paused. This change allows the FeedbackActor to control the number of active clients sending requests to a cluster and throttle load generation without interrupting the overall flow of the benchmark.
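
A minimal sketch of the fixed client behavior, assuming an async task loop; is_paused, schedule, and execute_request are illustrative names, not the PR's actual code:

```python
async def run(self):
    async for task in self.schedule:
        if self.is_paused:
            # Keep progressing through the schedule (so joinpoints are
            # still reached), but skip sending the request while paused.
            continue
        await self.execute_request(task)
```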

@rishabh6788 (Collaborator) commented:

@OVI3D0 It would be great if you could add some details about how clients report errors to the FeedbackActor using a blocking queue, and how it differs from the method you first implemented.

@OVI3D0 (Member Author) commented Apr 3, 2025

We also introduced the use of multiprocessing.Queue and multiprocessing.Lock to safely coordinate feedback messaging between multiple worker processes and the central FeedbackActor.

We decided to go with a shared Queue between the FeedbackActor and individual clients because Thespian actors handle messaging in a single-threaded, synchronized fashion unless an actor troupe is used (see docs).

Because of the nature of load testing and the strong chance of hundreds or even thousands of clients failing simultaneously, the original approach would often cause significant lag due to the overwhelming volume of messages being sent synchronously.

With shared multiprocessing queues and locks, individual clients can now simply enqueue failed-request metadata to the shared queue without sending any actor messages, and the FeedbackActor can poll this queue at regular intervals. This has proven far more scalable than the previous implementation in tests with thousands of active clients.
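
A minimal sketch of the shared-queue path described above; the queue is assumed to be created once by the coordinating process and handed to both the clients and the FeedbackActor, and the function names are illustrative, not the PR's actual code:

```python
import queue
from multiprocessing import Queue

error_queue = Queue()  # shared between worker processes and the FeedbackActor

# Client side: record failed-request metadata without sending an actor message.
def report_error(client_id, error):
    error_queue.put({"client_id": client_id, "error": str(error)})

# FeedbackActor side: drain everything currently queued, e.g. on each wakeup.
def drain_errors():
    errors = []
    while True:
        try:
            errors.append(error_queue.get_nowait())
        except queue.Empty:
            break
    return errors
```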

Development

Successfully merging this pull request may close these issues.

Implement new FeedbackActor for load testing in OSB
3 participants