
[RFC/PROPOSAL]: OSB Feedback Mechanism for Redline Testing #785


Summary

This design document proposes implementing a feedback mechanism in OpenSearch Benchmark (OSB) that enables dynamic adjustment of client load based on cluster performance metrics. The primary goal is to support a new ramp-up feature that automatically finds cluster breaking points by monitoring real-time performance indicators and adjusting the number of active clients accordingly.

Background

Historically, OSB operated with a fixed number of clients throughout a benchmark run. Determining the optimal load level required either manual intervention or a feature introduced by the KNN team that repeats an operation with different search client counts during a benchmark: a user might specify "clients_list": [1,5] in their parameters, and OSB would schedule the task to run once with 1 client and once with 5 clients. While OSB effectively distributes tasks across workers and clients, it cannot dynamically adjust load based on cluster performance, which makes it difficult to precisely identify a cluster's breaking point, even though finding the maximum load a cluster can handle is clearly important to users. The newly introduced ramp-up feature addresses this up to a point by slowly bringing up clients one by one, but the user still has to watch the metrics closely to identify the breaking point.

Problem Statement

Users want to identify the exact breaking point of an OpenSearch cluster under load, but the current manual process is time-consuming and imprecise. The feature introduced by the KNN team lets users run an operation with several client counts, but they still have to guess the list of client numbers, and those guesses may not bracket the true limit. The feedback mechanism proposed here introduces a fully automated way to determine the maximum load a cluster can handle, offering a more granular level of control and allowing OSB to accurately gauge the maximum load an OpenSearch cluster can withstand.

Stakeholders/Customers

  • AOS/AOSS and OpenSearch users who want to find a cluster's performance limits
  • OpenSearch developers implementing new features, who want to ensure there are no performance regressions/bottlenecks or want to compare against baseline thresholds.
  • Performance engineers who are interested in finding the breaking point of their OpenSearch clusters before taking it into production or evaluating scaling needs for a high-traffic event.

User Stories

  • As an OpenSearch user, I want to discover my cluster’s performance limits during benchmarking so I can confidently size my production deployment.
  • As an OpenSearch developer, I want to automatically determine cluster limits without manual benchmarking iterations so I can efficiently test different cluster configurations.
  • As an OpenSearch developer, I want to quickly measure the performance impact of my code changes so I can identify potential bottlenecks before merging.
  • As an OpenSearch developer, I want to automatically detect performance regressions during testing so I can ensure new releases maintain or improve performance standards.

Current Design

[Diagram: current OSB actor-model architecture]

Brief Summary of Actor Model System

The Actor Model is a conceptual framework for designing concurrent and distributed systems. Instead of using shared memory and locks, it uses message passing between isolated units called "actors." OpenSearch Benchmark is built on the actor model. Consider each actor as a separate process and each client as a separate thread. By default, OSB creates n worker actors, where n is the number of CPU cores, and then assigns clients equally to each worker process. For example, with 10 worker actors and 20 clients, each worker actor executes 2 clients concurrently.
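
As a rough illustration of that allocation (a simplified sketch, not OSB source; assign_clients_to_workers is a hypothetical helper):

import os

def assign_clients_to_workers(total_clients, num_workers=None):
    """Distribute client IDs round-robin across worker actors."""
    num_workers = num_workers or os.cpu_count()
    assignment = {worker_id: [] for worker_id in range(num_workers)}
    for client_id in range(total_clients):
        assignment[client_id % num_workers].append(client_id)
    return assignment

# 20 clients spread over 10 worker actors -> 2 clients per worker
print(assign_clients_to_workers(20, 10))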

Core Actor Model Concepts

  1. Actors as Fundamental Units: Each actor is an independent computational entity with:

    • Private state that can't be directly accessed by other actors
    • The ability to process messages sequentially
    • The ability to create new actors
    • The ability to send messages to other actors
  2. Message Passing: All communication happens through asynchronous messages.

  3. Location Transparency: Actors don't need to know where other actors are physically located.

A brief summary of all the actors implemented by OSB can be found here.
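
For context, OSB's actors are implemented on top of the Thespian library. A minimal Thespian actor, shown purely to illustrate the message-passing style (EchoActor is not part of OSB):

from thespian.actors import Actor, ActorSystem

class EchoActor(Actor):
    def receiveMessage(self, message, sender):
        # Messages are processed one at a time; state stays private to the actor.
        self.send(sender, "received: {}".format(message))

if __name__ == "__main__":
    system = ActorSystem()
    echo = system.createActor(EchoActor)
    print(system.ask(echo, "hello", 5))  # -> "received: hello"
    system.shutdown()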

Here are the key components from the above diagram:
BenchmarkActor: A base class for several actors in the system. Keeps track of and manages child actors such as the WorkerCoordinatorActor.

WorkerCoordinatorActor: Creates and manages the lifecycle of the worker actors. Also handles updating the samples collected by individual clients.

Worker: Manages its allocated search clients. The number of workers provisioned is based on the number of cores on the load generation host (where OSB is running).

AsyncIoAdapter: Creates individual clients. See the code here

Client: A search client. The number of clients provisioned is defined in the workload; clients handle indexing and sending requests to the cluster (see the AsyncExecutor class).

Currently, OSB can slowly scale client numbers up to a target throughput. @rishabh6788 recently introduced a ramp-up task property which, together with the new --load-test-qps command-line flag, allows users to ramp up to a target throughput/client count over time with a simple flag and an appropriate test procedure:

{
  "name": "timed-mode-test-procedure",
  "schedule": [
    {
       "operation": "range",
       "warmup-time-period": {{ warmup_time | default(300) | tojson }},
       "ramp-up-time-period": {{ ramp_up_time | default(100) | tojson }},
       "time-period": {{ time_period | default(300) | tojson }},
       "target-throughput": {{ target_throughput | default(20) | tojson }},
       "clients": {{ search_clients | default(20) }}
    }
  ]
}
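
Assuming a workload that defines this test procedure, the defaults above can be overridden at invocation time via OSB's --workload-params option, e.g. opensearch-benchmark execute-test --test-procedure=timed-mode-test-procedure --workload-params="ramp_up_time:200,search_clients:40" (the exact parameter names depend on the workload's templates).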

OpenSearch Benchmark can run either in iteration-mode or timed-mode:

  • Iteration mode: Each task has defined warm-up and iterations values, and the task is run for the defined number of iterations before the end results are calculated.
  • Timed mode: Each task has defined warm-up and time-period values, and the task is run for the duration declared in time-period. Results are calculated based on the number of requests sent to the cluster during that time-period.

The new ramp-up-time-period task parameter tells OSB, during a timed-mode test, to bring clients up gradually over the period defined by ramp-up-time-period rather than spinning them all up at once and flooding the cluster with traffic. For more detail on this new task property, read @rishabh6788's proposal here.
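
Conceptually, the ramp-up behaves like the sketch below (a simplification for illustration; the real scheduling lives in OSB's worker coordination code):

def clients_at(elapsed_s, ramp_up_time_period_s, total_clients):
    """How many clients should be active `elapsed_s` seconds into the test."""
    if elapsed_s >= ramp_up_time_period_s:
        return total_clients
    # Bring clients up roughly linearly over the ramp-up window.
    return max(1, int(total_clients * elapsed_s / ramp_up_time_period_s))

# With ramp-up-time-period=100 and clients=20: one new client roughly every 5 seconds.
print([clients_at(t, 100, 20) for t in (0, 25, 50, 100)])  # [1, 5, 10, 20]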

Even with this test procedure and the new task property, OSB still ramps up the number of clients blindly over time, with no indication of where the cluster might be breaking. Users still need to make an educated guess about the QPS to get a sense of the breaking point.

Proposal

To address the above issues, we propose adding a new feedback actor to the OSB system. This would involve collecting request data during runtime, and sending these up the “chain of command” to the new actor which would have the ability to pause/unpause clients.

This feedback mechanism would give OSB the ability to monitor the cluster more closely and respond appropriately if any issues arise during a load test. It would allow OSB to accurately fine-tune the throughput until just before the point where the cluster cannot handle any more traffic.

The user must pass a new --redline-test flag for the BenchmarkActor to create this new FeedbackActor and its related pieces.

Solution 1: Global shared mapping of workers to clients (Chosen)

[Diagram: Solution 1 — FeedbackActor with a shared worker-to-client mapping]

This approach introduces a new FeedbackActor and a thread-safe shared dictionary, where each worker maps to its associated clients and their pause status.

A sample dictionary would look like:

{
   worker-0: {client-0:<boolean>, client-1:<boolean>, ... }
   .
   .
   .
   worker-n: {...client-k-1:<boolean>, client-k:<boolean>}
}
  • True indicates the client is active and sending requests.
  • False indicates the client is paused.

This shared dictionary is implemented using Python’s multiprocessing.Manager() to enable safe shared access across worker processes and actors. The FeedbackActor reads this state and adjusts client activity based on request failures reported in a separate shared queue (also created through the manager).

This is preferred over sending messages to the FeedbackActor, because Thespian actors handle messages in a synchronized, single-threaded manner unless they are part of an actor troupe (see here for more details). With hundreds or thousands of clients running at once, message-based signalling would inevitably put a large amount of unnecessary load on OSB.

The FeedbackActor will poll the queue at defined intervals and access the shared dictionary, immediately flipping values to False to pause a percentage of clients if failures appear in the queue. Otherwise, if no failures have arrived for some time, the FeedbackActor can begin to unpause previously paused clients, if any, by flipping False values back to True.

Each client's execution logic will consult the shared dictionary before sending a request to check whether it should be paused. Paused clients sleep briefly before checking again and log their current status. A sketch of this interaction follows below.
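
A minimal sketch of this approach, assuming hypothetical names (client_states, error_queue, pause_fraction) that are not part of OSB today:

import multiprocessing
import queue
import time

def feedback_loop(client_states, error_queue, pause_fraction=0.10, poll_interval_s=1.0):
    """FeedbackActor-style loop: pause a fraction of active clients when failures appear."""
    while True:
        saw_error = False
        try:
            while True:            # drain everything that arrived since the last poll
                error_queue.get_nowait()
                saw_error = True
        except queue.Empty:
            pass
        if saw_error:
            active = [(w, c) for w, clients in client_states.items()
                      for c, running in clients.items() if running]
            for w, c in active[:max(1, int(len(active) * pause_fraction))]:
                client_states[w][c] = False   # flip to False -> the client pauses itself
        time.sleep(poll_interval_s)

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    # worker -> {client_id: True (active) / False (paused)}
    client_states = manager.dict({
        "worker-0": manager.dict({"client-0": True, "client-1": True}),
        "worker-1": manager.dict({"client-2": True, "client-3": True}),
    })
    error_queue = manager.Queue()  # clients would put failed-request metadata here
    feedback_loop(client_states, error_queue)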

Pros:

  • Less invasive implementation
  • Simpler architecture; single point of control
  • Easier to maintain and update global state

Cons:

  • Potential bottleneck with all clients accessing the same dictionary
  • Higher contention for shared resource access

Solution 2: Introduce a global client dictionary

This solution involves a single shared dictionary mapping client IDs directly to their pause state, e.g.:

{
   client-0: True,
   client-1: False,
   client-2: True,
   ...
}

Each client consults this dictionary before executing a request. The FeedbackActor updates values based on cluster health by pulling from a shared error queue. This approach avoids worker-level grouping and allows centralized state management.
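
The client-side gate for this flat mapping could look like the sketch below (wait_if_paused is a hypothetical helper, not an existing OSB function):

import time

def wait_if_paused(client_id, client_states, poll_interval_s=0.5):
    """Sleep while this client is marked paused (False) in the shared dictionary."""
    while not client_states.get(client_id, True):
        # Paused: back off briefly, then re-check so the FeedbackActor can unpause us.
        time.sleep(poll_interval_s)

# Inside a client's request loop, before each request is sent:
#   wait_if_paused("client-0", client_states)
#   ... send the request ...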

Pros:

  • Simplest implementation — no worker/client hierarchy needed
  • Minimal bookkeeping for client states
  • Still allows runtime feedback and control

Cons:

  • FeedbackActor is blind to which Workers control which clients
    • Risk of uneven load distribution across workers
  • Potential bottleneck with all clients accessing the same dictionary
  • Less granular control over load distribution
  • Higher contention for shared resource access

Solution 3: Have Workers control their assigned clients

[Diagram: Solution 3 — workers controlling their assigned clients]

This solution delegates control to each Worker, who manages its own client dictionary:

{
   worker-0: {client-0: True, client-1: False},
   worker-1: {client-2: True, client-3: True},
   ...
}

The FeedbackActor would send pause/unpause instructions to workers, which then update their internal state and instruct clients accordingly. Clients check their worker’s internal state before sending a request (see the sketch after the sample mapping below).

A sample mapping, assuming k clients spread over n workers, would look like:

{
   worker-0: {client-0:<pause status>, client-1:<pause status>, ... }
   .
   .
   .
   worker-n: {...client-k-1:<pause status>, client-k:<pause status>}
}
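
Under this approach, each Worker would react to pause/unpause messages from the FeedbackActor. A rough, Thespian-style sketch (PauseClients and UnpauseClients are hypothetical message types, not part of OSB):

from thespian.actors import Actor

class PauseClients:        # hypothetical message: pause `count` of this worker's clients
    def __init__(self, count):
        self.count = count

class UnpauseClients:      # hypothetical message: resume `count` paused clients
    def __init__(self, count):
        self.count = count

class Worker(Actor):
    client_states = None   # client_id -> True (active) / False (paused)

    def receiveMessage(self, message, sender):
        if self.client_states is None:
            self.client_states = {"client-0": True, "client-1": True}
        if isinstance(message, PauseClients):
            for cid in [c for c, active in self.client_states.items() if active][:message.count]:
                self.client_states[cid] = False
        elif isinstance(message, UnpauseClients):
            for cid in [c for c, active in self.client_states.items() if not active][:message.count]:
                self.client_states[cid] = True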

Pros:

  • Better load balancing control
  • Reduced contention on shared resources
  • More granular control over client distribution
  • Easier to scale horizontally

Cons:

  • More complex implementation
  • Higher memory overhead from multiple dictionaries

How is request data collected during runtime?

By default, OSB collects ‘samples’ after each individual request and sends them to the WorkerCoordinatorActor. The WorkerCoordinatorActor then passes these raw samples to a SamplePostProcessor every N seconds (default: 30), which stores them in the metrics store. These samples include the request metadata, which contains a ‘success’ field: success: True/False.

Before sending their samples back, clients can check the request metadata to see whether the request failed. If it did, the client adds the failed request’s metadata to the shared queue to alert the FeedbackActor, as in the sketch below.
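
A hedged sketch of that client-side check (report_if_failed and the exact metadata keys are assumptions, not existing OSB code):

def report_if_failed(request_meta_data, error_queue, client_id):
    """If the last request failed, push its metadata onto the shared error queue."""
    if not request_meta_data.get("success", True):
        error_queue.put({
            "client_id": client_id,
            "http-status": request_meta_data.get("http-status"),
            "error-type": request_meta_data.get("error-type"),
        })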

What should happen to the message queue while the FeedbackActor is adjusting client load?

Naturally, if a cluster is overloaded, failures rarely stop at a single request. Many clients could fail at the same time, and without proper message handling, the FeedbackActor could easily overcorrect and pause many (if not all) of the clients at once.

To prevent this, the message queue handled by the FeedbackActor should be a blocking queue. When error messages are received, the queue should block any new incoming messages until the proper adjustments are made. If the correction made was not enough to prevent errors, then new messages will come and the FeedbackActor will make another adjustment, pausing more clients.

What happens when the cluster recovers after pausing clients?

After making an adjustment to the active client count, the FeedbackActor should give the cluster a grace period (maybe 30 seconds) to allow the cluster time to recover/adjust to the new client count.

Once the cluster has recovered, if there are no error messages for a set period of time, the FeedbackActor should begin to unpause clients over time.
If failures are detected again, then this gradual unpause should stop so the FeedbackActor can readjust client count again.
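
The recovery side might look like the sketch below (the grace period, quiet period, and step size are illustrative values, not settled design):

import time

GRACE_PERIOD_S = 30   # wait after any adjustment before reacting again
QUIET_PERIOD_S = 60   # error-free time required before unpausing begins
UNPAUSE_STEP = 2      # clients resumed per interval

def maybe_unpause(client_states, last_error_time, last_adjustment_time):
    """Gradually resume paused clients once the cluster has been quiet long enough."""
    now = time.monotonic()
    if now - last_adjustment_time < GRACE_PERIOD_S:
        return   # still in the grace period after the last pause/unpause
    if now - last_error_time < QUIET_PERIOD_S:
        return   # errors seen too recently; hold the current client count
    paused = [c for c, active in client_states.items() if not active]
    for client_id in paused[:UNPAUSE_STEP]:
        client_states[client_id] = True   # bring clients back a few at a time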

What will the results look like?

The maximum load the cluster was able to handle will be surfaced to the console once the test finishes, alongside other statistics that OSB already calculates, such as indexing time, throughput, and latency.

Design Details

Functional Requirements

Activation Requirements

  • The load test with feedback mechanism is only enabled when:
    • User provides the --redline-test flag
    • Test procedure uses the timed-mode structure with a ramp-up-time-period value.
  • System validates both conditions before initializing FeedbackActor
  • All other benchmark configurations should skip the creation of the feedback actor, and clients should not perform any checks or attempt to send failed requests to a feedback actor.

Architecture Overview

  • New Feedback Actor component that monitors and controls client load
  • Shared dictionary using multiprocessing.Manager() to control individual clients
  • Communication flow between FeedbackActor, Workers, and clients

Key Components

  • FeedbackActor
    • Receives failed request metadata
    • Controls client activation/deactivation via shared dictionary/array
    • Implements gradual ramp-up and down logic
  • Shared Control Dictionary
    • Structure: {client_id: boolean}
    • Managed by multiprocessing.Manager()
    • Used to control individual client execution
  • Client integration
    • Clients check their status in shared dictionary before executing tasks
    • Continues normal operation if active (True)
    • Skips operation if inactive (False)

Data Flow

  1. Client executes request
  2. Metrics gathered after request is completed
  3. Unsuccessful response metrics get enqueued to a queue shared with the FeedbackActor
  4. FeedbackActor snoops error message queue, blocks queue before adjusting client count
  5. FeedbackActor updates shared dictionary, pausing a percentage of the total client count, unblocks message queue
  6. Individual clients check dictionary before next execution to ensure they should be running

Non-functional Requirements

Scaling and Performance

  • Shared dictionary must efficiently handle lookups for thousands of concurrent clients
    • For slightly improved performance, an array can be used in place of the dictionary
  • System should handle rapid client state changes without degradation

Latency and Performance

  • Dictionary lookups should have negligible impact on client request timing
  • Minimal memory overhead from shared dictionary implementation

Observability

  • Pause/Unpause actions will be logged
  • Number of active/paused clients
  • Current number of active clients can be surfaced to the console for the user

Testing

  • Robust unit tests needed as writing tests for dynamic behavior could be challenging

Future Considerations

  • Support for additional performance metrics (latency, service time, etc.)
    • Extensibility for custom feedback rules
  • Potential for more sophisticated control algorithms
    • Configurable parameters, like:
      • Cool-down period before unpausing clients
      • Maximum client pause percentage
      • Minimum active clients
  • Integration with other OSB features
    • aggregate command, test-iterations flag

References/Other resources

  • Initial document shared with the benchmarking team by @rishabh6788 to gather information on OSB’s architecture and discuss possible approaches
  • Rishabh’s initial PR to introduce the new ramp up parameter for OSB workloads
  • RFC Created by Rishabh on the ramp up parameter
  • PR created by me to allow users to call for a load test with a new command flag, --load-test-qps=X
  • Feature introduced by the KNN team to run an operation repeatedly with different client counts
