@OVI3D0 OVI3D0 commented Sep 24, 2025

Description

Adds a new 'fire and forget' mode to OSB.

Rather than OSB's traditional client model of 'fire request -> await response -> record metrics -> move on', this mode lets each client fire requests without awaiting responses or recording metrics. This allows OSB to sustain high throughput when load testing a cluster. It is intended for users who don't need precise measurements on each request and would rather test their clusters against very high sustained throughput levels.

The drawback, of course, is that OSB returns no information about how the cluster is performing. Users have to rely on outside forms of polling or measurement, like the performance charts that can be viewed in the AWS console for their cluster.

The PR introduces a new flag, --fire-and-forget, which tells OSB to choose the DeterministicScheduler. Unlike the UnitAwareScheduler used for most benchmarks, this scheduler doesn't care about request metadata. It only calculates the throughput for each client and tells each one to send requests at that rate.
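
As a rough illustration of the scheduling idea (this is not the DeterministicScheduler code from the PR), the sketch below assumes the target throughput is split evenly across clients. Under that assumption each client's send times follow from the target rate and the client count alone. The function name and its parameters are hypothetical.

```python
def per_client_send_times(target_throughput: float, clients: int, count: int) -> list:
    """Return the first `count` send offsets (in seconds) for one client.

    Hypothetical helper: assumes the overall target throughput is divided
    evenly across clients, so each client sends at target_throughput / clients.
    """
    if target_throughput <= 0 or clients <= 0:
        raise ValueError("target_throughput and clients must be positive")
    per_client_rate = target_throughput / clients  # requests per second per client
    interval = 1.0 / per_client_rate               # seconds between sends
    return [i * interval for i in range(count)]


# 500 requests/s spread over 250 clients gives 2 requests/s per client,
# so one client sends at offsets 0.0, 0.5, 1.0 and 1.5 seconds.
schedule = per_client_send_times(500, 250, 4)
print(schedule)
```

Nothing in this calculation looks at the request itself, which is the sense in which such a schedule ignores request metadata.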

The flag also tells OSB to use a new request executor, the UnhingedExecutor. Unlike the AsyncExecutor, this executor creates separate asynchronous tasks to send requests without awaiting them, so requests are sent at the specified rate regardless of latency or failure rates. In short, if each executor needs to send 2 RPS, this mode tells it to create 2 async tasks per second, each of which sends one request, rather than trying to send 1 request every 0.5 seconds on its own.
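
A minimal sketch of the pattern, for illustration only: it is not the UnhingedExecutor implementation, and `send_request` and `fire_at_rate` are hypothetical names. It assumes a simulated request that takes 0.2 seconds and shows that the time spent firing depends on the rate, not on how long each request takes.

```python
import asyncio


async def send_request(i: int, results: list) -> None:
    """Stand-in for a real request; sleeps to simulate a slow response."""
    await asyncio.sleep(0.2)
    results.append(i)


async def fire_at_rate(rate: float, total: int, results: list) -> float:
    """Create `total` tasks at `rate` per second without awaiting responses.

    Returns the time spent firing the requests.
    """
    loop = asyncio.get_running_loop()
    interval = 1.0 / rate
    start = loop.time()
    tasks = set()
    for i in range(total):
        # Sleep until this request's scheduled time, then fire and move on.
        delay = start + i * interval - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        task = asyncio.create_task(send_request(i, results))
        tasks.add(task)  # keep a reference so the task is not garbage collected
        task.add_done_callback(tasks.discard)
    fired_in = loop.time() - start
    # Only so this demo exits cleanly; a fire-and-forget client would not wait here.
    await asyncio.gather(*list(tasks))
    return fired_in


results = []
# 5 requests at 20 per second: firing takes roughly 0.2 s, whereas awaiting
# each 0.2 s request in turn would take about 1 s.
fired_in = asyncio.run(fire_at_rate(20, 5, results))
print(f"fired 5 requests in {fired_in:.2f}s, completed: {sorted(results)}")
```

The set of task references follows the asyncio documentation's advice to hold on to tasks returned by `asyncio.create_task`, since the event loop keeps only weak references to them.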

This mode can consume resources very quickly, and users should ensure their hardware can handle the thousands of async tasks created when using this flag. With so many connections being established, they are also likely to run into an OSError for too many open files.

The maximum number of open file descriptors, which includes network sockets, can be checked with ulimit -n and changed with the same command, for example: ulimit -n 2048
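
The same limit can also be inspected from Python with the standard library `resource` module (Unix only). The sketch below is not part of the PR, and `raise_nofile_limit` is a hypothetical helper. Unlike `ulimit -n`, which applies to the shell and the processes started from it, this only affects the current Python process and its children.

```python
import resource  # Unix-only standard library module

# The soft limit is what the process is currently held to; the hard limit is
# the ceiling an unprivileged process may raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")


def raise_nofile_limit(wanted: int) -> int:
    """Try to raise the soft open-file limit to `wanted`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            # Some platforms reject values the kernel cannot honor; keep the old limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

For example, `raise_nofile_limit(2048)` attempts the equivalent of `ulimit -n 2048` for the running process and leaves the limit unchanged if it is already higher.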

Issues Resolved

#958

Testing

  • New functionality includes testing

New unit tests + running tests in 'fire and forget' mode against OS cluster at 500 TPS:

Screenshot 2025-09-24 at 11 48 18 AM Screenshot 2025-09-24 at 11 47 55 AM

Unit tests produce this warning since we don't await the async tasks:

sys:1: RuntimeWarning: coroutine 'UnhingedExecutor._fire_and_forget_request.<locals>.fire_and_forget_runner' was never awaited
Coroutine created at (most recent call last)
  File "/Users/mikeovi/.pyenv/versions/3.8.12/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
    method()
  File "/Users/mikeovi/.pyenv/versions/3.8.12/lib/python3.8/unittest/mock.py", line 1325, in patched
    return func(*newargs, **newkeywargs)
  File "/Users/mikeovi/workplace/opensearch-benchmark/tests/__init__.py", line 35, in async_wrapper
    asyncio.run(t(*args, **kwargs), debug=True)
  File "/Users/mikeovi/.pyenv/versions/3.8.12/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/Users/mikeovi/.pyenv/versions/3.8.12/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/Users/mikeovi/.pyenv/versions/3.8.12/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/Users/mikeovi/.pyenv/versions/3.8.12/lib/python3.8/asyncio/base_events.py", line 1851, in _run_once
    handle._run()
  File "/Users/mikeovi/.pyenv/versions/3.8.12/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/Users/mikeovi/workplace/opensearch-benchmark/tests/worker_coordinator/worker_coordinator_test.py", line 2454, in test_fire_and_forget_request_no_throttling_needed
    await executor._fire_and_forget_request({}, 1.0, 0.0)  # expected_scheduled_time = 1.0
  File "/Users/mikeovi/workplace/opensearch-benchmark/osbenchmark/worker_coordinator/worker_coordinator.py", line 2603, in _fire_and_forget_request
    task = asyncio.create_task(fire_and_forget_runner())

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Michael Oviedo <[email protected]>
default=None
)
test_run_parser.add_argument(
"--fire-and-forget",
Collaborator

nit: can we come up with some other name, like --no-await or --sustain etc?

self.complete.set()
await self._cleanup()

class UnhingedExecutor:
Collaborator

Nit: Same here, can we change this to AsyncNoAwaitExecutor or something similar to maintain naming convention?

self.logger.info("Client id [%s] is running now.", self.client_id)


async def _fire_and_forget_request(self, params: dict, expected_scheduled_time: float, total_start: float) -> None:
Collaborator

nit: same here for the naming.

@rishabh6788
Collaborator

LGTM apart from some naming conventions. Can you run a test in timed mode, maybe for 15-30 minutes, and share some more results? It would also be a good idea to test it with ramp-up mode.

Signed-off-by: Michael Oviedo <[email protected]>
@OVI3D0
Member Author

OVI3D0 commented Oct 15, 2025

LGTM apart from some naming conventions. Can you run a test in timed mode, maybe for 15-30 minutes, and share some more results? It would also be a good idea to test it with ramp-up mode.

Here's a screenshot from a run that combines this mode with the ramp-up test procedure property:
Screenshot 2025-10-15 at 3 25 00 PM

I used this test procedure:

{
  "name": "ramp-up-test-procedure",
  "schedule": [
    {
       "operation": "range",
       "warmup-time-period": {{ warmup_time | default(900) | tojson }},
       "ramp-up-time-period": {{ ramp_up_time | default(600) | tojson }},
       "time-period": {{ time_period | default(1500) | tojson }},
       "target-throughput": {{ target_throughput | default(2000) | tojson }},
       "clients": {{ search_clients | default(2000) }}
    }
  ]
}

@OVI3D0 OVI3D0 requested a review from rishabh6788 October 20, 2025 17:37