Skip to content

[Bug]: Sleep Runner fails to complete with KeyError: 'request_start' #780

Closed
@Gankris96

Description

@Gankris96

Describe the bug

I have a custom procedure with sleep operation after each major operation. Here is an example procedure -

{% import "benchmark.helpers" as benchmark with context %}
{
    "name": "knn-no-train-test",
    "default": false,
    "schedule": [
        {
            "operation": {
                "name": "delete-target-index",
                "operation-type": "delete-index",
                "only-if-exists": true,
                "index": "{{ target_index_name }}"
            }
        },
        {
            "operation": {
                "name": "wait-a-minute-or-more-1",
                "operation-type": "sleep",
                "duration": {{ wait_time_in_sec }}
            }
        },
        {
            "operation": {
                "name": "create-target-index",
                "operation-type": "create-index",
                "index": "{{ target_index_name }}"
            }
        },
        {
            "operation": {
                "name": "wait-a-minute-or-more-2",
                "operation-type": "sleep",
                "duration": {{ wait_time_in_sec }}
            }
        },
        {
            "operation": {
                "name": "custom-vector-bulk",
                "operation-type": "custom-vector-bulk",
                "index": "{{ target_index_name }}",
                "field": "{{ target_field_name }}",
                "bulk_size": {{ target_index_bulk_size }},
                "data_set_format": "{{ target_index_bulk_index_data_set_format }}",
                "data_set_path": "{{ target_index_bulk_index_data_set_path }}",
                "num_vectors": {{ target_index_num_vectors }},
                "timeout": "{{ timeout }}"
            },
            "clients": {{ target_index_bulk_index_clients }}
        },
        {
            "operation": {
                "name": "wait-a-minute-or-more-3",
                "operation-type": "sleep",
                "duration": {{ wait_time_in_sec }}
            }
        }
    ]
}

When i run this -


   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: b07200c1-b031-4fd7-b6ab-c379ceb4c40e
[WARNING] Local changes in [/Users/gramadur/.benchmark/benchmarks/workloads/default] prevent workloads update from remote. Please commit your changes.
[WARNING] Local changes in [/Users/gramadur/.benchmark/benchmarks/workloads/default] prevent workloads update from remote. Please commit your changes.
[WARNING] Local changes in [/Users/gramadur/.benchmark/benchmarks/workloads/default] prevent workloads update from remote. Please commit your changes.
[INFO] Executing test with workload [vectorsearch], test_procedure [knn-no-train-test] and provision_config_instance ['external'] with version [None].

Running delete-target-index                                                    [100% done]
Running wait-a-minute-or-more-1                                                [  0% done]
[ERROR] Cannot execute-test. Benchmark operation failed:
Worker ID: 0
Task: wait-a-minute-or-more-1
Workload: vectorsearch
Test Procedure: knn-no-train-test
Cause: Unknown
Error: Cannot run task [wait-a-minute-or-more-1]: 'request_start'
	Cannot run task [wait-a-minute-or-more-1]: 'request_start'


To reproduce

To Reproduce -

I think any of the procedures, if you add a sleep operation like above would lead to this issue.

{
            "operation": {
                "name": "wait-a-minute-or-more-1",
                "operation-type": "sleep",
                "duration": {{ wait_time_in_sec }}
            }
        },

I have provided an example procedure above for an aoss collection.

For example - you can add the above operation to test_procedures/common/index-only-schedule.json and run the aoss 1m cohere workload -

opensearch-benchmark execute-test --workload=vectorsearch --target-hosts=<ENDPOINT> --client-options="timeout:600,use_ssl:True,verify_certs:True,amazon_aws_log_in:environment,region:us-east-1" --workload-params=faiss-cohere-768-dp.json --test-procedure=no-train-test-index-only --pipeline=benchmark-only  --kill-running-processes --distribution-version=2.17.0

It will fail at the sleep operation that was just added

Expected behavior

Expected behavior - the sleep runner completes successfully by performing sleep for provided duration seconds.

Screenshots

If applicable, add screenshots to help explain your problem.

Host / Environment

No response

Additional context

And the benchmarking logs show -

2025-03-04 17:35:17,245 ActorAddr-(T|:59907)/PID:69727 osbenchmark.worker_coordinator.worker_coordinator ERROR Could not execute schedule
Traceback (most recent call last):

  File "/Users/gramadur/Library/Python/3.9/lib/python/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 1674, in __call__
    request_start = request_context.request_start

  File "/Users/gramadur/Library/Python/3.9/lib/python/site-packages/osbenchmark/client.py", line 54, in request_start
    return self.ctx["request_start"]

KeyError: 'request_start'


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/Users/gramadur/Library/Python/3.9/lib/python/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 1677, in __call__
    client_request_end = request_context.client_request_end

  File "/Users/gramadur/Library/Python/3.9/lib/python/site-packages/osbenchmark/client.py", line 72, in __aexit__
    request_start = self.request_start

  File "/Users/gramadur/Library/Python/3.9/lib/python/site-packages/osbenchmark/client.py", line 54, in request_start
    return self.ctx["request_start"]

KeyError: 'request_start'

2025-03-04 17:35:17,265 ActorAddr-(T|:59907)/PID:69727 osbenchmark.worker_coordinator.worker_coordinator INFO Total run duration: 3.022682 seconds.
2025-03-04 17:35:17,269 ActorAddr-(T|:59907)/PID:69727 osbenchmark.worker_coordinator.worker_coordinator INFO Total time to shutdown asyncgens: 0.003797 seconds.
2025-03-04 17:35:17,269 ActorAddr-(T|:59907)/PID:69727 osbenchmark.worker_coordinator.worker_coordinator INFO Total time to close transports: -0.000291 seconds.
2025-03-04 17:35:19,223 ActorAddr-(T|:59907)/PID:69727 osbenchmark.actor ERROR Worker[0] has detected a benchmark failure:
Benchmark operation failed:
Worker ID: 0
Task: wait-a-minute-or-more-1
Workload: vectorsearch
Test Procedure: knn-no-train-test
Cause: Unknown
Error: Cannot run task [wait-a-minute-or-more-1]: 'request_start'
Traceback (most recent call last):

  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)

  File "/Users/gramadur/Library/Python/3.9/lib/python/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 1514, in __call__
    loop.run_until_complete(self.run())

  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()

  File "/Users/gramadur/Library/Python/3.9/lib/python/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 1561, in run
    _ = await asyncio.gather(*aws)

  File "/Users/gramadur/Library/Python/3.9/lib/python/site-packages/osbenchmark/worker_coordinator/worker_coordinator.py", line 1725, in __call__
    raise exceptions.BenchmarkError(f"Cannot run task [{self.task}]: {e}") from None

osbenchmark.exceptions.BenchmarkError: Cannot run task [wait-a-minute-or-more-1]: 'request_start'

Relevant log output

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions