
Commit 511d3ae

Scripts to run and extract benchmark data in hack (#889)

* script to extract benchmark data
* re-org benchmark extract and add benchmark spawn
* add customizations
* dump metrics from prom
* changes to script
* clean up readme and reorg
* address review
* fix base path issue
* address typo

1 parent b0010c6 commit 511d3ae

14 files changed: +1882 −0 lines changed

hack/benchmark/README.md

Lines changed: 74 additions & 0 deletions
# WVA Autoscaler Benchmarking

This directory contains the tools and scripts to automate the end-to-end benchmarking of the Workload Variant Autoscaler (WVA) against standard HPA baselines. It orchestrates deploying the environment, running GuideLLM synthetic workloads, extracting metrics, and generating comparison reports.

## Architecture & Directory Structure

- `run/`: Contains `run_ci_benchmark.sh`, the master orchestration script. It handles teardown, standup, baseline/WVA execution, and metric extraction. Also contains scripts for triplicate runs.
- `scenarios/`: Custom workload profiles (e.g., `prefill_heavy`, `decode_heavy`, `symmetrical`), which are automatically injected during the benchmark run.
- `extract/`: Contains `get_benchmark_report.py` for generating PDF plots and latency statistics from the offline data (run automatically by the CI script).
- `dump_epp_fc_metrics/`: Contains scripts to dump raw Prometheus metrics into offline JSON for analysis (run automatically by the CI script).

## Setup Instructions

This benchmarking suite acts as a wrapper around the `llm-d-benchmark` repository.

### 1. Clone the Benchmark Repository

Ensure `llm-d-benchmark` is cloned **inside the `wva-autoscaler` root directory**:

```bash
cd /path/to/wva-autoscaler
git clone https://github.com/llm-d/llm-d-benchmark.git
```

### 2. Export Required HuggingFace Token

The `llm-d-benchmark` deployment layer strictly requires a HuggingFace authentication token to spin up the vLLM modelservice endpoint (even for public/non-gated models).
You **MUST** export your token to your shell environment before initiating the test orchestrator:

```bash
export LLMDBENCH_HF_TOKEN="hf_your_token_here"
```

## Running Benchmarks

Run the main orchestrator script directly from its location in this repository:

```bash
cd run
./run_ci_benchmark.sh -n "my-namespace" -m "Qwen/Qwen3-0.6B" -s "inference-scheduling" -w "symmetrical"
```

### Configuration Flags

| Flag | Default | Description |
|---|---|---|
| `-n` | `default` | The Kubernetes namespace to use for the benchmark. |
| `-m` | `Qwen/Qwen3-0.6B` | The model to deploy and benchmark. |
| `-s` | `inference-scheduling` | The scenario file to use during the standup phase. |
| `-w` | `chatbot_synthetic` | The workload profile to simulate (e.g., `chatbot_synthetic`, `symmetrical`). It will auto-detect matching profiles in `scenarios/`. |
| `-d` | *(none)* | Enable Direct HPA mode (bypasses WVA scaling logic). |
| `-t` | *(none)* | Apply a custom WVA threshold ConfigMap path (e.g., `-t ../scenarios/wva_threshold/wva-threshold-config.yaml`). |

### Direct HPA Baseline (-d)

To run a baseline benchmark using the standard Kubernetes Horizontal Pod Autoscaler (HPA) instead of WVA, pass the `-d` flag. This will:
1. Deploy the standard environment.
2. Scale the WVA controller down to 0, completely disabling its scaling logic.
3. Deploy a custom direct HPA that scales the decode model server directly on queue-size and running-requests metrics.
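For intuition on what that baseline does, the standard HPA applies Kubernetes' documented proportional scaling rule to the chosen metric; the numbers below are illustrative and not taken from this repository:

```python
import math

def hpa_desired_replicas(current_replicas: int, observed: float, target: float) -> int:
    # Kubernetes HPA core rule: desired = ceil(current * observed / target).
    # e.g. 2 replicas observing an average queue size of 20 against a
    # target of 10 would be scaled to 4.
    return math.ceil(current_replicas * observed / target)
```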

### Automated Metrics Extraction Phase

After the GuideLLM load generation completes, `run_ci_benchmark.sh` automatically performs **Step 6**. It will:
1. Identify the newly generated GuideLLM results on the remote PVC.
2. Download them locally to `wva-autoscaler/exp_data/`.
3. Execute `dump_all_metrics.py` to drain Prometheus metrics inside the benchmark time-window boundaries.
4. Execute `get_benchmark_report.py` to plot hardware capacity, response patterns (TTFT/ITL), and autoscaling behavior, cleanly packaged into a PDF report.
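The time-window boundaries in step 3 are derived from the first and last GuideLLM report files, padded by 60 seconds on each side (as in the `dump_epp_fc_metrics` scripts). A minimal sketch, with inline dicts standing in for the parsed report YAMLs:

```python
# Minimal sketch of the 60 s-padded query window derivation; the inline
# dicts stand in for parsed GuideLLM benchmark_report YAML files.
first_report = {"metrics": {"time": {"start": 1700000000.0, "stop": 1700000300.0}}}
last_report = {"metrics": {"time": {"start": 1700000300.0, "stop": 1700000600.0}}}

start_time = float(first_report["metrics"]["time"]["start"]) - 60
end_time = float(last_report["metrics"]["time"]["stop"]) + 60
window_seconds = end_time - start_time  # 600 s of load plus 120 s of padding
```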

You do not need to run the Python extraction scripts manually unless you want to re-generate the plots from cached offline data.

> [!NOTE]
> **Python Dependencies**: `run_ci_benchmark.sh` automatically creates an isolated local virtual environment (`hack/benchmark/.venv`) and installs the exact library versions from `requirements.txt` to guarantee deterministic extraction. If you intend to run the `dump_all_metrics.py` or `get_benchmark_report.py` scripts standalone, ensure you either activate this virtual environment or install the libraries manually:
> ```bash
> source .venv/bin/activate
> # or manually
> pip install -r requirements.txt
> ```
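When run standalone, the dump scripts issue Prometheus range queries over the derived window. A hedged sketch of the kind of request involved, using the standard Prometheus HTTP API `query_range` endpoint and the 15 s step the scripts use (base URL and authentication are omitted here):

```python
# Sketch of a Prometheus /api/v1/query_range request URL like the ones the
# dump scripts issue; the base URL and auth handling are assumptions.
from urllib.parse import urlencode

def build_range_query_url(base: str, query: str, start: float, end: float,
                          step: str = "15s") -> str:
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base}/api/v1/query_range?{params}"
```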

hack/benchmark/dump_epp_fc_metrics/dump_all_metrics.py

Lines changed: 360 additions & 0 deletions
Large diffs are not rendered by default.

Lines changed: 120 additions & 0 deletions
```python
#!/usr/bin/env python3
import argparse
import json
import os
import datetime
import yaml

try:
    from dump_all_metrics import check_privileges, query_prometheus_range
except ImportError as e:
    print(f"Error importing from dump_all_metrics: {e}")
    print("Ensure this script is running in the same directory as dump_all_metrics.py")
    exit(1)

def main():
    parser = argparse.ArgumentParser(description="Dump Flow Control metrics for a specific run.")
    parser.add_argument(
        "-n", "--namespace",
        default="default",
        help="The namespace to query (default: default)"
    )
    parser.add_argument(
        "-r", "--results-dir",
        required=True,
        help="Path to the GuideLLM exp-docs folder to parse results and save the dump."
    )
    parser.add_argument(
        "-t", "--token",
        default=None,
        help="OpenShift login token."
    )
    parser.add_argument(
        "-s", "--server",
        default=None,
        help="OpenShift server URL."
    )
    args = parser.parse_args()

    check_privileges(args.token, args.server)

    results_dir = args.results_dir
    if not os.path.exists(results_dir):
        print(f"Error: Directory {results_dir} does not exist.")
        exit(1)

    yaml_first = os.path.join(results_dir, "benchmark_report,_results.json_0.yaml")
    yaml_first_alt = os.path.join(results_dir, "benchmark_report_v0.2,_results.json_0.yaml")
    start_yaml = yaml_first if os.path.exists(yaml_first) else (yaml_first_alt if os.path.exists(yaml_first_alt) else None)

    yaml_last = os.path.join(results_dir, "benchmark_report,_results.json_3.yaml")
    yaml_last_alt = os.path.join(results_dir, "benchmark_report_v0.2,_results.json_3.yaml")
    end_yaml = yaml_last if os.path.exists(yaml_last) else (yaml_last_alt if os.path.exists(yaml_last_alt) else start_yaml)

    start_time = None
    end_time = None
    if start_yaml and end_yaml:
        try:
            with open(start_yaml, 'r') as f:
                d_start = yaml.safe_load(f)
            start_time = float(d_start['metrics']['time']['start']) - 60
            with open(end_yaml, 'r') as f:
                d_end = yaml.safe_load(f)
            end_time = float(d_end['metrics']['time']['stop']) + 60
            print(f"Time window automatically derived: {datetime.datetime.fromtimestamp(start_time)} to {datetime.datetime.fromtimestamp(end_time)}")
        except Exception as e:
            print(f"Warning: Failed to parse exact benchmark bounds from YAMLs: {e}")

    if not start_time or not end_time:
        print("Error: Could not determine start and end times from YAML files in the results directory.")
        exit(1)

    namespace = args.namespace

    # We use sum() for size and bytes to aggregate across instances/queues.
    # Other metrics like duration are histograms or summaries in Prometheus, so we generally look at the raw metric.
    metrics = {
        # Flow Control Metrics
        "queue_duration": f'inference_extension_flow_control_request_queue_duration_seconds{{namespace="{namespace}"}}',
        "queue_size": f'sum(inference_extension_flow_control_queue_size{{namespace="{namespace}"}})',
        "queue_bytes": f'sum(inference_extension_flow_control_queue_bytes{{namespace="{namespace}"}})',
        "dispatch_cycle": f'inference_extension_flow_control_dispatch_cycle_duration_seconds{{namespace="{namespace}"}}',
        "enqueue_duration": f'inference_extension_flow_control_request_enqueue_duration_seconds{{namespace="{namespace}"}}',
        "pool_saturation": f'inference_extension_flow_control_pool_saturation{{namespace="{namespace}"}}',

        # General EPP Objective Metrics
        "request_total": f'inference_objective_request_total{{namespace="{namespace}"}}',
        "request_error_total": f'inference_objective_request_error_total{{namespace="{namespace}"}}',
        "running_requests": f'inference_objective_running_requests{{namespace="{namespace}"}}',
        "request_duration_seconds": f'inference_objective_request_duration_seconds_sum{{namespace="{namespace}"}}',
        "request_duration_count": f'inference_objective_request_duration_seconds_count{{namespace="{namespace}"}}',
        "request_sizes_bytes": f'inference_objective_request_sizes_sum{{namespace="{namespace}"}}',
        "response_sizes_bytes": f'inference_objective_response_sizes_sum{{namespace="{namespace}"}}',
        "input_tokens": f'inference_objective_input_tokens_sum{{namespace="{namespace}"}}',
        "output_tokens": f'inference_objective_output_tokens_sum{{namespace="{namespace}"}}',
        "prompt_cached_tokens": f'inference_objective_prompt_cached_tokens_sum{{namespace="{namespace}"}}',
        "normalized_ttft": f'inference_objective_normalized_time_per_output_token_seconds_sum{{namespace="{namespace}"}}',

        # EPP Pool Metrics
        "pool_per_pod_queue_size": f'inference_pool_per_pod_queue_size{{namespace="{namespace}"}}',
        "pool_average_queue_size": f'inference_pool_average_queue_size{{namespace="{namespace}"}}',
        "pool_average_kv_cache_utilization": f'inference_pool_average_kv_cache_utilization{{namespace="{namespace}"}}',
        "pool_ready_pods": f'inference_pool_ready_pods{{namespace="{namespace}"}}'
    }

    dump_data = {}

    for name, query in metrics.items():
        print(f"Fetching metric: {name} ...")
        # EPP metrics are in user workload monitoring
        data = query_prometheus_range(query, start_time, end_time, step="15s", user_workload=True)
        dump_data[name] = data

    output_path = os.path.join(results_dir, "epp_metrics_dump.json")
    with open(output_path, "w") as f:
        json.dump(dump_data, f, indent=2)

    print(f"✅ Successfully dumped all EPP metrics to {output_path}")

if __name__ == "__main__":
    main()
```
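For offline analysis, the resulting `epp_metrics_dump.json` can be loaded back with a small reader. This sketch is hypothetical: it assumes each metric entry holds a Prometheus matrix-style result (`{"metric": ..., "values": [[ts, val], ...]}`), since the exact shape depends on what `query_prometheus_range` returns:

```python
import json

# Hypothetical reader for epp_metrics_dump.json; assumes each entry holds a
# Prometheus matrix-style result ({"metric": ..., "values": [[ts, val], ...]}).
def load_metric_series(dump_path: str, name: str):
    with open(dump_path) as f:
        dump = json.load(f)
    series = dump.get(name, [])
    # Flatten every time series for this metric into (timestamp, value) pairs.
    return [(float(ts), float(val))
            for entry in series
            for ts, val in entry.get("values", [])]
```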
