Commit 1e31046

Add precise prefix cache aware benchmarking config

1 parent 95fd944 commit 1e31046

4 files changed: +250 −8 lines
Lines changed: 83 additions & 0 deletions

@@ -0,0 +1,83 @@
# Precise Prefix Cache Aware Benchmarking Helm Chart

## Prerequisites

Before you begin, ensure you have the following:

* **Helm 3+**: [Installation Guide](https://helm.sh/docs/intro/install/)
* **Kubernetes Cluster**: Access to a Kubernetes cluster
* **Gateway Deployed**: Your inference server/gateway must be deployed and accessible within the cluster.
* **Hugging Face Token Secret**: A Hugging Face token to pull models.
## Shared Prefix Dataset Configuration

The chart uses the `shared_prefix` dataset type, which is designed to test caching efficiency. These parameters are located under `config.data.shared_prefix`:

* `num_groups`: The number of shared prefix groups.
* `num_prompts_per_group`: The number of prompts within each shared prefix group.
* `system_prompt_len`: The length (in tokens) of the shared system prompt.
* `question_len`: The length (in tokens) of the unique question part of each prompt.
* `output_len`: The desired length (in tokens) of the model's output.

The default values for the dataset are defined in the chart, but you can override them using `--set config.data.shared_prefix.<parameter>` flags.
Example:

```bash
helm install my-release ../inference-perf -f high-cache-values.yaml --set config.data.shared_prefix.num_groups=512
```
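As a rough sizing guide: with the defaults in the bundled values files (`num_groups: 256`, `num_prompts_per_group: 16`), a run generates 256 × 16 = 4,096 distinct prompts, each `system_prompt_len + question_len` input tokens long (assuming the lengths are token counts).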
## Deployment

This chart supports two main configurations, defined in `high-cache-values.yaml` and `low-cache-values.yaml`.

### 1. Deploying the High-Cache Configuration

This configuration is optimized for scenarios where a high cache hit rate is expected. It uses the `high-cache-values.yaml` file.

```bash
cd gateway-api-inference-extension/benchmarking/precise-prefix-cache-aware
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
helm install high-cache ../inference-perf -f high-cache-values.yaml \
  --set hfToken=${HF_TOKEN} \
  --set "config.server.base_url=http://${IP}:${PORT}"
```
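To watch a run in progress, you can tail the benchmark pod's logs. The label selector below assumes the chart follows the common `app.kubernetes.io/instance` convention for release names; adjust it to whatever labels the chart actually sets:

```bash
# Find the benchmark pod (the selector is an assumption, not taken from the chart)
kubectl get pods -l app.kubernetes.io/instance=high-cache
# Stream its output; <benchmark-pod-name> is a placeholder
kubectl logs -f <benchmark-pod-name>
```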
**Parameters to customize:**

* `high-cache`: A unique name for this deployment.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the high-cache scenario.
### 2. Deploying the Low-Cache Configuration

This configuration is designed for scenarios with a lower cache hit rate. It uses the `low-cache-values.yaml` file, which inverts the shared/unique prompt-length split, as shown below.
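The two values files in this commit differ only in how the prompt length is split between the shared system prompt and the unique question:

```yaml
# low-cache-values.yaml (high-cache-values.yaml uses the inverse split)
shared_prefix:
  system_prompt_len: 256   # high-cache: 2048
  question_len: 2048       # high-cache: 256
```

With the low-cache split, only about 256 / (256 + 2048) ≈ 11% of each prompt is a shared prefix, versus ≈ 89% in the high-cache configuration.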
```bash
cd gateway-api-inference-extension/benchmarking/precise-prefix-cache-aware
export IP='<YOUR_IP>'
export PORT='<YOUR_PORT>'
export HF_TOKEN='<YOUR_HUGGINGFACE_TOKEN>'
helm install low-cache ../inference-perf -f low-cache-values.yaml \
  --set hfToken=${HF_TOKEN} \
  --set "config.server.base_url=http://${IP}:${PORT}"
```
**Parameters to customize:**

* `low-cache`: A unique name for this deployment.
* `hfTokenSecret.name`: The name of your Kubernetes Secret containing the Hugging Face token (default: `hf-token`).
* `hfTokenSecret.key`: The key in your Kubernetes Secret pointing to the Hugging Face token (default: `token`).
* `config.server.base_url`: The base URL (IP and port) of your inference server for the low-cache scenario.
## Uninstalling the Charts

To uninstall the deployed charts (using the release names from the installs above):

```bash
helm uninstall high-cache
helm uninstall low-cache
```
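Since inference-perf currently writes benchmark results to standard output only (see the storage note in the docs change below), consider saving the pod logs before uninstalling. `<benchmark-pod-name>` is a placeholder:

```bash
# Save results before the job's pod is cleaned up; find the pod with `kubectl get pods`
kubectl logs <benchmark-pod-name> > high-cache-results.txt
```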
Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
```yaml
# High-Cache Configuration
job:
  image:
    repository: quay.io/inference-perf/inference-perf
    tag: "latest" # Defaults to .Chart.AppVersion
  serviceAccountName: ""
  nodeSelector: {}
  # Example resources:
  # resources:
  #   requests:
  #     cpu: "1"
  #     memory: "4Gi"
  #   limits:
  #     cpu: "2"
  #     memory: "8Gi"
  resources: {}

logLevel: INFO

# A GCS bucket path that points to the dataset file.
# The file will be copied from this path to the local file system
# at /dataset/dataset.json for use during the run.
# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/dataset.json.
gcsPath: ""

# hfToken optionally creates a secret with the specified token.
# Can be set using helm install --set hfToken=<token>
hfToken: ""

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 100
        duration: 30
      - rate: 200
        duration: 30
      - rate: 300
        duration: 30
      - rate: 400
        duration: 30
      - rate: 500
        duration: 30
      - rate: 600
        duration: 30
      - rate: 700
        duration: 30
      - rate: 800
        duration: 30
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256
      num_prompts_per_group: 16
      system_prompt_len: 2048
      question_len: 256
      output_len: 256
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
```
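Note on the load profile above: the eight constant-rate stages step the request rate from 100 to 800 requests/s in increments of 100, each stage lasting 30 s; assuming the stages run back to back, one full sweep generates load for 8 × 30 = 240 s.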
Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
```yaml
# Low-Cache Configuration
job:
  image:
    repository: quay.io/inference-perf/inference-perf
    tag: "latest" # Defaults to .Chart.AppVersion
  serviceAccountName: ""
  nodeSelector: {}
  # Example resources:
  # resources:
  #   requests:
  #     cpu: "1"
  #     memory: "4Gi"
  #   limits:
  #     cpu: "2"
  #     memory: "8Gi"
  resources: {}

logLevel: INFO

# A GCS bucket path that points to the dataset file.
# The file will be copied from this path to the local file system
# at /dataset/dataset.json for use during the run.
# NOTE: For this dataset to be used, config.data.path must also be explicitly set to /dataset/dataset.json.
gcsPath: ""

# hfToken optionally creates a secret with the specified token.
# Can be set using helm install --set hfToken=<token>
hfToken: ""

config:
  load:
    type: constant
    interval: 15
    stages:
      - rate: 100
        duration: 30
      - rate: 200
        duration: 30
      - rate: 300
        duration: 30
      - rate: 400
        duration: 30
      - rate: 500
        duration: 30
      - rate: 600
        duration: 30
      - rate: 700
        duration: 30
      - rate: 800
        duration: 30
    worker_max_concurrency: 1000
  api:
    type: completion
    streaming: true
  server:
    type: vllm
    model_name: meta-llama/Llama-3.1-8B-Instruct
    base_url: http://0.0.0.0:8000
    ignore_eos: true
  tokenizer:
    pretrained_model_name_or_path: meta-llama/Llama-3.1-8B-Instruct
  data:
    type: shared_prefix
    shared_prefix:
      num_groups: 256
      num_prompts_per_group: 16
      system_prompt_len: 256 # Low-cache setting
      question_len: 2048 # Low-cache setting
      output_len: 256
  metrics:
    type: prometheus
    prometheus:
      google_managed: true
  report:
    request_lifecycle:
      summary: true
      per_stage: true
      per_request: true
    prometheus:
      summary: true
      per_stage: true
```

site-src/performance/benchmark/index.md

Lines changed: 5 additions & 8 deletions

```diff
@@ -46,7 +46,7 @@ For more parameter customizations, refer to inference-perf [guides](https://gith
 
 ### Storage Parameters
 
-Note: Currently inference-perf outputs benchmark results to standard output only, and results will be deleted once pod is finished running the job.
+> Note: Currently inference-perf outputs benchmark results to standard output only, and results will be deleted once the pod has finished running the job.
 
 
 #### 1. Local Storage (Default)
@@ -74,14 +74,11 @@ storage:
 ###### 🚨 GCS Permissions Checklist (Required for Write Access)
 
 1. **IAM Role (Service Account):** Bound to the target bucket.
+    * **Minimum:** **Storage Object Creator** (`roles/storage.objectCreator`)
+    * **Full:** **Storage Object Admin** (`roles/storage.objectAdmin`)
 
-    * **Minimum:** **Storage Object Creator** (`roles/storage.objectCreator`)
-
-    * **Full:** **Storage Object Admin** (`roles/storage.objectAdmin`)
-
-2. **Node Access Scope (GKE Node Pool):** Set during node pool creation.
-
-    * **Required Scope:** **`devstorage.read_write`** or **`cloud-platform`**
+2. **Node Access Scope (GKE Node Pool):** Set during node pool creation
+    * **Required Scope:** **`devstorage.read_write`** or **`cloud-platform`**
 
 
 #### 3. Simple Storage Service (S3)
```
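For reference, granting the minimum role from the checklist could look like the sketch below; the bucket name and service-account email are placeholders, not values from this commit:

```bash
# Grant write-only object creation on the results bucket
# Placeholders: <YOUR_BUCKET>, <NODE_SA_EMAIL>
gcloud storage buckets add-iam-policy-binding gs://<YOUR_BUCKET> \
  --member="serviceAccount:<NODE_SA_EMAIL>" \
  --role="roles/storage.objectCreator"
```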
