Open
Description
We observe requests freezing on the vLLM sim when running benchmarks with llm-d-benchmark and WVA.
Steps to reproduce:
- In the WVA repo, install the simulated kind environment with the make command:
make deploy-wva-emulated-on-kind CREATE_CLUSTER=true DEPLOY_LLM_D=true
- Downscale the model service to 1 replica.
- After the stack is deployed, add the YAML patches below, first to the modelservice and then to the HPA:
spec:
  template:
    spec:
      containers:
      - name: vllm
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - --model
        - unsloth/Meta-Llama-3.1-8B
        - --port
        - "8200"
        - --served-model-name
        - unsloth/Meta-Llama-3.1-8B
        - --time-to-first-token=200
        - --inter-token-latency=20
        - --enable-kvcache
        - --kv-cache-size=1024
        - --block-size=16
        - --tokenizers-cache-dir=/tmp
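
As a rough sanity check on the latency settings in the modelservice patch, Little's law gives the number of requests that must be in flight for the simulators to keep up with a constant arrival rate. The sketch below is only a back-of-envelope estimate. It assumes that `--time-to-first-token` and `--inter-token-latency` are in milliseconds, that a request takes roughly one TTFT plus one inter-token latency per remaining output token, that `--kv-cache-size` counts blocks, and that the `constant_10rps_20min` profile sends 10 requests per second. The output length of 128 tokens is a hypothetical value, not something taken from the benchmark profile.

```python
def request_latency_s(ttft_ms, itl_ms, output_tokens):
    """Approximate service time of one request, in seconds.

    Assumed model: first token after ttft_ms, then itl_ms per remaining token.
    """
    return (ttft_ms + itl_ms * max(output_tokens - 1, 0)) / 1000.0


def required_in_flight(rps, ttft_ms, itl_ms, output_tokens):
    """Little's law: average in-flight requests = arrival rate * time in system."""
    return rps * request_latency_s(ttft_ms, itl_ms, output_tokens)


def kv_cache_tokens(num_blocks, block_size):
    """Token capacity of the KV cache, assuming num_blocks blocks of block_size tokens."""
    return num_blocks * block_size


if __name__ == "__main__":
    # Values from the patch: TTFT 200, ITL 20, kv-cache-size 1024, block-size 16.
    # output_tokens=128 and rps=10 are assumptions for illustration.
    print(request_latency_s(200, 20, 128))
    print(required_in_flight(10, 200, 20, 128))
    print(kv_cache_tokens(1024, 16))
```

Under these assumptions a request takes about 2.74 s, so about 27 requests need to be in flight across all replicas to sustain 10 requests per second. If the combined concurrency of the running replicas is below that, queues are expected to grow during the run, but they should still drain once traffic stops.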
spec:
  metrics:
  - type: External
    external:
      metric:
        name: wva_desired_replicas
        selector:
          matchLabels:
            variant_name: ms-sim-llm-d-modelservice-decode
            exported_namespace: llm-d-sim
      target:
        type: AverageValue
        averageValue: 1
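
For readers less familiar with external metrics, the following sketch shows how an HPA turns an `AverageValue` target into a replica count: the metric total is divided by the target average and rounded up, unless the current state is already within tolerance. This is a simplified model, not the controller's code. It ignores `minReplicas`/`maxReplicas`, stabilization windows, and scaling policies, and the 0.1 tolerance is the Kubernetes default, assumed unchanged here.

```python
import math


def desired_replicas(metric_total, target_average_value, current_replicas, tolerance=0.1):
    """Simplified HPA calculation for an External metric with an AverageValue target.

    Assumes current_replicas >= 1 and target_average_value > 0.
    """
    usage_ratio = metric_total / (target_average_value * current_replicas)
    # Within tolerance: the HPA leaves the replica count unchanged.
    if abs(1.0 - usage_ratio) <= tolerance:
        return current_replicas
    return math.ceil(metric_total / target_average_value)


if __name__ == "__main__":
    # With averageValue: 1, the replica count follows wva_desired_replicas directly.
    print(desired_replicas(metric_total=10, target_average_value=1, current_replicas=1))
```

With `averageValue: 1` as in the patch above, the HPA in effect scales the deployment to whatever value WVA publishes in `wva_desired_replicas`.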
- Start llm-d-benchmark with the command
./run.sh -c scenarios/constant_traffic_20min.sh
The scenario file is shown below:
#!/usr/bin/env bash
# Simulation Configuration
export LLMDBENCH_HARNESS_NAME="guidellm"
export LLMDBENCH_HARNESS_EXPERIMENT_PROFILE="constant_10rps_20min"
export LLMDBENCH_VLLM_COMMON_NAMESPACE="llm-d-sim"
export LLMDBENCH_HARNESS_NAMESPACE="llm-d-sim"
# WVA Configuration
export LLMDBENCH_WVA_ENABLED=true
# PVC Configuration
export LLMDBENCH_VLLM_COMMON_PVC_ACCESS_MODE="ReadWriteOnce"
# Harness Resources
export LLMDBENCH_HARNESS_CPU_NR=0.5
export LLMDBENCH_HARNESS_CPU_MEM=1Gi
# Explicit Endpoint Override
export LLMDBENCH_HARNESS_STACK_ENDPOINT_URL="http://infra-sim-inference-gateway-istio.llm-d-sim.svc.cluster.local:80"
# Explicit Model List to match deployment
export LLMDBENCH_DEPLOY_MODEL_LIST="unsloth/Meta-Llama-3.1-8B"
We observe that many requests are queued on the vllm-sim servers and that these queues never drain. The decode pods are listed below:
ms-sim-llm-d-modelservice-decode-6456997975-6tqrz 2/2 Running 0 136m
ms-sim-llm-d-modelservice-decode-6456997975-88v26 2/2 Running 0 109m
ms-sim-llm-d-modelservice-decode-6456997975-bhf4n 2/2 Running 0 146m
ms-sim-llm-d-modelservice-decode-6456997975-fvkxc 2/2 Running 0 141m
ms-sim-llm-d-modelservice-decode-6456997975-hvhk9 2/2 Running 0 150m
ms-sim-llm-d-modelservice-decode-6456997975-hvp92 2/2 Running 0 99m
ms-sim-llm-d-modelservice-decode-6456997975-j669r 2/2 Running 0 94m
ms-sim-llm-d-modelservice-decode-6456997975-lm7rm 2/2 Running 0 160m
ms-sim-llm-d-modelservice-decode-6456997975-nhvx5 2/2 Running 0 89m
ms-sim-llm-d-modelservice-decode-6456997975-tq4xx 2/2 Running 0 104m
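
One way to confirm the stuck queues is to sample each pod's Prometheus metrics a few times and check whether the waiting-request gauge ever goes down. The sketch below only parses metrics text that has already been fetched. It assumes the simulator exposes a vLLM-style `vllm:num_requests_waiting` gauge on its `/metrics` endpoint; the helper names `gauge_total` and `queue_is_stuck` are hypothetical and not part of any of the tools mentioned in this issue.

```python
import re

# name, optional {labels}, then the sample value (timestamps are not handled)
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def gauge_total(metrics_text, name):
    """Sum all series of the given metric in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def queue_is_stuck(waiting_samples):
    """True if the queue is non-empty in every sample and never shrinks between samples."""
    if len(waiting_samples) < 2 or min(waiting_samples) <= 0:
        return False
    return all(b >= a for a, b in zip(waiting_samples, waiting_samples[1:]))


if __name__ == "__main__":
    # Made-up sample text for illustration; real text would come from a pod's /metrics.
    sample = (
        "# TYPE vllm:num_requests_waiting gauge\n"
        'vllm:num_requests_waiting{model_name="unsloth/Meta-Llama-3.1-8B"} 42\n'
    )
    print(gauge_total(sample, "vllm:num_requests_waiting"))
```

Collecting `gauge_total(...)` for the same pod at several points after the benchmark has finished and passing the list to `queue_is_stuck` would show whether the queue on that pod is frozen or merely slow to drain.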