Frozen requests on sim server when running benchmarks #339

@asm582

Description

We observe requests freezing on the vLLM simulator when running benchmarks with llm-d-benchmark and WVA.

Steps to reproduce:

  • In the WVA repo, install the simulated kind environment with the make command: make deploy-wva-emulated-on-kind CREATE_CLUSTER=true DEPLOY_LLM_D=true
  • Downscale the model service to 1 replica.
  • After the stack is deployed, apply the first YAML patch below to the modelservice, and then the second patch to the HPA:
spec:
  template:
    spec:
      containers:
      - name: vllm
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - --model
        - unsloth/Meta-Llama-3.1-8B
        - --port
        - "8200"
        - --served-model-name
        - unsloth/Meta-Llama-3.1-8B
        - --time-to-first-token=200
        - --inter-token-latency=20
        - --enable-kvcache
        - --kv-cache-size=1024
        - --block-size=16
        - --tokenizers-cache-dir=/tmp
        
spec:
  metrics:
  - type: External
    external:
      metric:
        name: wva_desired_replicas
        selector:
          matchLabels:
            variant_name: ms-sim-llm-d-modelservice-decode
            exported_namespace: llm-d-sim
      target:
        type: AverageValue
        averageValue: 1
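For reference, the two patches above can be applied with kubectl roughly as follows. This is a sketch only: the resource names, kinds, and patch-file names are assumptions inferred from the pod names in this report, so verify them against your cluster first (e.g. with kubectl -n llm-d-sim get modelservice,hpa).

```shell
# Sketch, not verified against a live cluster: resource names below are
# assumptions; save the two YAML fragments above as modelservice-patch.yaml
# and hpa-patch.yaml before running.

# Apply the container env/args patch to the modelservice:
kubectl -n llm-d-sim patch modelservice ms-sim-llm-d-modelservice \
  --type merge --patch-file modelservice-patch.yaml

# Apply the external-metric patch to the HPA:
kubectl -n llm-d-sim patch hpa ms-sim-llm-d-modelservice-decode \
  --type merge --patch-file hpa-patch.yaml
```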
  • Start the llm-d benchmark with ./run.sh -c scenarios/constant_traffic_20min.sh; the scenario file is shown below:
#!/usr/bin/env bash

# Simulation Configuration
export LLMDBENCH_HARNESS_NAME="guidellm"
export LLMDBENCH_HARNESS_EXPERIMENT_PROFILE="constant_10rps_20min"
export LLMDBENCH_VLLM_COMMON_NAMESPACE="llm-d-sim"
export LLMDBENCH_HARNESS_NAMESPACE="llm-d-sim"

# WVA Configuration
export LLMDBENCH_WVA_ENABLED=true

# PVC Configuration
export LLMDBENCH_VLLM_COMMON_PVC_ACCESS_MODE="ReadWriteOnce"

# Harness Resources
export LLMDBENCH_HARNESS_CPU_NR=0.5
export LLMDBENCH_HARNESS_CPU_MEM=1Gi

# Explicit Endpoint Override
export LLMDBENCH_HARNESS_STACK_ENDPOINT_URL="http://infra-sim-inference-gateway-istio.llm-d-sim.svc.cluster.local:80"

# Explicit Model List to match deployment
export LLMDBENCH_DEPLOY_MODEL_LIST="unsloth/Meta-Llama-3.1-8B"

We observe that many requests remain queued on the vllm-sim servers, and the queues never drain.

ms-sim-llm-d-modelservice-decode-6456997975-6tqrz    2/2     Running   0             136m
ms-sim-llm-d-modelservice-decode-6456997975-88v26    2/2     Running   0             109m
ms-sim-llm-d-modelservice-decode-6456997975-bhf4n    2/2     Running   0             146m
ms-sim-llm-d-modelservice-decode-6456997975-fvkxc    2/2     Running   0             141m
ms-sim-llm-d-modelservice-decode-6456997975-hvhk9    2/2     Running   0             150m
ms-sim-llm-d-modelservice-decode-6456997975-hvp92    2/2     Running   0             99m
ms-sim-llm-d-modelservice-decode-6456997975-j669r    2/2     Running   0             94m
ms-sim-llm-d-modelservice-decode-6456997975-lm7rm    2/2     Running   0             160m
ms-sim-llm-d-modelservice-decode-6456997975-nhvx5    2/2     Running   0             89m
ms-sim-llm-d-modelservice-decode-6456997975-tq4xx    2/2     Running   0             104m
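One way to confirm the queues never drain is to scrape a sim pod's metrics endpoint over time and watch the waiting-request gauge. A minimal sketch follows; the metric name vllm:num_requests_waiting follows vLLM's Prometheus convention, which the simulator mimics, and port 8200 matches the --port argument in the modelservice patch above — both are assumptions to verify against your deployment.

```shell
# Assumed: port 8200 (from the modelservice patch) and vLLM-style metric names.
# Port-forward one decode pod and dump its metrics:
#   kubectl -n llm-d-sim port-forward pod/<decode-pod> 8200:8200 &
#   curl -s localhost:8200/metrics > metrics.txt

# Extract the waiting-request gauge from a Prometheus text-format dump;
# HELP/TYPE comment lines start with '#' and are skipped by the match.
waiting_requests() {
  awk '$1 ~ /^vllm:num_requests_waiting/ { print $2 }' "$1"
}
```

Running this every few seconds during the benchmark should show the gauge climbing and never returning to zero if the freeze reproduces.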
