
test-vllm-endpoints

Tests for the vLLM OpenAI- and Anthropic-compatible API server, running models on any accelerator.

These tests can be downloaded and executed against any API endpoint running on bare metal or a Kubernetes cluster, provided the service is accessible on port 8009.


Prerequisites

  • Python 3.12+
  • vLLM server running at http://localhost:8009

Installation

Install test dependencies:

pip install -r requirements/test.txt

Project Structure

.
├── conftest.py
├── pytest.ini
├── requirements/
│   └── test.txt
├── tests/
│   └── openai_tests/
│       ├── __init__.py
│       ├── test_chat_completions.py
│       └── test_endpoints.py
└── variables/
    ├── __init__.py
    └── common.py

Configuration

All shared variables live in variables/common.py:

BASE_URL = "http://localhost:8009"
MODEL    = "meta-llama/Llama-3.2-1B-Instruct"
HEADERS  = {"Content-Type": "application/json"}

Update BASE_URL or MODEL here if your server runs on a different host or port.
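As a sketch of how a test might consume these variables, the snippet below mirrors `variables/common.py` and builds a chat-completions request body. The helper name `build_chat_payload` is illustrative, not part of the repo:

```python
# Mirrors variables/common.py; build_chat_payload is a hypothetical helper
# showing how the shared variables feed a POST /v1/chat/completions request.
BASE_URL = "http://localhost:8009"
MODEL = "meta-llama/Llama-3.2-1B-Instruct"
HEADERS = {"Content-Type": "application/json"}


def build_chat_payload(user_message: str, max_tokens: int = 64) -> dict:
    """Assemble the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


url = f"{BASE_URL}/v1/chat/completions"
payload = build_chat_payload("What is the capital of France?")
```

Because every test imports from this one module, pointing the suite at a different server or model is a one-line change.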


Running the Tests

Note: Always pass -s so pytest prints what was tested, what was expected, and what the model responded.

Run all tests

pytest tests/openai_tests/ -v -s

test_endpoints.py — Health, Models, Tokenize, Detokenize

Run the full file

pytest tests/openai_tests/test_endpoints.py -v -s

Run a specific test class

pytest tests/openai_tests/test_endpoints.py::TestHealth -v -s
pytest tests/openai_tests/test_endpoints.py::TestModels -v -s

Run a single test by name

pytest tests/openai_tests/test_endpoints.py::TestHealth::test_health_status_200 -v -s
pytest tests/openai_tests/test_endpoints.py::TestModels::test_list_models_contains_expected_model -v -s

test_chat_completions.py — Chat Completions

Run the full file

pytest tests/openai_tests/test_chat_completions.py -v -s

Run a specific test class

pytest tests/openai_tests/test_chat_completions.py::TestBasicResponse -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers -v -s
pytest tests/openai_tests/test_chat_completions.py::TestMultiTurn -v -s
pytest tests/openai_tests/test_chat_completions.py::TestStreaming -v -s

Run a single test by name

pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_capital_of_france_is_paris -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_color_of_sky_is_blue -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_two_plus_two_equals_four -v -s
pytest tests/openai_tests/test_chat_completions.py::TestMultiTurn::test_recalls_name_from_context -v -s
pytest tests/openai_tests/test_chat_completions.py::TestStreaming::test_streaming_factual_answer_paris -v -s

Run tests matching a keyword

# All factual answer tests
pytest tests/openai_tests/ -v -s -k "factual"

# All streaming tests
pytest tests/openai_tests/ -v -s -k "streaming"

# All multi-turn tests
pytest tests/openai_tests/ -v -s -k "multi_turn"

# All tokenize tests
pytest tests/openai_tests/ -v -s -k "tokenize"

# All roundtrip tests
pytest tests/openai_tests/ -v -s -k "roundtrip"

Test Classes

test_endpoints.py

| Class      | Endpoint       | What it tests                                                        |
|------------|----------------|----------------------------------------------------------------------|
| TestHealth | GET /health    | HTTP 200, response time under 2000 ms                                |
| TestModels | GET /v1/models | Structure, non-empty list, expected model present, field validation  |
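As an illustrative sketch of the structure checks a `TestModels` case performs (the helper name `model_ids` and the exact assertions are assumptions, not the repo's code):

```python
def model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style GET /v1/models response body."""
    # vLLM returns {"object": "list", "data": [{"id": ..., "object": "model"}, ...]}
    assert models_response.get("object") == "list"
    data = models_response.get("data", [])
    assert data, "model list must be non-empty"
    return [m["id"] for m in data]


# Example body in the shape vLLM returns
sample = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.2-1B-Instruct", "object": "model"}],
}
print(model_ids(sample))  # ['meta-llama/Llama-3.2-1B-Instruct']
```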

test_chat_completions.py

| Class             | Endpoint                  | What it tests                                                          |
|-------------------|---------------------------|------------------------------------------------------------------------|
| TestBasicResponse | POST /v1/chat/completions | HTTP 200, response structure, usage fields, finish reason, max_tokens  |
| TestFactualAnswers | POST /v1/chat/completions | Correct answers: Paris, blue, 4, Tokyo, 5                             |
| TestMultiTurn     | POST /v1/chat/completions | Context recall across multiple conversation turns                      |
| TestStreaming     | POST /v1/chat/completions | SSE streaming chunks, non-empty output, factual answers via stream     |
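The streaming tests consume vLLM's OpenAI-style server-sent events, where each line is `data: {...}` and the stream ends with `data: [DONE]`. A hedged sketch of accumulating the delta content (the repo's `TestStreaming` class may parse chunks differently):

```python
import json


def accumulate_sse_content(lines) -> str:
    """Join the delta content from an OpenAI-style chat-completions SSE stream."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore blank keep-alive lines
        body = line[len("data: "):]
        if body.strip() == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(body)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Par"}}]}',
    'data: {"choices": [{"delta": {"content": "is"}}]}',
    'data: [DONE]',
]
print(accumulate_sse_content(sample))  # Paris
```

A test can then assert on the assembled string exactly as it does for the non-streaming factual answers.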

Example Output

tests/openai_tests/test_endpoints.py::TestTokenize::test_tokenize_returns_non_empty_tokens PASSED
    Tested: POST /v1/tokenize returns non-empty token list
    Prompt: Hello world
    Tokens: [9906, 1917]
    Count: 2

tests/openai_tests/test_endpoints.py::TestDetokenize::test_tokenize_detokenize_roundtrip PASSED
    Tested: Tokenize → Detokenize round-trip recovers original text
    Original: What is the capital of France?
    Tokens: [3923, 374, 279, 6864, 315, 9822, 30]
    Recovered: What is the capital of France?

tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_capital_of_france_is_paris PASSED
    Tested: Factual answer: 'What is the capital of France?'
    Expected: paris
    Response: Paris

tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_color_of_sky_is_blue PASSED
    Tested: Factual answer: 'What color is the sky?'
    Expected: blue
    Response: Blue

tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_two_plus_two_equals_four PASSED
    Tested: Factual answer: 'What is 2 + 2?'
    Expected: 4
    Response: 4

Starting the Server

Bare Python

# Chat model (Llama 3.2 1B Instruct)
python3 -m vllm.entrypoints.openai.api_server \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8009 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --model meta-llama/Llama-3.2-1B-Instruct

# Embedding model
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Embedding-0.6B \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009

# Reranker model
python3 -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-reranker-v2-m3 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009

# Chat model with tool calling (Mistral)
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --port 8009 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice

# Vision-language model
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009

Docker

sudo docker run -it --rm \
  -e HUGGING_FACE_HUB_TOKEN="<your_token>" \
  -e HF_HUB_OFFLINE=0 \
  --network=host \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  -p 8009:8009 \
  vllm-neuron:latest \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8009 \
  --tensor-parallel-size 2 \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --additional-config '{ "override_neuron_config": { "async_mode": true } }'
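Model loading can take a while, so it helps to wait for the server before kicking off pytest. A minimal readiness poll against GET /health using only the standard library (the function name `wait_for_server` is an assumption, not part of the repo):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str = "http://localhost:8009",
                    timeout_s: float = 120.0) -> bool:
    """Poll GET /health until it returns HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(2)
    return False
```

For example, a launch script might run `wait_for_server()` and abort if it returns False instead of letting every test fail with connection errors.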
