Tests for a vLLM OpenAI- and Anthropic-compatible API server running models on any accelerator.
These tests can be downloaded and executed against any API endpoint, whether running on bare metal or in a Kubernetes cluster, provided the service is reachable on port 8009.
- Python 3.12+
- vLLM server running at `http://localhost:8009`

Install test dependencies:

```shell
pip install -r requirements/test.txt
```
```text
├── conftest.py
├── pytest.ini
├── requirements/
│   └── test.txt
├── tests/
│   └── openai_tests/
│       ├── __init__.py
│       ├── test_chat_completions.py
│       └── test_endpoints.py
└── variables/
    ├── __init__.py
    └── common.py
```
All shared variables live in `variables/common.py`:

```python
BASE_URL = "http://localhost:8009"
MODEL = "meta-llama/Llama-3.2-1B-Instruct"
HEADERS = {"Content-Type": "application/json"}
```

Update `BASE_URL` or `MODEL` here if your server runs on a different host or port, or serves a different model.
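Tests import these constants when building requests. A minimal sketch of that usage, with the values inlined so the snippet is self-contained (`chat_payload` is an illustrative helper, not part of the suite):

```python
# Values mirror variables/common.py; inlined here for self-containment.
BASE_URL = "http://localhost:8009"
MODEL = "meta-llama/Llama-3.2-1B-Instruct"
HEADERS = {"Content-Type": "application/json"}

def chat_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Build a /v1/chat/completions request body for the configured model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# A test would POST this body to f"{BASE_URL}/v1/chat/completions" with HEADERS.
print(chat_payload("What is the capital of France?"))
```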
Note: always pass `-s` so pytest prints what was tested, what was expected, and how the model responded.
```shell
# Entire OpenAI test suite
pytest tests/openai_tests/ -v -s

# Endpoint tests
pytest tests/openai_tests/test_endpoints.py -v -s
pytest tests/openai_tests/test_endpoints.py::TestHealth -v -s
pytest tests/openai_tests/test_endpoints.py::TestModels -v -s
pytest tests/openai_tests/test_endpoints.py::TestHealth::test_health_status_200 -v -s
pytest tests/openai_tests/test_endpoints.py::TestModels::test_list_models_contains_expected_model -v -s

# Chat completion tests
pytest tests/openai_tests/test_chat_completions.py -v -s
pytest tests/openai_tests/test_chat_completions.py::TestBasicResponse -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers -v -s
pytest tests/openai_tests/test_chat_completions.py::TestMultiTurn -v -s
pytest tests/openai_tests/test_chat_completions.py::TestStreaming -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_capital_of_france_is_paris -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_color_of_sky_is_blue -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_two_plus_two_equals_four -v -s
pytest tests/openai_tests/test_chat_completions.py::TestMultiTurn::test_recalls_name_from_context -v -s
pytest tests/openai_tests/test_chat_completions.py::TestStreaming::test_streaming_factual_answer_paris -v -s

# All factual answer tests
pytest tests/openai_tests/ -v -s -k "factual"

# All streaming tests
pytest tests/openai_tests/ -v -s -k "streaming"

# All multi-turn tests
pytest tests/openai_tests/ -v -s -k "multi_turn"

# All tokenize tests
pytest tests/openai_tests/ -v -s -k "tokenize"

# All roundtrip tests
pytest tests/openai_tests/ -v -s -k "roundtrip"
```

| Class | Endpoint | What it tests |
|---|---|---|
| `TestHealth` | `GET /health` | HTTP 200, response time under 2000 ms |
| `TestModels` | `GET /v1/models` | Structure, non-empty list, expected model present, field validation |
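As a sketch of what the `TestModels` checks amount to, the validation below operates on an already-fetched `/v1/models` JSON body (assuming the OpenAI-style `{"object": "list", "data": [...]}` shape; the helper name is illustrative):

```python
def validate_models_response(body: dict, expected_model: str) -> None:
    """Assert the /v1/models payload has the expected list structure."""
    assert body.get("object") == "list"
    models = body.get("data") or []
    assert models, "model list is empty"
    for m in models:
        assert m.get("object") == "model"  # per-entry field validation
        assert m.get("id")
    assert expected_model in [m["id"] for m in models]

sample = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.2-1B-Instruct", "object": "model"}],
}
validate_models_response(sample, "meta-llama/Llama-3.2-1B-Instruct")  # passes silently
```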
| Class | Endpoint | What it tests |
|---|---|---|
| `TestBasicResponse` | `POST /v1/chat/completions` | HTTP 200, response structure, usage fields, finish reason, `max_tokens` |
| `TestFactualAnswers` | `POST /v1/chat/completions` | Correct answers: Paris, blue, 4, Tokyo, 5 |
| `TestMultiTurn` | `POST /v1/chat/completions` | Context recall across multiple conversation turns |
| `TestStreaming` | `POST /v1/chat/completions` | SSE streaming chunks, non-empty output, factual answers via stream |
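The streaming tests consume Server-Sent Events. A minimal sketch of the chunk handling, assuming OpenAI-style `data:` lines carrying `choices[0].delta.content` (the function name is illustrative):

```python
import json

def collect_stream_content(sse_lines):
    """Concatenate content deltas from OpenAI-style streaming chunk lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue                       # skip blanks / keep-alive lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break                          # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Par"}}]}',
    'data: {"choices": [{"delta": {"content": "is"}}]}',
    "data: [DONE]",
]
print(collect_stream_content(lines))  # Paris
```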
```text
tests/openai_tests/test_endpoints.py::TestTokenize::test_tokenize_returns_non_empty_tokens PASSED
  Tested: POST /v1/tokenize returns non-empty token list
  Prompt: Hello world
  Tokens: [9906, 1917]
  Count: 2

tests/openai_tests/test_endpoints.py::TestDetokenize::test_tokenize_detokenize_roundtrip PASSED
  Tested: Tokenize → Detokenize round-trip recovers original text
  Original: What is the capital of France?
  Tokens: [3923, 374, 279, 6864, 315, 9822, 30]
  Recovered: What is the capital of France?
```
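The round-trip check can also be expressed against the two JSON bodies alone. A sketch, assuming vLLM's `{"tokens": [...]}` tokenize response and `{"prompt": "..."}` detokenize response (the helper is illustrative, not the suite's actual code):

```python
def assert_roundtrip(original: str, tokenize_body: dict, detokenize_body: dict) -> None:
    """Check that tokenize → detokenize recovers the original text."""
    tokens = tokenize_body.get("tokens") or []
    assert tokens, "tokenize returned an empty token list"
    assert detokenize_body.get("prompt") == original, "round trip did not recover text"

# Passes silently on the sample from the output above.
assert_roundtrip(
    "What is the capital of France?",
    {"tokens": [3923, 374, 279, 6864, 315, 9822, 30], "count": 7},
    {"prompt": "What is the capital of France?"},
)
```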
```text
tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_capital_of_france_is_paris PASSED
  Tested: Factual answer: 'What is the capital of France?'
  Expected: paris
  Response: Paris

tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_color_of_sky_is_blue PASSED
  Tested: Factual answer: 'What color is the sky?'
  Expected: blue
  Response: Blue

tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_two_plus_two_equals_four PASSED
  Tested: Factual answer: 'What is 2 + 2?'
  Expected: 4
  Response: 4
```
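Note in the output above that "paris" matches "Paris": the factual-answer comparison is case-insensitive. A minimal sketch of that check over a non-streaming response body (both helper names are illustrative):

```python
def extract_answer(body: dict) -> str:
    """Pull the assistant text out of a /v1/chat/completions response."""
    return body["choices"][0]["message"]["content"]

def answer_contains(response_text: str, expected: str) -> bool:
    """Case-insensitive containment check used by factual-answer style tests."""
    return expected.lower() in response_text.lower()

sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Paris"}, "finish_reason": "stop"}
    ]
}
print(answer_contains(extract_answer(sample), "paris"))  # True
```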
```shell
python3 -m vllm.entrypoints.openai.api_server \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8009 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --model meta-llama/Llama-3.2-1B-Instruct
```
```shell
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Embedding-0.6B \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009
```
```shell
python3 -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-reranker-v2-m3 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009
```
```shell
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --port 8009 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice
```
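With `--enable-auto-tool-choice` active, clients can send OpenAI-style `tools` definitions in the chat request. A hedged sketch of such a request body (`get_weather` is a made-up tool for illustration, not part of this suite):

```python
def tool_call_payload(prompt: str) -> dict:
    """Build a chat request body advertising one callable tool."""
    return {
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }

print(tool_call_payload("What's the weather in Paris?")["tools"][0]["function"]["name"])
# get_weather
```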
```shell
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009
```

```shell
sudo docker run -it --rm \
  -e HUGGING_FACE_HUB_TOKEN="<your_token>" \
  -e HF_HUB_OFFLINE=0 \
  --network=host \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  -p 8009:8009 \
  vllm-neuron:latest \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8009 \
  --tensor-parallel-size 2 \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --additional-config '{ "override_neuron_config": { "async_mode": true } }'
```