
test-vllm-endpoints

Tests for the vLLM OpenAI- and Anthropic-compatible API server, running models on any accelerator.

These tests can be downloaded and executed against any API endpoint running on bare metal or a Kubernetes cluster, provided the service is accessible on port 8009.


Prerequisites

  • Python 3.12+
  • vLLM server running at http://localhost:8009

Installation

Install test dependencies:

pip install -r requirements/test.txt

Project Structure

.
├── conftest.py
├── pytest.ini
├── requirements/
│   └── test.txt
├── tests/
│   └── openai_tests/
│       ├── __init__.py
│       ├── test_chat_completions.py
│       └── test_endpoints.py
└── variables/
    ├── __init__.py
    └── common.py

Configuration

All shared variables live in variables/common.py:

BASE_URL = "http://localhost:8009"
MODEL    = "meta-llama/Llama-3.2-1B-Instruct"
HEADERS  = {"Content-Type": "application/json"}

Update BASE_URL or MODEL here if your server runs on a different host or port.
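As a sketch of how a test might consume these variables, the snippet below mirrors `variables/common.py` and builds a chat-completions request body. The helper name `build_chat_payload` is illustrative, not part of the repo:

```python
# Mirrors variables/common.py; build_chat_payload is a hypothetical helper
# showing how the shared variables feed a POST /v1/chat/completions request.
BASE_URL = "http://localhost:8009"
MODEL = "meta-llama/Llama-3.2-1B-Instruct"
HEADERS = {"Content-Type": "application/json"}


def build_chat_payload(user_message: str, max_tokens: int = 64) -> dict:
    """Assemble the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


url = f"{BASE_URL}/v1/chat/completions"
payload = build_chat_payload("What is the capital of France?")
```

Because every test imports from this one module, pointing the suite at a different server or model is a one-line change.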


Running the Tests

Note: Always pass -s so pytest prints what was tested, what was expected, and what the model responded.

Run all tests

pytest tests/openai_tests/ -v -s

test_endpoints.py — Health, Models, Tokenize, Detokenize

Run the full file

pytest tests/openai_tests/test_endpoints.py -v -s

Run a specific test class

pytest tests/openai_tests/test_endpoints.py::TestHealth -v -s
pytest tests/openai_tests/test_endpoints.py::TestModels -v -s

Run a single test by name

pytest tests/openai_tests/test_endpoints.py::TestHealth::test_health_status_200 -v -s
pytest tests/openai_tests/test_endpoints.py::TestModels::test_list_models_contains_expected_model -v -s

test_chat_completions.py — Chat Completions

Run the full file

pytest tests/openai_tests/test_chat_completions.py -v -s

Run a specific test class

pytest tests/openai_tests/test_chat_completions.py::TestBasicResponse -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers -v -s
pytest tests/openai_tests/test_chat_completions.py::TestMultiTurn -v -s
pytest tests/openai_tests/test_chat_completions.py::TestStreaming -v -s

Run a single test by name

pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_capital_of_france_is_paris -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_color_of_sky_is_blue -v -s
pytest tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_two_plus_two_equals_four -v -s
pytest tests/openai_tests/test_chat_completions.py::TestMultiTurn::test_recalls_name_from_context -v -s
pytest tests/openai_tests/test_chat_completions.py::TestStreaming::test_streaming_factual_answer_paris -v -s

Run tests matching a keyword

# All factual answer tests
pytest tests/openai_tests/ -v -s -k "factual"

# All streaming tests
pytest tests/openai_tests/ -v -s -k "streaming"

# All multi-turn tests
pytest tests/openai_tests/ -v -s -k "multi_turn"

# All tokenize tests
pytest tests/openai_tests/ -v -s -k "tokenize"

# All roundtrip tests
pytest tests/openai_tests/ -v -s -k "roundtrip"

Test Classes

test_endpoints.py

| Class      | Endpoint       | What it tests                                                        |
|------------|----------------|----------------------------------------------------------------------|
| TestHealth | GET /health    | HTTP 200, response time under 2000 ms                                |
| TestModels | GET /v1/models | Structure, non-empty list, expected model present, field validation  |
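As an illustrative sketch of the structure checks a `TestModels` case performs (the helper name `model_ids` and the exact assertions are assumptions, not the repo's code):

```python
def model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style GET /v1/models response body."""
    # vLLM returns {"object": "list", "data": [{"id": ..., "object": "model"}, ...]}
    assert models_response.get("object") == "list"
    data = models_response.get("data", [])
    assert data, "model list must be non-empty"
    return [m["id"] for m in data]


# Example body in the shape vLLM returns
sample = {
    "object": "list",
    "data": [{"id": "meta-llama/Llama-3.2-1B-Instruct", "object": "model"}],
}
print(model_ids(sample))  # ['meta-llama/Llama-3.2-1B-Instruct']
```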

test_chat_completions.py

| Class             | Endpoint                  | What it tests                                                          |
|-------------------|---------------------------|------------------------------------------------------------------------|
| TestBasicResponse | POST /v1/chat/completions | HTTP 200, response structure, usage fields, finish reason, max_tokens  |
| TestFactualAnswers | POST /v1/chat/completions | Correct answers: Paris, blue, 4, Tokyo, 5                             |
| TestMultiTurn     | POST /v1/chat/completions | Context recall across multiple conversation turns                      |
| TestStreaming     | POST /v1/chat/completions | SSE streaming chunks, non-empty output, factual answers via stream     |
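The streaming tests consume vLLM's OpenAI-style server-sent events, where each line is `data: {...}` and the stream ends with `data: [DONE]`. A hedged sketch of accumulating the delta content (the repo's `TestStreaming` class may parse chunks differently):

```python
import json


def accumulate_sse_content(lines) -> str:
    """Join the delta content from an OpenAI-style chat-completions SSE stream."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore blank keep-alive lines
        body = line[len("data: "):]
        if body.strip() == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(body)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Par"}}]}',
    'data: {"choices": [{"delta": {"content": "is"}}]}',
    'data: [DONE]',
]
print(accumulate_sse_content(sample))  # Paris
```

A test can then assert on the assembled string exactly as it does for the non-streaming factual answers.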

Example Output

tests/openai_tests/test_endpoints.py::TestTokenize::test_tokenize_returns_non_empty_tokens PASSED
    Tested: POST /v1/tokenize returns non-empty token list
    Prompt: Hello world
    Tokens: [9906, 1917]
    Count: 2

tests/openai_tests/test_endpoints.py::TestDetokenize::test_tokenize_detokenize_roundtrip PASSED
    Tested: Tokenize → Detokenize round-trip recovers original text
    Original: What is the capital of France?
    Tokens: [3923, 374, 279, 6864, 315, 9822, 30]
    Recovered: What is the capital of France?

tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_capital_of_france_is_paris PASSED
    Tested: Factual answer: 'What is the capital of France?'
    Expected: paris
    Response: Paris

tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_color_of_sky_is_blue PASSED
    Tested: Factual answer: 'What color is the sky?'
    Expected: blue
    Response: Blue

tests/openai_tests/test_chat_completions.py::TestFactualAnswers::test_two_plus_two_equals_four PASSED
    Tested: Factual answer: 'What is 2 + 2?'
    Expected: 4
    Response: 4

Starting the Server

Bare Python

# Chat model (Llama 3.2 1B Instruct)
python3 -m vllm.entrypoints.openai.api_server \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8009 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --model meta-llama/Llama-3.2-1B-Instruct

# Embedding model
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Embedding-0.6B \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009

# Reranker model
python3 -m vllm.entrypoints.openai.api_server \
  --model BAAI/bge-reranker-v2-m3 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009

# Chat model with tool calling (Mistral)
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --port 8009 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice

# Vision-language model
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --port 8009

Docker

sudo docker run -it --rm \
  -e HUGGING_FACE_HUB_TOKEN="<your_token>" \
  -e HF_HUB_OFFLINE=0 \
  --network=host \
  --device=/dev/neuron0 \
  --device=/dev/neuron1 \
  -p 8009:8009 \
  vllm-neuron:latest \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8009 \
  --tensor-parallel-size 2 \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --additional-config '{ "override_neuron_config": { "async_mode": true } }'
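Model loading can take a while, so it helps to wait for the server before kicking off pytest. A minimal readiness poll against GET /health using only the standard library (the function name `wait_for_server` is an assumption, not part of the repo):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str = "http://localhost:8009",
                    timeout_s: float = 120.0) -> bool:
    """Poll GET /health until it returns HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry
        time.sleep(2)
    return False
```

For example, a launch script might run `wait_for_server()` and abort if it returns False instead of letting every test fail with connection errors.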
