296 changes: 296 additions & 0 deletions tests/ci/test_docs_end_to_end/DOC_CI_TEST_README.md
@@ -0,0 +1,296 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Adding New End-to-End Tests for Documentation Examples

This guide explains how to add new end-to-end tests for server examples in the AIPerf documentation.

## Overview

The end-to-end test framework automatically discovers and tests server examples from markdown documentation files. It:
1. Parses markdown files for specially tagged bash commands
2. Builds an AIPerf Docker container
3. For each discovered server:
- Runs the server setup command
- Waits for the server to become healthy
- Executes AIPerf benchmark commands
- Validates results and cleans up

## How Tests Are Discovered

The test parser (`parser.py`) scans all markdown files (`*.md`) in the repository and looks for HTML comment tags with specific patterns:

- **Setup commands**: `<!-- setup-{server-name}-endpoint-server -->`
- **Health checks**: `<!-- health-check-{server-name}-endpoint-server -->`
- **AIPerf commands**: `<!-- aiperf-run-{server-name}-endpoint-server -->`

Each tag must be followed by a bash code block (` ```bash ... ``` `) containing the actual command.
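
For illustration only, the sketch below shows one way such tagged blocks could be matched with a regular expression. This is not the actual implementation in `parser.py`; the function name and regex groups here are made up.

```python
import re
from pathlib import Path

# Hypothetical pattern: an opening tag such as
# <!-- aiperf-run-myserver-endpoint-server --> followed immediately by a bash code block.
TAG_BLOCK = re.compile(
    r"<!--\s*(?P<kind>setup|health-check|aiperf-run)-(?P<server>[\w-]+)-endpoint-server\s*-->"
    r"\s*```bash\n(?P<command>.*?)```",
    re.DOTALL,
)

def discover_commands(markdown_path: str) -> list[dict]:
    """Return every tagged command found in one markdown file."""
    text = Path(markdown_path).read_text()
    return [
        {
            "kind": match.group("kind"),
            "server": match.group("server"),
            "command": match.group("command").strip(),
        }
        for match in TAG_BLOCK.finditer(text)
    ]
```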

## Adding a New Server Test

To add tests for a new server, you need to add three types of tagged commands to your documentation:

### 1. Server Setup Command

Tag the bash command that starts your server:

<!-- setup-myserver-endpoint-server -->
```bash
# Start your server
docker run --gpus all -p 8000:8000 myserver/image:latest \
  --model my-model \
  --host 0.0.0.0 --port 8000
```
<!-- /setup-myserver-endpoint-server -->

**Important notes:**
- The server name (`myserver` in this example) must be consistent across all three tag types
- The setup command runs in the background
- The command should start a long-running server process
- Use port 8000 or ensure your health check targets the correct port

### 2. Health Check Command

Tag a bash command that waits for your server to be ready:

<!-- health-check-myserver-endpoint-server -->
```bash
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/health -H "Content-Type: application/json")" != "200" ]; do sleep 2; done' || { echo "Server not ready after 15min"; exit 1; }
```
<!-- /health-check-myserver-endpoint-server -->

**Important notes:**
- The health check should poll the server until it responds successfully
- Use a reasonable timeout (e.g., 900 seconds = 15 minutes)
- The command must exit with code 0 when the server is healthy
- The command must exit with non-zero code if the server fails to start

### 3. AIPerf Run Commands

Tag one or more AIPerf benchmark commands:

<!-- aiperf-run-myserver-endpoint-server -->
```bash
aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 10 \
  --max-tokens 100
```
<!-- /aiperf-run-myserver-endpoint-server -->

You can have multiple `aiperf-run` commands for the same server. Each will be executed sequentially against the same running server instance (the server is NOT restarted between commands):

<!-- aiperf-run-myserver-endpoint-server -->
```bash
# First test: streaming mode
aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 10
```
<!-- /aiperf-run-myserver-endpoint-server -->

<!-- aiperf-run-myserver-endpoint-server -->
```bash
# Second test: non-streaming mode
aiperf profile \
  --model my-model \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --num-prompts 10
```
<!-- /aiperf-run-myserver-endpoint-server -->

**Important notes:**
- Do NOT include the `--ui-type` flag - the test framework adds `--ui-type simple` automatically
- Each command is executed inside the AIPerf Docker container
- Commands should complete in a reasonable time (default timeout: 300 seconds)
- Use small values for `--num-prompts` and `--max-tokens` to keep tests fast

> **Reviewer (Contributor):** Can you add a note that the server is not taken down in between aiperf profile commands?
>
> **Author (Member):** Yes, that is a good note; I missed that and will add it to the README.

- The server is NOT restarted between multiple aiperf commands - all commands run against the same server instance

## Complete Example

Here's a complete example for a new server called "fastapi":

### Running FastAPI Server

Start the FastAPI server:

<!-- setup-fastapi-endpoint-server -->
```bash
docker run --gpus all -p 8000:8000 mycompany/fastapi-llm:latest \
  --model-name meta-llama/Llama-3.2-1B \
  --host 0.0.0.0 \
  --port 8000
```
<!-- /setup-fastapi-endpoint-server -->

Wait for the server to be ready:

<!-- health-check-fastapi-endpoint-server -->
```bash
timeout 600 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/models)" != "200" ]; do sleep 2; done' || { echo "FastAPI server not ready after 10min"; exit 1; }
```
<!-- /health-check-fastapi-endpoint-server -->

Profile the model:

<!-- aiperf-run-fastapi-endpoint-server -->
```bash
aiperf profile \
  --model meta-llama/Llama-3.2-1B \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 20 \
  --max-tokens 50
```
<!-- /aiperf-run-fastapi-endpoint-server -->

## Running the Tests

### Run all discovered tests:

```bash
cd tests/ci/test_docs_end_to_end
python main.py
```

### Dry run to see what would be tested:

```bash
python main.py --dry-run
```

### Test specific servers:

Currently, the framework tests only the first discovered server by default. Use `--all-servers` to test every discovered server:

```bash
python main.py --all-servers
```

## Validation Rules

The test framework validates that each server has:
- Exactly ONE setup command (duplicates cause test failure)
- Exactly ONE health check command (duplicates cause test failure)
- At least ONE aiperf command

If any of these requirements are not met, the tests will fail with a clear error message.
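
Purely as an illustration, these checks amount to something like the following sketch; the actual validation code in the framework may be organized differently.

```python
def validate_server(name: str, setup_cmds: list, health_cmds: list, aiperf_cmds: list) -> None:
    """Illustrative check of the per-server command counts described above."""
    if len(setup_cmds) != 1:
        raise ValueError(f"{name}: expected exactly one setup command, found {len(setup_cmds)}")
    if len(health_cmds) != 1:
        raise ValueError(f"{name}: expected exactly one health check command, found {len(health_cmds)}")
    if not aiperf_cmds:
        raise ValueError(f"{name}: expected at least one aiperf command, found none")
```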

## Test Execution Flow

For each server, the test runner:

1. **Build Phase**: Builds the AIPerf Docker container (once for all tests)
2. **Setup Phase**: Starts the server in the background
3. **Health Check Phase**: Waits for the server to be ready (runs in parallel with setup)
4. **Test Phase**: Executes all AIPerf commands sequentially against the same running server instance
5. **Cleanup Phase**: Gracefully shuts down the server and cleans up Docker resources

**Note**: The server remains running throughout all AIPerf commands. It is only shut down once during the cleanup phase after all tests complete.
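
For intuition, here is a heavily simplified, self-contained sketch of phases 2-5 using `subprocess`. It is not the actual `test_runner.py` (the real runner works through Docker containers), and the function name and timeouts here are illustrative.

```python
import subprocess

def run_server_tests(setup_cmd: str, health_cmd: str, aiperf_cmds: list[str]) -> None:
    """Simplified sketch of the setup -> health check -> test -> cleanup phases."""
    # Setup phase: launch the server command in the background.
    server = subprocess.Popen(["bash", "-c", setup_cmd])
    try:
        # Health check phase: the tagged command blocks until the server responds.
        subprocess.run(["bash", "-c", health_cmd], check=True, timeout=900)

        # Test phase: every aiperf command runs against the same server instance,
        # with --ui-type simple appended automatically.
        for cmd in aiperf_cmds:
            subprocess.run(["bash", "-c", f"{cmd} --ui-type simple"], check=True, timeout=300)
    finally:
        # Cleanup phase: the server is shut down only after all commands finish.
        server.terminate()
        server.wait(timeout=30)
```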

## Common Patterns

### Pattern: OpenAI-compatible API

<!-- setup-myserver-endpoint-server -->
```bash
docker run --gpus all -p 8000:8000 myserver:latest \
  --model model-name \
  --host 0.0.0.0 --port 8000
```
<!-- /setup-myserver-endpoint-server -->

<!-- health-check-myserver-endpoint-server -->
```bash
timeout 900 bash -c 'while [ "$(curl -s -o /dev/null -w "%{http_code}" localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\":\"model-name\",\"messages\":[{\"role\":\"user\",\"content\":\"test\"}],\"max_tokens\":1}")" != "200" ]; do sleep 2; done' || { echo "Server not ready"; exit 1; }
```
<!-- /health-check-myserver-endpoint-server -->

<!-- aiperf-run-myserver-endpoint-server -->
```bash
aiperf profile \
  --model model-name \
  --endpoint-type chat \
  --endpoint /v1/chat/completions \
  --service-kind openai \
  --streaming \
  --num-prompts 10 \
  --max-tokens 100
```
<!-- /aiperf-run-myserver-endpoint-server -->

## Troubleshooting

### Tests not discovered

- Verify tag format: `setup-{name}-endpoint-server`, `health-check-{name}-endpoint-server`, `aiperf-run-{name}-endpoint-server`
- Ensure bash code block immediately follows the tag
- Check that the server name is consistent across all three tag types
- Run `python main.py --dry-run` to see what's discovered

### Health check timeout

- Increase the timeout value in your health check command
- Verify the health check endpoint is correct
- Check server logs: the test runner shows setup output for 30 seconds
- Ensure your server starts on the expected port

### AIPerf command fails

- Test your AIPerf command manually first
- Use small values for `--num-prompts` and `--max-tokens`
- Verify the model name matches what the server expects
- Check that the endpoint URL is correct

### Duplicate command errors

If you see errors like "DUPLICATE SETUP COMMAND", you have multiple commands with the same server name:
- Search your docs for all instances of that tag (a quick helper sketch for this follows below)
- Ensure each server has a unique name
- Or remove duplicate tags if they're truly duplicates
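
If it helps, a small throwaway script like the one below can list every place a given server name is tagged. It assumes the tag format described in this guide and is not part of the test framework.

```python
import re
from pathlib import Path

def find_tag_occurrences(repo_root: str, server: str) -> list[tuple[str, int]]:
    """Return (file, line number) for every opening endpoint-server tag naming `server`."""
    pattern = re.compile(
        rf"<!--\s*(?:setup|health-check|aiperf-run)-{re.escape(server)}-endpoint-server\s*-->"
    )
    hits = []
    for path in Path(repo_root).rglob("*.md"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno))
    return hits

# Example: print(find_tag_occurrences(".", "myserver"))
# Hits in DOC_CI_TEST_README.md are harmless, since the parser skips that file.
```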

## Best Practices

1. **Keep tests fast**: Use minimal `--num-prompts` (10-20) and small `--max-tokens` values
2. **Use standard ports**: Default to 8000 for consistency
3. **Add timeouts**: Always include timeouts in health checks
4. **Test locally first**: Run commands manually before adding tags
5. **One server per doc section**: Avoid mixing multiple servers in the same doc section
6. **Clear error messages**: Include helpful error messages in health checks
7. **Document requirements**: Note any GPU, memory, or dependency requirements in surrounding text

## Architecture Reference

Key files in the test framework:

- `main.py`: Entry point, orchestrates parsing and testing
- `parser.py`: Markdown parser that discovers tagged commands
- `test_runner.py`: Executes tests for each server
- `constants.py`: Configuration constants (timeouts, tag patterns)
- `data_types.py`: Data models for commands and servers (see the illustrative sketch below)
- `utils.py`: Utility functions for Docker operations
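
The data model is roughly along the lines sketched below. `Server` is the type the parser returns, but the exact fields and the `Command` class shown here are guesses for illustration and may not match `data_types.py`.

```python
from dataclasses import dataclass, field

@dataclass
class Command:
    """One tagged bash command discovered in the docs (illustrative fields)."""
    tag: str     # e.g. "aiperf-run-myserver-endpoint-server"
    script: str  # contents of the bash code block
    source: str  # markdown file the command came from

@dataclass
class Server:
    """All commands that share one server name (illustrative fields)."""
    name: str
    setup: Command | None = None
    health_check: Command | None = None
    aiperf_runs: list[Command] = field(default_factory=list)
```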

## Constants and Configuration

Key constants in `constants.py`:

- `SETUP_MONITOR_TIMEOUT`: 30 seconds (how long to monitor setup output)
- `CONTAINER_BUILD_TIMEOUT`: 600 seconds (Docker build timeout)
- `AIPERF_COMMAND_TIMEOUT`: 300 seconds (per-command timeout)
- `AIPERF_UI_TYPE`: "simple" (auto-added to all aiperf commands)

To modify these, edit `constants.py`.
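
Put differently, the values listed above correspond to constants along these lines; the names and values come from the list above, and `constants.py` may define more.

```python
# Values as documented above; constants.py may contain additional settings.
SETUP_MONITOR_TIMEOUT = 30      # seconds to monitor the setup command's output
CONTAINER_BUILD_TIMEOUT = 600   # seconds allowed for the Docker build
AIPERF_COMMAND_TIMEOUT = 300    # seconds allowed per aiperf command
AIPERF_UI_TYPE = "simple"       # appended to every aiperf command via --ui-type
```
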
4 changes: 4 additions & 0 deletions tests/ci/test_docs_end_to_end/parser.py
@@ -35,6 +35,10 @@ def parse_directory(self, directory: str) -> dict[str, Server]:
logger.info(f"Parsing markdown files in {directory}")

for file_path in Path(directory).rglob("*.md"):
# Skip the documentation file for this test framework
if file_path.name == "DOC_CI_TEST_README.md":
logger.info(f"Skipping documentation file: {file_path}")
continue
logger.info(f"Parsing file: {file_path}")
self._parse_file(str(file_path))
