47 changes: 33 additions & 14 deletions integration-tests/README.md
@@ -23,7 +23,7 @@ The server echoes user prompts back token by token with configurable latencies,
- **Pre-loading support**: Pre-load tokenizers at startup for faster responses
- **Runtime configuration**: Configure server settings via `/configure` endpoint
- **Comprehensive logging**: Configurable log levels and access logs
- **Error simulation**: Returns 404 for unsupported models to simulate real-world scenarios
- **Fallback tokenizer**: Uses a configurable fallback tokenizer (default: Qwen/Qwen3-0.6B) when requested model is unavailable

## Installation

@@ -39,14 +39,18 @@ pip install -e ".[dev]"

## Usage

> [!IMPORTANT]
>You must provide a model tokenizer to load either on start or dynamically
> after running using the /configure endpoint, or all requests will return HTTP 404.
> [!NOTE]
> The server includes a default fallback tokenizer (Qwen/Qwen3-0.6B) that will be used automatically
> if the requested model's tokenizer is not available. You can configure a different fallback tokenizer
> via command line arguments or the /configure endpoint.

### Command Line

```bash
# Basic usage
# Basic usage (uses default fallback tokenizer)
aiperf-mock-server

# With specific model pre-loaded
aiperf-mock-server -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B

# Custom configuration with short flags
@@ -60,14 +64,16 @@ aiperf-mock-server \
--host 127.0.0.1 \
--workers 4 \
--log-level DEBUG \
--tokenizer-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--tokenizer-models deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--fallback-tokenizer Qwen/Qwen3-0.6B

# With environment variables
export MOCK_SERVER_PORT=8000
export MOCK_SERVER_TTFT=30
export MOCK_SERVER_ITL=10
export MOCK_SERVER_LOG_LEVEL=DEBUG
export MOCK_SERVER_TOKENIZER_MODELS='["deepseek-ai/DeepSeek-R1-Distill-Llama-8B"]'
export MOCK_SERVER_FALLBACK_TOKENIZER="Qwen/Qwen3-0.6B"
aiperf-mock-server
```

@@ -83,6 +89,7 @@ All configuration options can be set via environment variables with the `MOCK_SE
- `MOCK_SERVER_LOG_LEVEL`: Logging level (default: INFO)
- `MOCK_SERVER_ACCESS_LOGS`: Enable HTTP access logs (default: false)
- `MOCK_SERVER_TOKENIZER_MODELS`: JSON-formatted array of models to pre-load
- `MOCK_SERVER_FALLBACK_TOKENIZER`: Fallback tokenizer model (default: Qwen/Qwen3-0.6B)

### API Usage

@@ -119,13 +126,16 @@ curl -X POST http://localhost:8000/v1/chat/completions \
#### Runtime Configuration

```bash
# Configure latencies at runtime
# Configure latencies and tokenizers at runtime.
# The following are all possible configuration options.
# See the Configuration Options section for more details.
curl -X POST http://localhost:8000/configure \
-H "Content-Type: application/json" \
-d '{
"ttft": 100,
"itl": 25,
"tokenizer_models": ["deepseek-ai/DeepSeek-R1-Distill-Llama-8B"]
"tokenizer_models": ["deepseek-ai/DeepSeek-R1-Distill-Llama-8B"],
"fallback_tokenizer": "Qwen/Qwen3-0.6B"
}'
```

@@ -153,6 +163,7 @@ curl http://localhost:8000/
| Log Level | `--log-level`, `-l` | `MOCK_SERVER_LOG_LEVEL` | INFO | Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) |
| Access Logs | `--access-logs`, `-a` | `MOCK_SERVER_ACCESS_LOGS` | false | Enable HTTP access logs |
| Tokenizer Models | `--tokenizer-models`, `-m` | `MOCK_SERVER_TOKENIZER_MODELS` | [] | Models to pre-load at startup |
| Fallback Tokenizer | `--fallback-tokenizer` | `MOCK_SERVER_FALLBACK_TOKENIZER` | Qwen/Qwen3-0.6B | Fallback tokenizer when requested model's tokenizer is unavailable |

Configuration priority (highest to lowest):
1. CLI arguments
@@ -175,7 +186,8 @@ Runtime configuration endpoint for updating server settings.
{
"ttft": 50,
"itl": 15,
"tokenizer_models": ["deepseek-ai/DeepSeek-R1-Distill-Llama-8B"]
"tokenizer_models": ["deepseek-ai/DeepSeek-R1-Distill-Llama-8B"],
"fallback_tokenizer": "Qwen/Qwen3-0.6B"
}
```

@@ -188,13 +200,17 @@ Root endpoint providing server information and available endpoints.
## How It Works

1. **Request Processing**: The server receives a chat completion request
2. **Tokenization**: Uses the model-specific tokenizer to tokenize the user prompt
3. **Token Limit**: Respects the `max_tokens` parameter if specified
4. **Latency Simulation**:
2. **Tokenizer Selection**:
- First attempts to use the requested model's tokenizer if pre-loaded
- Falls back to the configured fallback tokenizer if the requested model is unavailable
- Returns 404 only if both the requested model and fallback tokenizer fail
3. **Tokenization**: Uses the selected tokenizer to tokenize the user prompt
4. **Token Limit**: Respects the `max_completion_tokens` parameter if specified
5. **Latency Simulation** (see the sketch after this list):
- Waits for the configured TTFT before sending the first token
- Waits for the configured ITL between subsequent tokens
- Uses `perf_counter` for precise timing control
5. **Response**: Echoes back the tokenized prompt either as:
6. **Response**: Echoes back the tokenized prompt either as:
- A complete response (non-streaming)
- Token-by-token chunks (streaming)
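
A minimal sketch of the pacing loop, assuming an asyncio-based handler (the generator name and chunk handling below are illustrative, not the server's actual code; only the TTFT/ITL pacing and the use of `perf_counter` come from the steps above):

```python
import asyncio
import time


async def stream_tokens(tokens: list[str], ttft_ms: float, itl_ms: float):
    """Yield tokens paced by TTFT (first token) and ITL (each subsequent token)."""
    start = time.perf_counter()
    for i, token in enumerate(tokens):
        # Target send time: TTFT for token 0, plus one ITL per additional token.
        target = (ttft_ms + i * itl_ms) / 1000.0
        remaining = target - (time.perf_counter() - start)
        if remaining > 0:
            await asyncio.sleep(remaining)
        yield token  # the real server wraps each token in a streaming response chunk
```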

@@ -204,7 +220,10 @@ The server uses Hugging Face Transformers to load tokenizers for any supported m
- Available on Hugging Face Hub
- Compatible with `AutoTokenizer.from_pretrained()`

If a tokenizer fails to load for a requested model, the server returns a 404 error to simulate model unavailability.
**Fallback Behavior** (sketched below):
- If a tokenizer has not been pre-loaded for a requested model, the server automatically falls back to the configured fallback tokenizer (default: `Qwen/Qwen3-0.6B`)
- The server only returns a 404 error if both the requested model's tokenizer and the fallback tokenizer fail to load
- This ensures the server remains functional even when specific model tokenizers are unavailable
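
The lookup order above boils down to a few lines; this is a rough sketch of the selection logic, not the server's exact implementation:

```python
def select_tokenizer(requested: str, loaded: dict, fallback: str):
    """Illustrative lookup: requested model first, then the fallback, otherwise fail."""
    if requested in loaded:
        return loaded[requested]
    if fallback in loaded:
        return loaded[fallback]
    raise LookupError("no tokenizer available")  # surfaced to the client as HTTP 404
```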

## Development

19 changes: 14 additions & 5 deletions integration-tests/mock_server/app.py
@@ -40,10 +40,15 @@ async def lifespan(_: FastAPI):
"""Initialize tokenizers and other startup tasks."""
logger.info("Server configuration: %s", server_config.model_dump())

if server_config.tokenizer_models:
logger.info(f"Pre-loading tokenizer models: {server_config.tokenizer_models}")
tokenizer_service.load_tokenizers(server_config.tokenizer_models)
logger.info("Tokenizer models loaded successfully")
tokenizer_models = [
*server_config.tokenizer_models,
server_config.fallback_tokenizer,
]

logger.info(f"Pre-loading tokenizer models: {tokenizer_models}")
tokenizer_service.set_fallback_tokenizer(server_config.fallback_tokenizer)
tokenizer_service.load_tokenizers(tokenizer_models)
logger.info("Tokenizer models loaded successfully")

yield

@@ -71,6 +76,7 @@ def set_server_config(config: MockServerConfig) -> None:
os.environ["MOCK_SERVER_PORT"] = str(config.port)
os.environ["MOCK_SERVER_WORKERS"] = str(config.workers)
os.environ["MOCK_SERVER_ACCESS_LOGS"] = str(config.access_logs)
os.environ["MOCK_SERVER_FALLBACK_TOKENIZER"] = str(config.fallback_tokenizer)


def extract_user_prompt(messages: list[ChatMessage]) -> str:
@@ -135,7 +141,10 @@ async def configure(request: ConfigureMessage):
logger.info(f"Loading tokenizer models: {request.tokenizer_models}")
tokenizer_service.load_tokenizers(request.tokenizer_models)
logger.info("Tokenizer models loaded successfully")

if request.fallback_tokenizer is not None:
tokenizer_service.load_tokenizers([request.fallback_tokenizer])
Contributor:

Is this always supposed to be called, even if the tokenizer specified exists?

Contributor (Author):
Yeah, I need to add a check to see if it's already been loaded; CodeRabbit brought that up a few times.
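
A minimal sketch of that guard, assuming the existing `_tokenizers` dict, module-level `logger`, and `AutoTokenizer` import (any other details below are illustrative):

```python
def load_tokenizers(self, model_names: list[str]) -> None:
    """Load tokenizers, skipping models that are already cached."""
    for model_name in model_names:
        if model_name in self._tokenizers:
            logger.debug("Tokenizer for %s already loaded; skipping", model_name)
            continue
        self._tokenizers[model_name] = AutoTokenizer.from_pretrained(model_name)
```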

tokenizer_service.set_fallback_tokenizer(request.fallback_tokenizer)
logger.info(f"Fallback tokenizer set to {request.fallback_tokenizer}")
Comment on lines +144 to +147

⚠️ Potential issue | 🟠 Major

Persist fallback choice back into server_config

We load and set the new fallback on the service, but server_config.fallback_tokenizer stays at its previous value. As a result, /configure responses and /health still report the old fallback, and any code that later reads server_config.fallback_tokenizer (including multi-worker env propagation or reload flows) will drift from the active tokenizer. Please assign it so the in-memory config stays truthful.

     if request.fallback_tokenizer is not None:
         tokenizer_service.load_tokenizers([request.fallback_tokenizer])
         tokenizer_service.set_fallback_tokenizer(request.fallback_tokenizer)
+        server_config.fallback_tokenizer = request.fallback_tokenizer
         logger.info(f"Fallback tokenizer set to {request.fallback_tokenizer}")
🤖 Prompt for AI Agents
In integration-tests/mock_server/app.py around lines 144 to 147, the code sets
the fallback on the tokenizer_service but doesn't update the in-memory
server_config; set server_config.fallback_tokenizer = request.fallback_tokenizer
after loading/setting the tokenizer so the server_config reflects the active
fallback (ensure server_config is in scope or import it if needed).

return {"status": "configured", "config": server_config.model_dump()}


7 changes: 7 additions & 0 deletions integration-tests/mock_server/config.py
@@ -104,6 +104,13 @@ class MockServerConfig(BaseSettings):
),
] = []

fallback_tokenizer: Annotated[
str,
Field(
description="Fallback tokenizer to use if the requested tokenizer is not found",
),
] = "Qwen/Qwen3-0.6B"

access_logs: Annotated[
bool,
Field(
4 changes: 4 additions & 0 deletions integration-tests/mock_server/models.py
@@ -20,6 +20,10 @@ class ConfigureMessage(BaseModel):
tokenizer_models: list[str] | None = Field(
default=None, description="List of tokenizer models to load"
)
fallback_tokenizer: str | None = Field(
default=None,
description="Fallback tokenizer to use if the requested tokenizer is not found",
)


class Role(str, Enum):
22 changes: 19 additions & 3 deletions integration-tests/mock_server/tokenizer_service.py
@@ -2,10 +2,17 @@
# SPDX-License-Identifier: Apache-2.0
"""Tokenizer service for handling different model tokenizers."""

import contextlib
import io
import logging

from transformers import AutoTokenizer
from transformers.tokenization_utils import PreTrainedTokenizer
# Silence tokenizer warning on import and first use
with (
contextlib.redirect_stdout(io.StringIO()) as _,
contextlib.redirect_stderr(io.StringIO()),
):
from transformers import AutoTokenizer
from transformers.tokenization_utils import PreTrainedTokenizer

logger = logging.getLogger(__name__)

@@ -15,6 +22,7 @@ class TokenizerService:

def __init__(self):
self._tokenizers: dict[str, PreTrainedTokenizer] = {}
self._fallback_tokenizer: str | None = None

def load_tokenizers(self, model_names: list[str]) -> None:
"""Pre-load tokenizers for one or more models.
@@ -34,7 +42,11 @@ def load_tokenizers(self, model_names: list[str]) -> None:
def get_tokenizer(self, model_name: str) -> PreTrainedTokenizer:
"""Get or create a tokenizer for the specified model."""
if model_name not in self._tokenizers:
raise ValueError(f"No tokenizer loaded for {model_name}")
if self._fallback_tokenizer not in self._tokenizers:
raise ValueError(
f"No tokenizer loaded for {model_name} or {self._fallback_tokenizer}"
)
model_name = self._fallback_tokenizer

Comment on lines 44 to 50

⚠️ Potential issue | 🟠 Major

Restore lazy tokenizer loading before falling back.

We now short-circuit to the fallback whenever the requested model key is missing, which means we never even try to lazily load the requested tokenizer anymore. In the current server flows that rely on lazy loading (run without preloading, runtime configure updates, etc.), this silently swaps responses to the fallback tokenizer or just raises when no fallback is configured—a regression from today’s behavior. Please attempt to load the requested tokenizer first and only fall back when that load really fails, while also bootstrapping the fallback if it hasn’t been loaded yet.

-        if model_name not in self._tokenizers:
-            if self._fallback_tokenizer not in self._tokenizers:
-                raise ValueError(
-                    f"No tokenizer loaded for {model_name} or {self._fallback_tokenizer}"
-                )
-            model_name = self._fallback_tokenizer
+        if model_name not in self._tokenizers:
+            try:
+                logger.info(f"Lazy-loading tokenizer for model: {model_name}")
+                self._tokenizers[model_name] = AutoTokenizer.from_pretrained(
+                    model_name, trust_remote_code=True
+                )
+            except Exception as exc:
+                fallback = self._fallback_tokenizer
+                if not fallback:
+                    raise ValueError(
+                        f"No tokenizer loaded for {model_name}"
+                    ) from exc
+                if fallback not in self._tokenizers:
+                    logger.info(f"Lazy-loading fallback tokenizer: {fallback}")
+                    self._tokenizers[fallback] = AutoTokenizer.from_pretrained(
+                        fallback, trust_remote_code=True
+                    )
+                model_name = fallback

return self._tokenizers[model_name]

@@ -57,6 +69,10 @@ def count_tokens(self, text: str, model_name: str) -> int:
tokenizer = self.get_tokenizer(model_name)
return len(tokenizer.encode(text, add_special_tokens=False))

def set_fallback_tokenizer(self, fallback_tokenizer: str) -> None:
"""Set the fallback tokenizer to use if the requested tokenizer is not found."""
self._fallback_tokenizer = fallback_tokenizer


# Global tokenizer service instance
tokenizer_service = TokenizerService()