39 changes: 39 additions & 0 deletions docs/en/user_guides/stress_test/multi_turn.md
@@ -19,6 +19,7 @@ The multi-turn conversation benchmark allows you to test a model service in real
| `--min-turns` | `int` | Minimum number of user turns per conversation; used by `random_multi_turn` only | `1` |
| `--max-turns` | `int` | Maximum number of user turns per conversation; **required** for `random_multi_turn`; optional for ShareGPT / `custom_multi_turn` datasets to truncate long conversations; for `swe_smith` live construction, the per-conversation turn count is sampled from `[min_turns, max_turns]` | `None` |
| `--dataset-offset` | `int` | Skip the first N conversations in the dataset; useful for sharded testing or avoiding cache hits | `0` |
| `--max-turn-tokens` | `list[int]` | Per-turn `max_tokens` override; accepts a list of integers specifying the maximum output tokens for each turn by index (0-based). When the list is shorter than the actual turn count, the last value is reused. Only effective in `--multi-turn` mode | `None` |

### `multi_turn_args` (swe_smith-specific parameters)

@@ -266,6 +267,44 @@ Runtime context structure (when sending turn 2):

> **Note**: The `assistant` messages in the dataset are used only to identify conversation structure and are **never** sent directly to the model. At runtime, workers always append the model's actual output to the context to ensure accurate history.

### Per-turn Output Length Control (`--max-turn-tokens`)

When benchmarking an Agent tool-calling workload, an open-source model usually does not produce structured tool-call outputs the way the actual target model does, so its per-turn output lengths differ. `--max-turn-tokens` caps the model's output length turn by turn, which approximates the context growth of the real model.

**Usage example**: A 10-turn conversation where the first 9 turns simulate tool calls (150 tokens each) and the final turn produces a complete answer (1000 tokens).

First, prepare a JSONL data file (one 10-turn conversation per line, with a system prompt of ~4000 tokens):

```json
[{"role": "system", "content": "<4000 token system prompt>"}, {"role": "user", "content": "Analyze this code"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Provide the final answer"}]
```

> **Note**: The `assistant` messages only define the conversation structure and are replaced by the model's real outputs at runtime.
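
A file like this is tedious to write by hand. The following is a minimal sketch of a generator, assuming the same layout as the example above (one JSON array of messages per line). The helper names `build_conversation` and `write_jsonl` and the placeholder texts are illustrative and not part of evalscope:

```python
import json


def build_conversation(system_prompt: str, num_turns: int) -> list:
    """Build one conversation with `num_turns` user turns.

    Placeholder assistant messages ("x") only mark the turn boundaries.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for turn in range(num_turns):
        if turn == 0:
            user_text = "Analyze this code"
        elif turn == num_turns - 1:
            user_text = "Provide the final answer"
        else:
            user_text = "Continue"
        messages.append({"role": "user", "content": user_text})
        # No trailing assistant message after the final user turn.
        if turn < num_turns - 1:
            messages.append({"role": "assistant", "content": "x"})
    return messages


def write_jsonl(path: str, conversations: list) -> None:
    """Write one conversation (a JSON array of messages) per line."""
    with open(path, "w", encoding="utf-8") as f:
        for conv in conversations:
            f.write(json.dumps(conv, ensure_ascii=False) + "\n")
```

For instance, `write_jsonl("tool_call_sim.jsonl", [build_conversation("<4000 token system prompt>", 10)])` would write a single line with 20 messages: 1 system, 10 user, and 9 placeholder assistant messages.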

Then run the benchmark:

```bash
evalscope perf \
  --model YOUR_MODEL \
  --url OPENAI_API_COMPAT_URL \
  --api openai \
  --dataset custom_multi_turn \
  --dataset-path /path/to/tool_call_sim.jsonl \
  --multi-turn \
  --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000 \
  --number 50 \
  --parallel 10 \
  --extra-args '{"ignore_eos": true}'
```

| Turn | `max_tokens` | Simulated behavior |
|------|-------------|--------------------|
| Turn 1 | 150 | Simulate initial tool call |
| Turns 2-9 | 150 | Simulate intermediate tool calls |
| Turn 10 | 1000 | Final complete answer |
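
To make the context growth concrete, the prompt size at each turn can be estimated from these limits. This is a rough sketch under stated assumptions: `ignore_eos` makes every turn generate exactly its `max_tokens`, each user message costs a made-up 10 tokens, and chat-template overhead is ignored. `estimate_prompt_tokens` is an illustrative helper, not an evalscope function:

```python
def estimate_prompt_tokens(system_tokens, user_tokens, turn_limits):
    """Estimate the prompt size (in tokens) sent at each turn.

    Assumes every earlier turn produced exactly its limit in output tokens
    and that each user message costs `user_tokens` tokens (a made-up figure).
    """
    prompt_sizes = []
    context = system_tokens
    for limit in turn_limits:
        context += user_tokens        # the new user message joins the context
        prompt_sizes.append(context)  # this is what the model receives
        context += limit              # the model's output is appended afterwards
    return prompt_sizes


limits = [150] * 9 + [1000]
sizes = estimate_prompt_tokens(system_tokens=4000, user_tokens=10, turn_limits=limits)
print(sizes[0], sizes[-1])
```

Under these assumptions the prompt grows by 160 tokens per turn, from 4010 tokens at turn 1 to 5450 tokens at turn 10.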

> **Tip**: The list is automatically extended by reusing the last value for all subsequent turns. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]` (the first turn is 150, and all subsequent turns are 1000).
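
The reuse rule can be reproduced in a few lines. This sketch applies the same index clamping as the PR's `openai_api.py` change (`min(turn_index, len(...) - 1)`); the function name `expand_turn_limits` is illustrative and not part of evalscope:

```python
from typing import List


def expand_turn_limits(limits: List[int], num_turns: int) -> List[int]:
    """Return the effective max_tokens for each of `num_turns` turns."""
    if not limits:
        raise ValueError("limits must contain at least one value")
    last = len(limits) - 1
    # Turns past the end of the list fall back to the last entry.
    return [limits[min(i, last)] for i in range(num_turns)]


print(expand_turn_limits([150, 1000], 10))
```

This prints `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]`. If the list is longer than the conversation, the extra entries are simply never used.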

**Usage example**: You have conversation data already in OpenAI messages format and want to benchmark directly without any format conversion.

First, prepare the JSONL data file (one conversation per line):
1 change: 1 addition & 0 deletions docs/en/user_guides/stress_test/parameters.md
@@ -143,6 +143,7 @@ Must be used with `--multi-turn`. See the [Multi-turn Benchmark Guide](./multi_t
| `--frequency-penalty` | `float` | frequency_penalty value | - |
| `--logprobs` | `bool` | Whether to return logarithmic probabilities | - |
| `--max-tokens` | `int` | Maximum number of tokens that can be generated | - |
| `--max-turn-tokens` | `list[int]` | **Multi-turn mode only**: per-turn override of `max_tokens`<br>• Accepts a list of integers, one per turn (turn index is 0-based)<br>• If the list is shorter than the actual turn count, the last value is reused<br>• Ignored unless `--multi-turn` is set<br>• Example: `--max-turn-tokens 150 150 150 1000` | `None` |
| `--min-tokens` | `int` | Minimum number of tokens to generate<br>Note: Not all model services support this parameter<br>For `vLLM>=0.8.1`, you need to additionally set<br>`--extra-args '{"ignore_eos": true}'` | - |
| `--n-choices` | `int` | Number of completion choices to generate | - |
| `--seed` | `int` | Random seed | `None` |
39 changes: 39 additions & 0 deletions docs/zh/user_guides/stress_test/multi_turn.md
@@ -19,6 +19,7 @@
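| `--min-turns` | `int` | Minimum number of user turns per conversation; used by `random_multi_turn` only | `1` |
| `--max-turns` | `int` | Maximum number of user turns per conversation; **required** for `random_multi_turn`; optional for ShareGPT / `custom_multi_turn` datasets to truncate long conversations; for `swe_smith` live construction, the per-conversation turn count is sampled randomly from `[min_turns, max_turns]` | `None` |
| `--dataset-offset` | `int` | Skip the first N conversations in the dataset; useful for sharded testing or avoiding cache hits | `0` |
| `--max-turn-tokens` | `list[int]` | Per-turn `max_tokens` override; accepts a list of integers specifying the maximum output tokens for each turn by turn index (0-based). When the list is shorter than the actual turn count, the last value is reused. Only effective in `--multi-turn` mode | `None` |

### `multi_turn_args` (`swe_smith`-specific parameters)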

@@ -266,6 +267,44 @@ evalscope perf \

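> **Note**: The `assistant` messages in the dataset are used only to identify conversation structure and are **never** sent directly to the model. At runtime, workers always append the model's actual output to the context to keep the history accurate.

### Per-turn Output Length Control (`--max-turn-tokens`)

When benchmarking an Agent tool-calling workload, an open-source model cannot emit tool-call structures the way the actual model does, so its per-turn output lengths differ from the actual model's. `--max-turn-tokens` limits the model's output length turn by turn, which approximates the context growth of the actual model.

**Usage example**: A 10-turn conversation where the first 9 turns simulate tool calls (150 tokens each) and the final turn produces a complete answer (1000 tokens).

First, prepare a JSONL data file (one 10-turn conversation per line, with a system prompt of about 4000 tokens):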

```json
[{"role": "system", "content": "<4000 token system prompt>"}, {"role": "user", "content": "Help me analyze this code"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Please provide the complete final answer"}]
```

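> **Note**: The `assistant` messages only define the conversation structure and are replaced by the model's real outputs at runtime.

Then run the benchmark: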

```bash
evalscope perf \
  --model YOUR_MODEL \
  --url OPENAI_API_COMPAT_URL \
  --api openai \
  --dataset custom_multi_turn \
  --dataset-path /path/to/tool_call_sim.jsonl \
  --multi-turn \
  --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000 \
  --number 50 \
  --parallel 10 \
  --extra-args '{"ignore_eos": true}'
```

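| Turn | `max_tokens` | Simulated behavior |
|------|-------------|--------------------|
| Turn 1 | 150 | Simulate the first tool call |
| Turns 2-9 | 150 | Simulate intermediate tool calls |
| Turn 10 | 1000 | Final complete answer |

> **Tip**: When the list is too short, the last value is automatically reused for all subsequent turns. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]` (the first turn is 150, and all subsequent turns are 1000).

**Usage example**: Use case: you already have conversation data in OpenAI messages format and want to use it directly for multi-turn stress testing, without any format conversion.

First, prepare the JSONL data file (one conversation per line):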
1 change: 1 addition & 0 deletions docs/zh/user_guides/stress_test/parameters.md
@@ -144,6 +144,7 @@ For details on the SLA auto-tuning feature, see the [Auto-tuning Guide](./sla_auto_tune.md).
| `--frequency-penalty` | `float` | frequency_penalty value | - |
| `--logprobs` | `bool` | Whether to return logarithmic probabilities | - |
| `--max-tokens` | `int` or `int int` | Maximum number of tokens that can be generated<br>• Single integer: fixed value, e.g. `--max-tokens 2048`<br>• Two integers: `min max`; each request samples uniformly at random from this range, e.g. `--max-tokens 512 2048` | `2048` |
| `--max-turn-tokens` | `int list` | **Multi-turn mode only**: per-turn override of `max_tokens`<br>• Accepts a list of integers specifying the maximum output tokens for each turn by turn index (0-based)<br>• If the list is shorter than the actual turn count, the last value is reused<br>• Only effective in `--multi-turn` mode; otherwise ignored<br>• Example: `--max-turn-tokens 150 150 150 1000` | `None` |
| `--min-tokens` | `int` | Minimum number of tokens to generate<br>Note: Not all model services support this parameter<br>For `vLLM>=0.8.1`, you need to additionally set<br>`--extra-args '{"ignore_eos": true}'` | - |
| `--n-choices` | `int` | Number of completion choices to generate | - |
| `--seed` | `int` | Random seed | `None` |
34 changes: 34 additions & 0 deletions evalscope/perf/arguments.py
@@ -275,6 +275,19 @@ def total_count(self) -> int:
Accepts an int or a ``[min, max]`` list for uniform sampling per request.
"""

max_turn_tokens: Optional[List[int]] = None
"""Per-turn max_tokens override for multi-turn mode.

A list of integers specifying max_tokens for each turn index (0-based).
Example: ``[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`` for a
10-turn conversation where the first 9 turns are limited to 150 tokens
and the final turn allows 1000 tokens.

When set, this overrides ``--max-tokens`` on a per-turn basis in
``--multi-turn`` mode. If the list is shorter than the actual turn count,
the last element is reused for remaining turns.
"""

min_tokens: Optional[int] = None
"""Minimum number of tokens in the response."""

@@ -360,6 +373,21 @@ def _validate_max_tokens(cls, v):
raise ValueError(f'--max-tokens range values must be >= 0, got {v}')
return v

    @field_validator('max_turn_tokens', mode='before')
    @classmethod
    def _validate_max_turn_tokens(cls, v):
        if v is None:
            return v
        # Coerce single int to list for programmatic API support
        if isinstance(v, (int, float)):
            v = [int(v)]
        if isinstance(v, list):
            if not v:
                raise ValueError('--max-turn-tokens must contain at least one value')
            if any(x < 0 for x in v):
                raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
        return v
Comment on lines +378 to +389

Review comment from a contributor (marked `medium`):

There are two improvements for this validator:

  1. Inconsistency: The max_tokens validator (line 373) allows values >= 0, while this validator requires >= 1. It's better to allow 0 for consistency, as some APIs might use 0 to indicate a default or metadata-only request.
  2. Type Safety: If the user provides a single integer via code (not CLI), v will be an int, and subsequent logic in openai_api.py (which expects a list) will fail. It's safer to coerce single values into a list.
Suggested change
-    def _validate_max_turn_tokens(cls, v):
-        if v is None:
-            return v
-        if isinstance(v, list):
-            if not v:
-                raise ValueError('--max-turn-tokens must contain at least one value')
-            if any(x < 1 for x in v):
-                raise ValueError(f'--max-turn-tokens values must be >= 1, got {v}')
-        return v
+    @field_validator('max_turn_tokens', mode='before')
+    @classmethod
+    def _validate_max_turn_tokens(cls, v):
+        if v is None:
+            return v
+        if isinstance(v, (int, float)):
+            v = [int(v)]
+        if isinstance(v, list):
+            if not v:
+                raise ValueError('--max-turn-tokens must contain at least one value')
+            if any(x < 0 for x in v):
+                raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
+        return v
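
The checks in the suggested validator can be exercised without pydantic. The following is a minimal sketch that copies them into a plain function; `normalize_max_turn_tokens` is an illustrative name and not part of evalscope:

```python
from typing import List, Optional, Union


def normalize_max_turn_tokens(v: Union[None, int, float, List[int]]) -> Optional[List[int]]:
    """Plain-function mirror of the validator's checks (no pydantic)."""
    if v is None:
        return v
    # A bare number becomes a one-element list.
    if isinstance(v, (int, float)):
        v = [int(v)]
    if isinstance(v, list):
        if not v:
            raise ValueError('--max-turn-tokens must contain at least one value')
        if any(x < 0 for x in v):
            raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
    return v


print(normalize_max_turn_tokens(150))       # [150]
print(normalize_max_turn_tokens([150, 0]))  # [150, 0]
```

An empty list and any negative value raise `ValueError`; `0` is accepted, matching the `>= 0` rule of the `--max-tokens` validator.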


@field_validator('multi_turn_args', mode='before')
@classmethod
def _validate_multi_turn_args(cls, v):
@@ -642,6 +670,12 @@ def add_argument(parser: argparse.ArgumentParser):
parser.add_argument(
'--max-tokens', type=int, nargs='+', help='The maximum number of tokens that can be generated. '
'Accepts 1 value (fixed) or 2 values min max for uniform sampling per request.', default=2048)
parser.add_argument(
'--max-turn-tokens', type=int, nargs='+', default=None,
help='Per-turn max_tokens override for multi-turn mode. '
'Pass a list of integers, one per turn (0-based). '
'If shorter than the turn count, the last value is reused. '
'Example: --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000')
parser.add_argument(
'--min-tokens', type=int, help='The minimum number of tokens that can be generated', default=None)
parser.add_argument('--n-choices', type=int, help='How many completion choices to generate', default=None)
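
With `type=int` and `nargs='+'`, argparse collects the values into a list of integers. A standalone sketch, reduced to the two options relevant here (it is not the full evalscope parser):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--max-tokens', type=int, nargs='+', default=2048)
parser.add_argument('--max-turn-tokens', type=int, nargs='+', default=None)

args = parser.parse_args(['--max-turn-tokens', '150', '150', '1000'])
print(args.max_turn_tokens)  # [150, 150, 1000]
print(args.max_tokens)       # 2048 (the default stays a bare int, not a list)

args = parser.parse_args(['--max-tokens', '512'])
print(args.max_tokens)       # [512]
print(args.max_turn_tokens)  # None
```

Note that a value given on the command line is always a list, even when it is a single number, while an untouched default keeps whatever type was passed to `default`.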
2 changes: 1 addition & 1 deletion evalscope/perf/core/strategies/multi_turn.py
@@ -116,7 +116,7 @@ async def _worker(self, worker_id: int) -> None:
await asyncio.sleep(interval)

# Send the turn.
- request = self.api_plugin.build_request(list(context))
+ request = self.api_plugin.build_request(list(context), turn_index=turn_idx)
benchmark_data = await self.client.post(request)

# Inject multi-turn specific metadata.
4 changes: 3 additions & 1 deletion evalscope/perf/plugin/api/base.py
@@ -13,12 +13,14 @@ def __init__(self, param: Arguments) -> None:
self.model_path = param.tokenizer_path

@abstractmethod
- def build_request(self, messages: Union[List[Dict], str], param: Optional[Arguments] = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str], param: Optional[Arguments] = None, turn_index: Optional[int] = None) -> Dict:
"""Build a api request body.

Args:
messages (List[Dict]): The messages generated by dataset.
param (QueryParameters): The query parameters.
turn_index (int, optional): Current turn index in multi-turn mode.
Used for per-turn max_tokens override via ``--max-turn-tokens``.

Raises:
NotImplementedError: Not implemented.
4 changes: 2 additions & 2 deletions evalscope/perf/plugin/api/custom_api.py
@@ -1,6 +1,6 @@
import aiohttp
import json
- from typing import Any, AsyncGenerator, Dict, List, Tuple, Union
+ from typing import Any, AsyncGenerator, Dict, List, Tuple, Union, Optional

from evalscope.perf.arguments import Arguments
from evalscope.perf.multi_turn_args import _sample_int_or_range
@@ -37,7 +37,7 @@ def __init__(self, param: Arguments):
else:
self.tokenizer = None

- def build_request(self, messages: Union[List[Dict], str], param: Arguments = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:

Review comment from a contributor (marked `high`):

The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.

Suggested change
- def build_request(self, messages: Union[List[Dict], str], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:

"""Build a custom API request body based on the input messages and parameters.

This method formats the input messages into the expected request format
4 changes: 2 additions & 2 deletions evalscope/perf/plugin/api/dashscope_api.py
@@ -1,6 +1,6 @@
import json
import os
- from typing import Any, Dict, Iterator, List
+ from typing import Any, Dict, Iterator, List, Optional

from evalscope.perf.arguments import Arguments
from evalscope.perf.multi_turn_args import _sample_int_or_range
@@ -17,7 +17,7 @@ class DashScopeApiPlugin(ApiPluginBase):
def __init__(self, param: Arguments):
super().__init__(param)

- def build_request(self, messages: List[Dict], param: Arguments = None) -> Dict:
+ def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:

Review comment from a contributor (marked `high`):

The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Any is already imported, you can use Any as a fallback or add the missing import.

Suggested change
- def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+ def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Any = None) -> Dict:

"""Build the openai format request based on prompt, dataset

Args:
40 changes: 34 additions & 6 deletions evalscope/perf/plugin/api/openai_api.py
@@ -2,7 +2,7 @@
import math
import os
from collections import defaultdict
- from typing import Any, Dict, List, Tuple, Union
+ from typing import Any, Dict, List, Tuple, Union, Optional

from evalscope.perf.arguments import Arguments
from evalscope.perf.multi_turn_args import _sample_int_or_range
@@ -14,6 +14,23 @@

logger = get_logger()

_TOOL_CONTEXT_KEY = "__evalscope_tools__"


def _extract_tools(messages) -> Optional[List[Dict]]:
    """Extract tools definitions from messages if embedded by the dataset plugin.

    Scans the messages for the internal tools key. The first match is removed
    from its message (mutating it in place) to keep the payload clean before
    sending.
    """
    if not isinstance(messages, list):
        return None
    for msg in messages:
        if isinstance(msg, dict) and _TOOL_CONTEXT_KEY in msg:
            tools = msg.pop(_TOOL_CONTEXT_KEY)
            return tools
    return None


@register_api(['openai', 'local_vllm', 'local'])
class OpenaiPlugin(DefaultApiPlugin):
@@ -33,14 +50,16 @@ def __init__(self, param: Arguments):
else:
self.tokenizer = None

- def build_request(self, messages: Union[List[Dict], str, List[int], Dict], param: Arguments = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str, List[int], Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
Comment thread: Yunnglin marked this conversation as resolved.
"""Build the openai format request based on prompt, dataset

Args:
messages (List[Dict] | str | List[int] | Dict): The basic message to generator query.
When param.tokenize_prompt is True, this may also be a list of token IDs
(List[int]) produced by the random dataset plugin.
param (QueryParameters): The query parameters.
turn_index (int, optional): Current turn index in multi-turn mode.
Used for per-turn max_tokens override via ``--max-turn-tokens``.

Raises:
Exception: NotImplemented
@@ -50,12 +69,15 @@ def build_request(self, messages: Union[List[Dict], str, List[int], Dict], param
"""
param = param or self.param
try:
# Extract tools definitions embedded by the dataset plugin.
tools = _extract_tools(messages)

# --tokenize-prompt path: convert messages/text/token-IDs to a token-ID list
# and send as a /v1/completions request with `prompt=[int, ...]`.
if param.tokenize_prompt and not isinstance(messages, dict):
token_ids = self._messages_to_token_ids(messages, param)
query = {'prompt': token_ids}
- return self.__compose_query_from_parameter(query, param)
+ return self.__compose_query_from_parameter(query, param, turn_index, tools)

if param.query_template is not None:
if param.query_template.startswith('@'):
@@ -76,7 +98,7 @@ def build_request(self, messages: Union[List[Dict], str, List[int], Dict], param
query = {'prompt': messages}
else:
query = {'messages': messages}
- return self.__compose_query_from_parameter(query, param)
+ return self.__compose_query_from_parameter(query, param, turn_index, tools)
except Exception as e:
logger.exception(e)
return None
@@ -112,9 +134,15 @@ def _messages_to_token_ids(self, messages: Union[List[Dict], str, List[int]], pa
logger.warning(f'_messages_to_token_ids: unexpected messages type {type(messages)}, returning []')
return []

-    def __compose_query_from_parameter(self, payload: Dict, param: Arguments):
+    def __compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Optional[int] = None, tools: Optional[List[Dict]] = None):
         payload['model'] = param.model
-        if param.max_tokens is not None:
+        if tools:
+            payload['tools'] = tools
+        if param.max_turn_tokens is not None and turn_index is not None:
+            # Per-turn max_tokens override for multi-turn mode.
+            idx = min(turn_index, len(param.max_turn_tokens) - 1)
+            payload['max_tokens'] = param.max_turn_tokens[idx]
+        elif param.max_tokens is not None:
             payload['max_tokens'] = _sample_int_or_range(param.max_tokens)
         if param.min_tokens is not None:
             payload['min_tokens'] = param.min_tokens
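
The precedence in this hunk (per-turn override first, then the `--max-tokens` fallback) can be isolated in a small function. This is a sketch only: `resolve_max_tokens` is an illustrative name, and the fallback branch is a stand-in for `_sample_int_or_range`, whose behaviour is assumed from the `max_tokens` docstring (a fixed int, or uniform sampling from a `[min, max]` list):

```python
import random
from typing import List, Optional, Union


def resolve_max_tokens(
    max_turn_tokens: Optional[List[int]],
    max_tokens: Union[None, int, List[int]],
    turn_index: Optional[int],
) -> Optional[int]:
    """Pick the max_tokens value for one request."""
    if max_turn_tokens is not None and turn_index is not None:
        # Per-turn override; turns past the end reuse the last entry.
        idx = min(turn_index, len(max_turn_tokens) - 1)
        return max_turn_tokens[idx]
    if max_tokens is not None:
        # Assumed stand-in for evalscope's _sample_int_or_range.
        if isinstance(max_tokens, int):
            return max_tokens
        if len(max_tokens) == 1:
            return max_tokens[0]
        return random.randint(max_tokens[0], max_tokens[1])
    return None


print(resolve_max_tokens([150, 1000], 2048, turn_index=0))    # 150
print(resolve_max_tokens([150, 1000], 2048, turn_index=5))    # 1000
print(resolve_max_tokens([150, 1000], 2048, turn_index=None)) # 2048
```

The last call shows why the override only takes effect in multi-turn mode: without a `turn_index`, the request falls back to `--max-tokens`.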