modelscope · pjgao · May 22, 2026
diff --git a/docs/en/user_guides/stress_test/multi_turn.md b/docs/en/user_guides/stress_test/multi_turn.md
@@ -19,6 +19,7 @@ The multi-turn conversation benchmark allows you to test a model service in real
 | `--min-turns` | `int` | Minimum number of user turns per conversation; used by `random_multi_turn` only | `1` |
 | `--max-turns` | `int` | Maximum number of user turns per conversation; **required** for `random_multi_turn`; optional for ShareGPT / `custom_multi_turn` datasets to truncate long conversations; for `swe_smith` live construction, the per-conversation turn count is sampled from `[min_turns, max_turns]` | `None` |
 | `--dataset-offset` | `int` | Skip the first N conversations in the dataset; useful for sharded testing or avoiding cache hits | `0` |
+| `--max-turn-tokens` | `list[int]` | Per-turn `max_tokens` override; accepts a list of integers specifying the maximum output tokens for each turn by index (0-based). When the list is shorter than the actual turn count, the last value is reused. Only effective in `--multi-turn` mode | `None` |
 
 ### `multi_turn_args` (swe_smith-specific parameters)
 
@@ -266,6 +267,44 @@ Runtime context structure (when sending turn 2):
 
 > **Note**: The `assistant` messages in the dataset are used only to identify conversation structure and are **never** sent directly to the model. At runtime, workers always append the model's actual output to the context to ensure accurate history.
 
+### Per-turn Output Length Control (`--max-turn-tokens`)
+
+When simulating agent tool-calling workloads, an open-source model may not produce structured tool-call outputs the way the real model does, so its per-turn output lengths differ. `--max-turn-tokens` caps the output length of each turn individually, which approximates the context growth of the real model.
+
+**Usage example**: A 10-turn conversation where the first 9 turns simulate tool calls (150 tokens each) and the final turn produces a complete answer (1000 tokens).
+
+First, prepare a JSONL data file (one 10-turn conversation per line, with a system prompt of ~4000 tokens):
+
+```json
+[{"role": "system", "content": "<4000 token system prompt>"}, {"role": "user", "content": "Analyze this code"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Provide the final answer"}]
+```
+
+> **Note**: The `assistant` messages only define the conversation structure and are replaced by the model's real outputs at runtime.
+
+Then run the benchmark:
+
+```bash
+evalscope perf \
+  --model YOUR_MODEL \
+  --url OPENAI_API_COMPAT_URL \
+  --api openai \
+  --dataset custom_multi_turn \
+  --dataset-path /path/to/tool_call_sim.jsonl \
+  --multi-turn \
+  --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000 \
+  --number 50 \
+  --parallel 10 \
+  --extra-args '{"ignore_eos": true}'
+```
+
+| Turn | `max_tokens` | Simulated behavior |
+|------|-------------|--------------------|
+| Turn 1 | 150 | Simulate initial tool call |
+| Turns 2-9 | 150 | Simulate intermediate tool calls |
+| Turn 10 | 1000 | Final complete answer |
+
+> **Tip**: The list is automatically extended by reusing the last value for all subsequent turns. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]` (the first turn is 150, and all subsequent turns are 1000).
+
 **Usage example**: You have conversation data already in OpenAI messages format and want to benchmark directly without any format conversion.
 
 First, prepare the JSONL data file (one conversation per line):

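A conversation file like the 10-turn example in this hunk can be generated instead of written by hand. The following is a minimal sketch and not part of EvalScope: the `build_conversation` helper is hypothetical, and it assumes only the OpenAI messages layout shown in the JSONL example (a system message, alternating user and placeholder assistant messages, ending on a user message).

```python
import json
from typing import Dict, List


def build_conversation(num_turns: int, system_prompt: str) -> List[Dict[str, str]]:
    """Build one message list with `num_turns` user turns (hypothetical helper)."""
    messages = [{"role": "system", "content": system_prompt}]
    for turn in range(num_turns):
        is_last = turn == num_turns - 1
        if turn == 0:
            text = "Analyze this code"
        elif is_last:
            text = "Provide the final answer"
        else:
            text = "Continue"
        messages.append({"role": "user", "content": text})
        if not is_last:
            # Placeholder only: it marks the turn boundary and is replaced
            # by the model's real output at runtime.
            messages.append({"role": "assistant", "content": "x"})
    return messages


conversation = build_conversation(10, "<4000 token system prompt>")
# One JSON document per line is what makes the file JSONL.
jsonl_line = json.dumps(conversation, ensure_ascii=False)
```

Writing one such `jsonl_line` per conversation, each followed by a newline, produces a file that could be passed to `--dataset-path`.
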
diff --git a/docs/en/user_guides/stress_test/parameters.md b/docs/en/user_guides/stress_test/parameters.md
@@ -143,6 +143,7 @@ Must be used with `--multi-turn`. See the [Multi-turn Benchmark Guide](./multi_t
 | `--frequency-penalty` | `float` | frequency_penalty value | - |
 | `--logprobs` | `bool` | Whether to return logarithmic probabilities | - |
 | `--max-tokens` | `int` | Maximum number of tokens that can be generated | - |
+| `--max-turn-tokens` | `list[int]` | **Multi-turn mode only**: Per-turn override of `max_tokens`<br>• Accepts a list of integers specifying max tokens per turn (0-based index)<br>• Last value is reused if the list is shorter than the actual turn count<br>• Ignored unless `--multi-turn` is set<br>• Example: `--max-turn-tokens 150 150 150 1000` | `None` |
 | `--min-tokens` | `int` | Minimum number of tokens to generate<br>Note: Not all model services support this parameter<br>For `vLLM>=0.8.1`, you need to additionally set<br>`--extra-args '{"ignore_eos": true}'` | - |
 | `--n-choices` | `int` | Number of completion choices to generate | - |
 | `--seed` | `int` | Random seed | `None` |

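How the command line turns `--max-turn-tokens 150 150 150 1000` into a list can be reproduced with plain `argparse`. This sketch reuses the `--max-turn-tokens` definition added in `evalscope/perf/arguments.py` (`type=int`, `nargs='+'`, `default=None`); modeling `--multi-turn` as a `store_true` flag is an assumption based on its use without a value in the docs example.

```python
import argparse

parser = argparse.ArgumentParser()
# Assumption: --multi-turn is a boolean switch.
parser.add_argument('--multi-turn', action='store_true')
# Same shape as the definition in this PR: one or more ints, no default.
parser.add_argument('--max-turn-tokens', type=int, nargs='+', default=None)

args = parser.parse_args(['--multi-turn', '--max-turn-tokens', '150', '150', '150', '1000'])
defaults = parser.parse_args([])

print(args.max_turn_tokens)      # list of ints, one per turn
print(defaults.max_turn_tokens)  # None when the flag is omitted
```

Because `nargs='+'` requires at least one value, passing `--max-turn-tokens` with no integers is rejected by the parser before any validator runs.
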
diff --git a/docs/zh/user_guides/stress_test/multi_turn.md b/docs/zh/user_guides/stress_test/multi_turn.md
@@ -19,6 +19,7 @@
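 | `--min-turns` | `int` | Minimum number of user turns per conversation; used by `random_multi_turn` only | `1` |
 | `--max-turns` | `int` | Maximum number of user turns per conversation; **required** for `random_multi_turn`; optional for ShareGPT / `custom_multi_turn` and similar datasets to truncate long conversations; for `swe_smith` live construction, the per-conversation turn count is sampled randomly from `[min_turns, max_turns]` | `None` |
 | `--dataset-offset` | `int` | Skip the first N conversations in the dataset; useful for sharded testing or avoiding cache hits | `0` |
+| `--max-turn-tokens` | `list[int]` | Per-turn `max_tokens` override; accepts a list of integers specifying the maximum output tokens for each turn by turn index (0-based). When the list is shorter than the actual turn count, the last value is reused. Only effective in `--multi-turn` mode | `None` |
 
 ### `multi_turn_args` (`swe_smith`-specific parameters)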
 
@@ -266,6 +267,44 @@ evalscope perf \
 
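 > **Note**: The `assistant` messages in the dataset are used only to identify conversation structure and are **never** sent directly to the model. At runtime, workers always append the model's actual output to the context to keep the history accurate.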
 
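+### Per-turn Output Length Control (`--max-turn-tokens`)
+
+When simulating agent tool-calling workloads, an open-source model may not produce structured tool-call outputs the way the real model does, so its per-turn output lengths differ. `--max-turn-tokens` caps the output length of each turn individually, which approximates the context growth of the real model.
+
+**Usage example**: A 10-turn conversation where the first 9 turns simulate tool calls (150 tokens each) and the final turn outputs a complete answer (1000 tokens).
+
+First, prepare a JSONL data file (one 10-turn conversation per line, with a system prompt of about 4000 tokens):
+
+```json
+[{"role": "system", "content": "<4000 token system prompt>"}, {"role": "user", "content": "Help me analyze this code"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Continue"}, {"role": "assistant", "content": "x"}, {"role": "user", "content": "Please give a complete final answer"}]
+```
+
+> **Note**: The `assistant` messages only define the conversation structure and are replaced by the model's real outputs at runtime.
+
+Then run the benchmark: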
+
+```bash
+evalscope perf \
+  --model YOUR_MODEL \
+  --url OPENAI_API_COMPAT_URL \
+  --api openai \
+  --dataset custom_multi_turn \
+  --dataset-path /path/to/tool_call_sim.jsonl \
+  --multi-turn \
+  --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000 \
+  --number 50 \
+  --parallel 10 \
+  --extra-args '{"ignore_eos": true}'
+```
+
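+| Turn | `max_tokens` | Simulated behavior |
+|------|-------------|--------------------|
+| Turn 1 | 150 | Simulate the first tool call |
+| Turns 2-9 | 150 | Simulate intermediate tool calls |
+| Turn 10 | 1000 | Final complete answer |
+
+> **Tip**: When the list is too short, the last value is automatically reused for all remaining turns. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]` (the first turn is 150, and all later turns are 1000).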
+
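 **Usage example**: Applies when you already have conversation data in OpenAI messages format and want to use it directly for multi-turn benchmarking without any format conversion.
 
 First, prepare the JSONL data file (one conversation per line):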

diff --git a/docs/zh/user_guides/stress_test/parameters.md b/docs/zh/user_guides/stress_test/parameters.md
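@@ -144,6 +144,7 @@ For details on the SLA auto-tuning feature, see the [Auto-tuning Guide](./sla_auto_tune.md).
 | `--frequency-penalty` | `float` | frequency_penalty value | - |
 | `--logprobs` | `bool` | Whether to return log probabilities | - |
 | `--max-tokens` | `int` or `int int` | Maximum number of tokens that can be generated<br>• Single integer: fixed value, e.g. `--max-tokens 2048`<br>• Two integers: `min max`; each request samples uniformly at random from this range, e.g. `--max-tokens 512 2048` | `2048` |
+| `--max-turn-tokens` | `int list` | **Multi-turn mode only**: Per-turn override of `max_tokens`<br>• Accepts a list of integers specifying the maximum output tokens for each turn by turn index (0-based)<br>• The last value is reused when the list is shorter than the actual turn count<br>• Only effective in `--multi-turn` mode; ignored otherwise<br>• Example: `--max-turn-tokens 150 150 150 1000` | `None` |
 | `--min-tokens` | `int` | Minimum number of tokens to generate<br>Note: Not all model services support this parameter<br>For `vLLM>=0.8.1`, you need to additionally set<br>`--extra-args '{"ignore_eos": true}'` | - |
 | `--n-choices` | `int` | Number of completion choices to generate | - |
 | `--seed` | `int` | Random seed | `None` |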

diff --git a/evalscope/perf/arguments.py b/evalscope/perf/arguments.py
@@ -275,6 +275,19 @@ def total_count(self) -> int:
     Accepts an int or a ``[min, max]`` list for uniform sampling per request.
     """
 
+    max_turn_tokens: Optional[List[int]] = None
+    """Per-turn max_tokens override for multi-turn mode.
+
+    A list of integers specifying max_tokens for each turn index (0-based).
+    Example: ``[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`` for a
+    10-turn conversation where the first 9 turns are limited to 150 tokens
+    and the final turn allows 1000 tokens.
+
+    When set, this overrides ``--max-tokens`` on a per-turn basis in
+    ``--multi-turn`` mode.  If the list is shorter than the actual turn count,
+    the last element is reused for remaining turns.
+    """
+
     min_tokens: Optional[int] = None
     """Minimum number of tokens in the response."""
 
@@ -360,6 +373,21 @@ def _validate_max_tokens(cls, v):
                 raise ValueError(f'--max-tokens range values must be >= 0, got {v}')
         return v
 
+    @field_validator('max_turn_tokens', mode='before')
+    @classmethod
+    def _validate_max_turn_tokens(cls, v):
+        if v is None:
+            return v
+        # Coerce single int to list for programmatic API support
+        if isinstance(v, (int, float)):
+            v = [int(v)]
+        if isinstance(v, list):
+            if not v:
+                raise ValueError('--max-turn-tokens must contain at least one value')
+            if any(x < 0 for x in v):
+                raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
+        return v
+
     @field_validator('multi_turn_args', mode='before')
     @classmethod
     def _validate_multi_turn_args(cls, v):
@@ -642,6 +670,12 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument(
         '--max-tokens', type=int, nargs='+', help='The maximum number of tokens that can be generated. '
         'Accepts 1 value (fixed) or 2 values min max for uniform sampling per request.', default=2048)
+    parser.add_argument(
+        '--max-turn-tokens', type=int, nargs='+', default=None,
+        help='Per-turn max_tokens override for multi-turn mode. '
+        'Pass a list of integers, one per turn (0-based). '
+        'If shorter than the turn count, the last value is reused. '
+        'Example: --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000')
     parser.add_argument(
         '--min-tokens', type=int, help='The minimum number of tokens that can be generated', default=None)
     parser.add_argument('--n-choices', type=int, help='How many completion choices to generate', default=None)

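The `_validate_max_turn_tokens` validator added above can be exercised outside of the `Arguments` model. The sketch below restates its logic as a plain function so it runs without Pydantic; the name `normalize_max_turn_tokens` is hypothetical, and the checks mirror the `+` lines of the hunk (scalar coerced to a one-element list, empty list rejected, negative values rejected).

```python
from typing import List, Optional, Union


def normalize_max_turn_tokens(v: Union[None, int, float, List[int]]) -> Optional[List[int]]:
    """Standalone restatement of the validator logic (hypothetical helper)."""
    if v is None:
        return None
    # A single number is coerced to a one-element list.
    if isinstance(v, (int, float)):
        v = [int(v)]
    if isinstance(v, list):
        if not v:
            raise ValueError('--max-turn-tokens must contain at least one value')
        if any(x < 0 for x in v):
            raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
    return v


print(normalize_max_turn_tokens(150))          # [150]
print(normalize_max_turn_tokens([150, 1000]))  # [150, 1000]
print(normalize_max_turn_tokens([0]))          # [0]
```

Note that the check is `x < 0`, so a value of `0` passes validation and would be sent as `max_tokens: 0`; how a model service treats that value is outside the scope of this diff.
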
diff --git a/evalscope/perf/core/strategies/multi_turn.py b/evalscope/perf/core/strategies/multi_turn.py
@@ -116,7 +116,7 @@ async def _worker(self, worker_id: int) -> None:
                     await asyncio.sleep(interval)
 
                 # Send the turn.
-                request = self.api_plugin.build_request(list(context))
+                request = self.api_plugin.build_request(list(context), turn_index=turn_idx)
                 benchmark_data = await self.client.post(request)
 
                 # Inject multi-turn specific metadata.

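One interaction worth checking: the worker passes `list(context)`, which is a shallow copy, while `_extract_tools` in `openai_api.py` removes the tools key with `msg.pop(...)`. The sketch below shows the general Python behavior with stand-in data; the `pop_tools_in_place` helper and the sample messages are hypothetical, and whether this affects real runs depends on how the dataset plugin embeds tools and how the worker maintains `context`, neither of which is shown in this diff.

```python
from typing import Dict, List, Optional

TOOL_KEY = '__evalscope_tools__'  # same string as _TOOL_CONTEXT_KEY in the diff


def pop_tools_in_place(messages: List[Dict]) -> Optional[List[Dict]]:
    """Pop-based extraction: mutates the message dict that holds the key."""
    for msg in messages:
        if isinstance(msg, dict) and TOOL_KEY in msg:
            return msg.pop(TOOL_KEY)
    return None


def make_context() -> List[Dict]:
    # Stand-in conversation context with tools embedded in the first message.
    return [
        {'role': 'system', 'content': 'You are an agent.', TOOL_KEY: [{'type': 'function'}]},
        {'role': 'user', 'content': 'Analyze this code'},
    ]


# Shallow copy: a new list, but the same dict objects as the context.
context = make_context()
tools_turn_0 = pop_tools_in_place(list(context))
tools_turn_1 = pop_tools_in_place(list(context))  # key was already removed

# Per-message copy: the context keeps the key for later turns.
safe_context = make_context()
safe_turn_0 = pop_tools_in_place([dict(m) for m in safe_context])
safe_turn_1 = pop_tools_in_place([dict(m) for m in safe_context])
```

With the shallow copy, the second call finds no tools because the first call removed the key from the shared dict; copying each message first leaves the original context untouched.
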
diff --git a/evalscope/perf/plugin/api/base.py b/evalscope/perf/plugin/api/base.py
@@ -13,12 +13,14 @@ def __init__(self, param: Arguments) -> None:
         self.model_path = param.tokenizer_path
 
     @abstractmethod
-    def build_request(self, messages: Union[List[Dict], str], param: Optional[Arguments] = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str], param: Optional[Arguments] = None, turn_index: Optional[int] = None) -> Dict:
         """Build a api request body.
 
         Args:
             messages (List[Dict]): The messages generated by dataset.
             param (QueryParameters): The query parameters.
+            turn_index (int, optional): Current turn index in multi-turn mode.
+                Used for per-turn max_tokens override via ``--max-turn-tokens``.
 
         Raises:
             NotImplementedError: Not implemented.

diff --git a/evalscope/perf/plugin/api/custom_api.py b/evalscope/perf/plugin/api/custom_api.py
@@ -1,6 +1,6 @@
 import aiohttp
 import json
-from typing import Any, AsyncGenerator, Dict, List, Tuple, Union
+from typing import Any, AsyncGenerator, Dict, List, Optional, Tuple, Union
 
 from evalscope.perf.arguments import Arguments
 from evalscope.perf.multi_turn_args import _sample_int_or_range
@@ -37,7 +37,7 @@ def __init__(self, param: Arguments):
         else:
             self.tokenizer = None
 
-    def build_request(self, messages: Union[List[Dict], str], param: Arguments = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
         """Build a custom API request body based on the input messages and parameters.
 
         This method formats the input messages into the expected request format

diff --git a/evalscope/perf/plugin/api/dashscope_api.py b/evalscope/perf/plugin/api/dashscope_api.py
@@ -1,6 +1,6 @@
 import json
 import os
-from typing import Any, Dict, Iterator, List
+from typing import Any, Dict, Iterator, List, Optional
 
 from evalscope.perf.arguments import Arguments
 from evalscope.perf.multi_turn_args import _sample_int_or_range
@@ -17,7 +17,7 @@ class DashScopeApiPlugin(ApiPluginBase):
     def __init__(self, param: Arguments):
         super().__init__(param)
 
-    def build_request(self, messages: List[Dict], param: Arguments = None) -> Dict:
+    def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
         """Build the openai format request based on prompt, dataset
 
         Args:

diff --git a/evalscope/perf/plugin/api/openai_api.py b/evalscope/perf/plugin/api/openai_api.py
@@ -2,7 +2,7 @@
 import math
 import os
 from collections import defaultdict
-from typing import Any, Dict, List, Tuple, Union
+from typing import Any, Dict, List, Optional, Tuple, Union
 
 from evalscope.perf.arguments import Arguments
 from evalscope.perf.multi_turn_args import _sample_int_or_range
@@ -14,6 +14,23 @@
 
 logger = get_logger()
 
+_TOOL_CONTEXT_KEY = '__evalscope_tools__'
+
+
+def _extract_tools(messages) -> Optional[List[Dict]]:
+    """Extract tool definitions from messages if embedded by the dataset plugin.
+
+    Scans the messages in order for the internal tools key. The first message
+    that carries it has the key removed in place (the dict is mutated) to keep
+    the payload clean before sending, and its value is returned.
+    """
+    if not isinstance(messages, list):
+        return None
+    for msg in messages:
+        if isinstance(msg, dict) and _TOOL_CONTEXT_KEY in msg:
+            tools = msg.pop(_TOOL_CONTEXT_KEY)
+            return tools
+    return None
+
 
 @register_api(['openai', 'local_vllm', 'local'])
 class OpenaiPlugin(DefaultApiPlugin):
@@ -33,14 +50,16 @@ def __init__(self, param: Arguments):
         else:
             self.tokenizer = None
 
-    def build_request(self, messages: Union[List[Dict], str, List[int], Dict], param: Arguments = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, List[int], Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
         """Build the openai format request based on prompt, dataset
 
         Args:
             messages (List[Dict] | str | List[int] | Dict): The basic message to generator query.
                 When param.tokenize_prompt is True, this may also be a list of token IDs
                 (List[int]) produced by the random dataset plugin.
             param (QueryParameters): The query parameters.
+            turn_index (int, optional): Current turn index in multi-turn mode.
+                Used for per-turn max_tokens override via ``--max-turn-tokens``.
 
         Raises:
             Exception: NotImplemented
@@ -50,12 +69,15 @@ def build_request(self, messages: Union[List[Dict], str, List[int], Dict], param
         """
         param = param or self.param
         try:
+            # Extract tools definitions embedded by the dataset plugin.
+            tools = _extract_tools(messages)
+
             # --tokenize-prompt path: convert messages/text/token-IDs to a token-ID list
             # and send as a /v1/completions request with `prompt=[int, ...]`.
             if param.tokenize_prompt and not isinstance(messages, dict):
                 token_ids = self._messages_to_token_ids(messages, param)
                 query = {'prompt': token_ids}
-                return self.__compose_query_from_parameter(query, param)
+                return self.__compose_query_from_parameter(query, param, turn_index, tools)
 
             if param.query_template is not None:
                 if param.query_template.startswith('@'):
@@ -76,7 +98,7 @@ def build_request(self, messages: Union[List[Dict], str, List[int], Dict], param
                 query = {'prompt': messages}
             else:
                 query = {'messages': messages}
-            return self.__compose_query_from_parameter(query, param)
+            return self.__compose_query_from_parameter(query, param, turn_index, tools)
         except Exception as e:
             logger.exception(e)
             return None
@@ -112,9 +134,15 @@ def _messages_to_token_ids(self, messages: Union[List[Dict], str, List[int]], pa
         logger.warning(f'_messages_to_token_ids: unexpected messages type {type(messages)}, returning []')
         return []
 
-    def __compose_query_from_parameter(self, payload: Dict, param: Arguments):
+    def __compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Optional[int] = None, tools: Optional[List[Dict]] = None):
         payload['model'] = param.model
-        if param.max_tokens is not None:
+        if tools:
+            payload['tools'] = tools
+        if param.max_turn_tokens is not None and turn_index is not None:
+            # Per-turn max_tokens override for multi-turn mode.
+            idx = min(turn_index, len(param.max_turn_tokens) - 1)
+            payload['max_tokens'] = param.max_turn_tokens[idx]
+        elif param.max_tokens is not None:
             payload['max_tokens'] = _sample_int_or_range(param.max_tokens)
         if param.min_tokens is not None:
             payload['min_tokens'] = param.min_tokens
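
The precedence implemented in `__compose_query_from_parameter` can be sketched as a standalone function. This is a simplified restatement, not the plugin code: `resolve_max_tokens` is a hypothetical name, and `max_tokens` is treated as a fixed integer, whereas the real code passes it through `_sample_int_or_range`.

```python
from typing import List, Optional


def resolve_max_tokens(
    max_turn_tokens: Optional[List[int]],
    max_tokens: Optional[int],
    turn_index: Optional[int],
) -> Optional[int]:
    """Pick the max_tokens value for one request (hypothetical helper)."""
    if max_turn_tokens is not None and turn_index is not None:
        # Clamp the index so turns past the end of the list reuse the last value.
        idx = min(turn_index, len(max_turn_tokens) - 1)
        return max_turn_tokens[idx]
    # No per-turn list, or not in multi-turn mode: fall back to --max-tokens.
    return max_tokens


per_turn = [resolve_max_tokens([150, 1000], 2048, i) for i in range(10)]
print(per_turn)
print(resolve_max_tokens(None, 2048, 3))
print(resolve_max_tokens([150, 1000], 2048, None))
```

The per-turn list only takes effect when a `turn_index` is supplied, which in this PR happens only in the multi-turn worker; other callers leave it as `None` and keep the `--max-tokens` behavior.
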