[WIP] feat(perf): add --max-turn-tokens for per-turn max_tokens override in multi-turn mode by pjgao · Pull Request #1359 · modelscope/evalscope

pjgao · 2026-05-22T01:12:14Z

Design & Problem Statement

Problem

When using evalscope perf --multi-turn to simulate agent tool-calling performance, the real model produces structured tool-call outputs of specific lengths. Simulating this with an open-source model (e.g., Qwen) has a fundamental mismatch:

The open-source model cannot produce the real tool-call outputs, so its output lengths differ from the real model's
The existing --max-tokens is a global parameter, so it cannot be set to a different value per turn
Without per-turn control, context growth across turns is inaccurate, which makes the stress-test results unrepresentative

Concrete scenario: 10-turn conversation simulating tool calls:

Turns 1-9: model should output ~150 tokens (simulating tool calls)
Turn 10: model should output ~1000 tokens (final complete answer)
System prompt: ~4000 tokens, User question: ~20 tokens
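The sketch below gives a rough picture of the context-growth effect. It estimates how many input tokens each turn sees in this scenario. It assumes every follow-up user message is about 5 tokens (a made-up figure) and ignores chat-template overhead, so treat the numbers as illustrative only. `prompt_tokens_per_turn` is a hypothetical helper, not part of evalscope.

```python
from typing import List


def prompt_tokens_per_turn(system_tokens: int, question_tokens: int,
                           followup_tokens: int, outputs: List[int]) -> List[int]:
    """Approximate input length for each turn of a multi-turn conversation.

    outputs[i] is the number of tokens generated on turn i. Every turn after
    the first is assumed to add one follow-up user message of `followup_tokens`
    tokens; chat-template overhead is ignored.
    """
    sizes = []
    context = system_tokens + question_tokens  # turn 0 sees system prompt + question
    for i, out in enumerate(outputs):
        if i > 0:
            context += followup_tokens  # new user message for this turn
        sizes.append(context)
        context += out  # this turn's reply is part of the next turn's context
    return sizes


outputs = [150] * 9 + [1000]
sizes = prompt_tokens_per_turn(4000, 20, 5, outputs)
print(sizes[0], sizes[-1])  # 4020 5415
```

Now compare a global --max-tokens 1000 with ignore_eos. Every turn would then add 1000 tokens instead of 150, and the same estimate gives 13,065 input tokens on the last turn. That is roughly 2.4 times the intended load.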

Solution Design

New Parameter: `--max-turn-tokens`

--max-turn-tokens 150 150 150 150 150 150 150 150 150 1000

Accepts a list of integers, one per turn (0-based index)
When turn_index < list length: uses max_turn_tokens[turn_index]
When turn_index >= list length: reuses the last value (auto-extend)
Only effective in --multi-turn mode; ignored otherwise
Backward compatible: default is None, existing behavior unchanged
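Here is a minimal sketch of the selection rules above as a standalone function. `resolve_turn_max_tokens` is a hypothetical name, not a function from the PR. The `default` argument stands in for whatever the existing --max-tokens path would produce.

```python
from typing import List, Optional


def resolve_turn_max_tokens(max_turn_tokens: Optional[List[int]],
                            turn_index: Optional[int],
                            default: Optional[int] = None) -> Optional[int]:
    """Pick max_tokens for one turn.

    - No list, or no turn index: fall back to `default` (existing behavior).
    - turn_index < len(list): use list[turn_index].
    - turn_index >= len(list): reuse the last value.
    """
    if not max_turn_tokens or turn_index is None:
        return default
    return max_turn_tokens[min(turn_index, len(max_turn_tokens) - 1)]
```

The function checks `turn_index is not None` rather than testing `turn_index` for truthiness. A check such as `if max_turn_tokens and turn_index:` would skip the per-turn value on the first turn, because its index is 0.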

Architecture

MultiTurnStrategy._worker()
  └── for turn_idx, turn_delta in enumerate(conversation):
        └── api_plugin.build_request(context, turn_index=turn_idx)
              └── OpenaiPlugin.__compose_query_from_parameter(payload, param, turn_index)
                    └── if max_turn_tokens and turn_index is not None:
                          payload["max_tokens"] = max_turn_tokens[min(turn_index, len(max_turn_tokens) - 1)]
                        else:
                          payload["max_tokens"] = sample(max_tokens)  # existing behavior
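The call chain above can be sketched end to end as follows. Everything here is hypothetical: `PerfArgs`, `SketchPlugin` and `run_conversation` are illustrative names, not evalscope classes. The model reply is a placeholder, and the fallback uses a fixed `max_tokens` instead of the existing sampling.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PerfArgs:
    # Hypothetical stand-in for evalscope's Arguments with only the fields used here.
    max_tokens: Optional[int] = None
    max_turn_tokens: Optional[List[int]] = None


class SketchPlugin:
    """Hypothetical API plugin following the call chain in the diagram."""

    def build_request(self, messages: List[Dict], param: PerfArgs,
                      turn_index: Optional[int] = None) -> Dict:
        payload = {'messages': list(messages)}  # copy so later turns don't mutate this request
        return self._compose_query_from_parameter(payload, param, turn_index)

    def _compose_query_from_parameter(self, payload: Dict, param: PerfArgs,
                                      turn_index: Optional[int] = None) -> Dict:
        per_turn = param.max_turn_tokens
        if per_turn and turn_index is not None:  # `is not None` keeps turn 0 on the per-turn path
            payload['max_tokens'] = per_turn[min(turn_index, len(per_turn) - 1)]
        elif param.max_tokens is not None:
            payload['max_tokens'] = param.max_tokens  # simplified fallback: fixed, not sampled
        return payload


def run_conversation(user_turns: List[str], plugin: SketchPlugin, param: PerfArgs) -> List[Dict]:
    """Build one request per user turn, passing turn_index down to the plugin."""
    context: List[Dict] = []
    requests = []
    for turn_idx, user_msg in enumerate(user_turns):
        context.append({'role': 'user', 'content': user_msg})
        requests.append(plugin.build_request(context, param, turn_index=turn_idx))
        # A real worker sends the request and appends the model's actual reply.
        context.append({'role': 'assistant', 'content': '<reply>'})
    return requests
```

Consider a ten-turn conversation run with `PerfArgs(max_tokens=512, max_turn_tokens=[150] * 9 + [1000])`. The first nine requests carry `max_tokens` 150 and the tenth carries 1000. Without `max_turn_tokens`, every request falls back to 512.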

Usage Example

Simulate 10-turn tool-calling performance:

Prepare JSONL (one 10-turn conversation per line):

[{"role":"system","content":"<4000 token prompt>"},{"role":"user","content":"<20 token question>"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Please give the complete final answer"}]

The assistant messages only define the conversation structure; at runtime they are replaced by the model's actual outputs.
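Hand-writing long conversation lines is error-prone, because a missing assistant/user pair silently changes the turn count. A small generator helps. This is a sketch that assumes each dataset line is a JSON array of OpenAI-style messages in the layout above. `build_tool_call_conversation` and `write_jsonl` are hypothetical helpers.

```python
import json
from typing import Dict, List


def build_tool_call_conversation(system_prompt: str, question: str, num_turns: int,
                                 followup: str, final_request: str) -> List[Dict]:
    """Build one conversation with `num_turns` user turns.

    Assistant messages are placeholders ("x") that only mark the structure.
    """
    if num_turns < 1:
        raise ValueError('num_turns must be >= 1')
    messages = [{'role': 'system', 'content': system_prompt},
                {'role': 'user', 'content': question}]
    for turn in range(1, num_turns):
        messages.append({'role': 'assistant', 'content': 'x'})
        content = final_request if turn == num_turns - 1 else followup
        messages.append({'role': 'user', 'content': content})
    return messages


def write_jsonl(path: str, conversations: List[List[Dict]]) -> None:
    """Write one conversation per line."""
    with open(path, 'w', encoding='utf-8') as f:
        for conv in conversations:
            # ensure_ascii=False keeps non-ASCII prompts (e.g. Chinese) readable
            f.write(json.dumps(conv, ensure_ascii=False) + '\n')


conversation = build_tool_call_conversation(
    '<4000 token prompt>', '<20 token question>', 10,
    followup='Continue', final_request='Please give the complete final answer')
```

A 10-turn conversation yields 20 messages: 1 system message, 10 user messages and 9 assistant placeholders. Calling `write_jsonl('tool_call_sim.jsonl', [conversation])` writes it as a single line.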

Run benchmark:

evalscope perf \
  --model YOUR_MODEL \
  --url OPENAI_API_COMPAT_URL \
  --api openai \
  --dataset custom_multi_turn \
  --dataset-path tool_call_sim.jsonl \
  --multi-turn \
  --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000 \
  --number 50 \
  --parallel 10 \
  --extra-args '{"ignore_eos": true}'

Turn	`max_tokens`	Behavior
1	150	Initial tool call simulation
2-9	150	Intermediate tool calls
10	1000	Final complete answer

Changed Files

Core Logic (4 files)

File	Change
`evalscope/perf/arguments.py`	+ `max_turn_tokens` Pydantic field with validation; + `--max-turn-tokens` CLI arg
`evalscope/perf/core/strategies/multi_turn.py`	Pass `turn_index` to `api_plugin.build_request()`
`evalscope/perf/plugin/api/openai_api.py`	`__compose_query_from_parameter`: honor `max_turn_tokens` when `turn_index` is provided
`evalscope/perf/plugin/api/openai_responses_api.py`	`_compose_query_from_parameter`: same for Responses API
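The CLI side of the `arguments.py` change could look like the argparse sketch below. It assumes the flag is declared with `nargs='+'` and `type=int`, which matches the space-separated usage shown earlier. The exact declaration in the PR may differ.

```python
import argparse

parser = argparse.ArgumentParser(prog='evalscope-perf-sketch')
# nargs='+' collects one or more space-separated integers into a list;
# default=None keeps the existing behavior when the flag is absent.
parser.add_argument('--max-turn-tokens', type=int, nargs='+', default=None,
                    help='Per-turn max_tokens for --multi-turn mode; the last value is reused')
parser.add_argument('--multi-turn', action='store_true')

args = parser.parse_args(['--multi-turn', '--max-turn-tokens', '150', '150', '1000'])
print(args.max_turn_tokens)  # [150, 150, 1000]
```

argparse maps `--max-turn-tokens` to the attribute `max_turn_tokens`, which lines up with the Pydantic field name.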

API Signature Compatibility (5 files)

File	Change
`evalscope/perf/plugin/api/base.py`	Add `turn_index: Optional[int] = None` to abstract `build_request`
`evalscope/perf/plugin/api/dashscope_api.py`	Accept `turn_index` (passthrough, no logic change)
`evalscope/perf/plugin/api/custom_api.py`	Accept `turn_index` (passthrough)
`evalscope/perf/plugin/api/openai_embedding_api.py`	Accept `turn_index` (passthrough)
`evalscope/perf/plugin/api/openai_rerank_api.py`	Accept `turn_index` (passthrough)

Documentation (4 files)

File	Change
`docs/zh/user_guides/stress_test/multi_turn.md`	New section: 逐轮控制输出长度 (Per-turn Output Length Control) + example
`docs/en/user_guides/stress_test/multi_turn.md`	New section: Per-turn Output Length Control + example
`docs/zh/user_guides/stress_test/parameters.md`	New parameter row for `--max-turn-tokens`
`docs/en/user_guides/stress_test/parameters.md`	New parameter row for `--max-turn-tokens`

Related Issue

Fixes Feature Request: support per-turn max_tokens configuration and custom system prompts in multi-turn stress testing #1358

… multi-turn mode

Support specifying different max_tokens per turn in multi-turn stress test mode. This is essential for simulating agent tool-calling scenarios where early turns produce short outputs (e.g., 150 tokens) and the final turn produces a longer response (e.g., 1000 tokens).

Usage: evalscope perf --multi-turn --max-turn-tokens 150 150 1000

If the list is shorter than the actual turn count, the last value is reused for the remaining turns.

Changes:
- Arguments: new max_turn_tokens field with validation and CLI arg
- MultiTurnStrategy: pass turn_index to api_plugin.build_request()
- ApiPluginBase: add turn_index param to build_request signature
- OpenaiPlugin: honor max_turn_tokens when composing request
- OpenAIResponsesPlugin: same for Responses API
- All other API plugins: accept turn_index for signature compat

Update Chinese and English docs for multi_turn and parameters pages:
- multi_turn.md: new section explaining per-turn output length control with a concrete tool-call simulation example (150/1000 tokens)
- parameters.md: new row for --max-turn-tokens parameter

gemini-code-assist

Code Review

This pull request introduces the --max-turn-tokens parameter to enable per-turn control of output lengths during multi-turn performance benchmarks, facilitating more accurate simulations of agent tool-calling behaviors. The changes span documentation, argument validation, and API plugin updates. Review feedback correctly identifies several critical issues: the Optional type hint is used in multiple files without being imported, which will cause runtime errors; the validator for max_turn_tokens is inconsistent with existing token validation logic and lacks type safety for single-integer inputs; and the documentation examples incorrectly describe the logic for extending the token list when it is shorter than the total number of turns.

gemini-code-assist · 2026-05-22T01:14:35Z

            self.tokenizer = None

-    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:


The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.

Suggested change

-    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:
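The sketch below makes this failure mode concrete. It executes a tiny module that annotates with `Optional` but imports only `Union`, mirroring the file under review.

- **Python 3.13 and earlier:** annotations are evaluated eagerly, so the `def` statement itself raises `NameError` when the module is imported.
- **Python 3.14:** annotation evaluation is deferred (PEP 649), so the error surfaces only when something inspects the annotations, such as `typing.get_type_hints`.

```python
import textwrap
import typing

# Module source that uses Optional in an annotation without importing it.
SOURCE = textwrap.dedent('''
    from typing import Dict, List, Union

    def build_request(messages: Union[List[Dict], str, Dict],
                      turn_index: Optional[int] = None) -> Dict:
        return {}
''')

namespace = {}
error = None
try:
    exec(SOURCE, namespace)  # Python <= 3.13: annotations are evaluated here
    typing.get_type_hints(namespace['build_request'])  # Python 3.14+: evaluated lazily here
except NameError as exc:
    error = exc

print(type(error).__name__, error)
```

Either fix from the review removes the error: import `Optional`, or spell the hint as `Union[int, None]`.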

gemini-code-assist · 2026-05-22T01:14:36Z

                output.real_cached_tokens = cached

-    def _compose_query_from_parameter(self, payload: Dict, param: Arguments) -> Dict:
+    def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Optional[int] = None) -> Dict:


The Optional type hint is used here but it is not imported from the typing module. Please use Union[int, None] or add the missing import.

Suggested change

-    def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Optional[int] = None) -> Dict:
+    def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Union[int, None] = None) -> Dict:

gemini-code-assist · 2026-05-22T01:14:36Z

        super().__init__(param)

-    def build_request(self, messages: List[Dict], param: Arguments = None) -> Dict:
+    def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:


The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Any is already imported, you can use Any as a fallback or add the missing import.

Suggested change

-    def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+    def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Any = None) -> Dict:

gemini-code-assist · 2026-05-22T01:14:36Z

            self.tokenizer = None

-    def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:


The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.

Suggested change

-    def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:

gemini-code-assist · 2026-05-22T01:14:36Z

            self.tokenizer = None

-    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:


The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.

Suggested change

-    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:

gemini-code-assist · 2026-05-22T01:14:36Z

+    def _validate_max_turn_tokens(cls, v):
+        if v is None:
+            return v
+        if isinstance(v, list):
+            if not v:
+                raise ValueError('--max-turn-tokens must contain at least one value')
+            if any(x < 1 for x in v):
+                raise ValueError(f'--max-turn-tokens values must be >= 1, got {v}')
+        return v


There are two improvements for this validator:

Inconsistency: The max_tokens validator (line 373) allows values >= 0, while this validator requires >= 1. It's better to allow 0 for consistency, as some APIs might use 0 to indicate a default or metadata-only request.

Type Safety: If the user provides a single integer via code (not CLI), v will be an int, and subsequent logic in openai_api.py (which expects a list) will fail. It's safer to coerce single values into a list.

Suggested change

-    def _validate_max_turn_tokens(cls, v):
-        if v is None:
-            return v
-        if isinstance(v, list):
-            if not v:
-                raise ValueError('--max-turn-tokens must contain at least one value')
-            if any(x < 1 for x in v):
-                raise ValueError(f'--max-turn-tokens values must be >= 1, got {v}')
-        return v
+    @field_validator('max_turn_tokens', mode='before')
+    @classmethod
+    def _validate_max_turn_tokens(cls, v):
+        if v is None:
+            return v
+        if isinstance(v, (int, float)):
+            v = [int(v)]
+        if isinstance(v, list):
+            if not v:
+                raise ValueError('--max-turn-tokens must contain at least one value')
+            if any(x < 0 for x in v):
+                raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
+        return v

gemini-code-assist · 2026-05-22T01:14:36Z

+| Turns 2-9 | 150 | Simulate intermediate tool calls |
+| Turn 10 | 1000 | Final complete answer |
+
+> **Tip**: The list is automatically extended by reusing the last value. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`.


The example in this tip is incorrect based on the implementation logic. The code reuses the last element of the list for all subsequent turns. Therefore, --max-turn-tokens 150 1000 would result in [150, 1000, 1000, ...] (where turn 0 is 150 and all others are 1000), not the sequence shown in the tip. To achieve the sequence in the tip, the user must provide the full list.

Suggested change

-> **Tip**: The list is automatically extended by reusing the last value. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`.
+> **Tip**: The list is automatically extended by reusing the last value for all subsequent turns. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]` (the first turn is 150, and all subsequent turns are 1000).
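The corrected tip can be checked with a few lines that apply the same `min(turn_index, len - 1)` indexing described in the Architecture section. `expand_turn_tokens` is a hypothetical helper written only for this check.

```python
from typing import List


def expand_turn_tokens(max_turn_tokens: List[int], num_turns: int) -> List[int]:
    """Per-turn max_tokens for a conversation of `num_turns` turns (last value reused)."""
    last = len(max_turn_tokens) - 1
    return [max_turn_tokens[min(i, last)] for i in range(num_turns)]


print(expand_turn_tokens([150, 1000], 10))
# [150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
```

To get nine short turns followed by one long turn, pass the full list: `expand_turn_tokens([150] * 9 + [1000], 10)` returns exactly that list.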

gemini-code-assist · 2026-05-22T01:14:36Z

+| 第 2-9 轮 | 150 | 模拟中间轮工具调用 |
+| 第 10 轮 | 1000 | 最终完整回答 |
+
+> **提示**：列表长度不足时自动复用最后一个值。例如 `--max-turn-tokens 150 1000` 在 10 轮对话中效果为 `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`。


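The example in this tip does not match the code. The code reuses the last element of the list for all subsequent turns. So in a 10-turn conversation, --max-turn-tokens 150 1000 yields [150, 1000, 1000, ...]: the first turn gets 150 and every later turn gets 1000, not the sequence shown in the tip. To get the sequence in the tip, the user must provide the full list.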

Suggested change

-> **提示**：列表长度不足时自动复用最后一个值。例如 `--max-turn-tokens 150 1000` 在 10 轮对话中效果为 `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`。
+> **提示**：列表长度不足时自动复用最后一个值给后续所有轮次。例如 `--max-turn-tokens 150 1000` 在 10 轮对话中效果为 `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]`（第一轮为 150，后续轮次均为 1000）。

…examples
- Add missing Optional import to openai_api, openai_responses_api, dashscope_api, custom_api, openai_embedding_api, openai_rerank_api
- Improve max_turn_tokens validator: allow >= 0 for consistency with max_tokens, coerce single int to list for programmatic API
- Fix doc tip examples: correct the list extension behavior description in both en and zh multi_turn.md

pjgao · 2026-05-22T02:08:57Z

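Thanks for the review; all items have been fixed (commit ):

✅ Fixed: 8 missing Optional imports

Added the Optional import to openai_api.py, openai_responses_api.py, dashscope_api.py, custom_api.py, openai_embedding_api.py and openai_rerank_api.py. base.py already had the import, so it needed no change.

✅ Fixed: validator improvements

Changed >= 1 to >= 0, consistent with max_tokens
Added automatic conversion of a single int to a list, so programmatic API calls work

✅ Fixed: doc tip example

Corrected the expansion of --max-turn-tokens 150 1000 in both the Chinese and English docs:

Previously described (incorrectly) as [150, 150, ..., 1000]
Now corrected to [150, 1000, 1000, ..., 1000], which reflects the reuse-last-value logic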

- custom_multi_turn: extract tools from JSON data and embed into first turn
- openai_api: extract embedded tools and inject into request payload
- openai_responses_api: same tools support for Responses API
- Supports JSON format: {"messages": [...], "tools": [...]}
- Backward compatible: works without tools definitions
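A sketch of how the two JSONL shapes named in this commit could be told apart. It assumes the object form uses exactly the keys `messages` and `tools`, as listed. `parse_conversation_line` and `attach_tools` are hypothetical helpers. Note that the commit itself embeds the tools in the first turn rather than returning them separately as this sketch does.

```python
import json
from typing import Dict, List, Optional, Tuple


def parse_conversation_line(line: str) -> Tuple[List[Dict], Optional[List[Dict]]]:
    """Parse one JSONL line in either supported shape.

    - A bare JSON array of messages (the original format), or
    - An object {"messages": [...], "tools": [...]} carrying tool definitions.
    """
    data = json.loads(line)
    if isinstance(data, list):
        return data, None
    if isinstance(data, dict) and 'messages' in data:
        return data['messages'], data.get('tools')
    raise ValueError('expected a list of messages or an object with a "messages" key')


def attach_tools(payload: Dict, tools: Optional[List[Dict]]) -> Dict:
    """Add tool definitions to an OpenAI-style request payload only when present."""
    if tools:
        payload['tools'] = tools
    return payload
```

Lines without tools pass through unchanged, which keeps the original dataset format working.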

Yunnglin

Thanks for the updates! A few remaining items before this is ready to merge:

1. Bug: _iter_jsonl signature mismatch

_iter_jsonl is defined without parameters:

def _iter_jsonl(self) -> Iterator[Tuple[List[Message], Optional[List[Dict]]]]:

but called with path in two places:

yield from self._iter_jsonl(path)

This will raise TypeError at runtime. Since the method already reads path from self.query_parameters.dataset_path internally, the call sites should drop the argument:

yield from self._iter_jsonl()
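The mismatch is easy to reproduce. In this sketch, `SketchDataset` is a hypothetical stand-in; the real plugin reads the path from `self.query_parameters.dataset_path`. The generator takes only `self`, so the call site must not pass a path. Calling it with one anyway raises `TypeError` immediately, before any line is read.

```python
import json
import os
import tempfile
from typing import Dict, Iterator, List, Optional, Tuple


class SketchDataset:
    """Hypothetical dataset plugin: the path lives on the instance, not in the call."""

    def __init__(self, dataset_path: str):
        self.dataset_path = dataset_path

    def _iter_jsonl(self) -> Iterator[Tuple[List[Dict], Optional[List[Dict]]]]:
        # Reads the path from the instance, so callers pass no arguments.
        with open(self.dataset_path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                data = json.loads(line)
                if isinstance(data, dict):
                    yield data['messages'], data.get('tools')
                else:
                    yield data, None

    def build_messages(self) -> Iterator[Tuple[List[Dict], Optional[List[Dict]]]]:
        yield from self._iter_jsonl()  # fixed call site: no path argument


call_error = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.jsonl')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(json.dumps([{'role': 'user', 'content': 'hi'}]) + '\n')
        f.write(json.dumps({'messages': [{'role': 'user', 'content': 'hi'}],
                            'tools': [{'type': 'function', 'function': {'name': 'search'}}]}) + '\n')
    ds = SketchDataset(path)
    rows = list(ds.build_messages())
    try:
        ds._iter_jsonl(path)  # the buggy call shape: one argument too many
    except TypeError as exc:
        call_error = exc

print([tools is not None for _, tools in rows])  # [False, True]
```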

2. Pre-commit CI still failing (3 hooks)

isort: import order in custom_api.py, openai_api.py, openai_embedding_api.py, openai_rerank_api.py, openai_responses_api.py
yapf: formatting issues
double-quote-string-fixer: double quotes in openai_responses_api.py, openai_api.py, custom.py

You can run pre-commit run --all-files locally to auto-fix most of these.

3. Title

Don't forget to remove [WIP] from the title when ready.

pjgao added 2 commits May 22, 2026 09:04

gemini-code-assist Bot reviewed May 22, 2026


pjgao changed the title ~~feat(perf): add --max-turn-tokens for per-turn max_tokens override in multi-turn mode~~ [WIP] feat(perf): add --max-turn-tokens for per-turn max_tokens override in multi-turn mode May 22, 2026

Yunnglin reviewed May 25, 2026


pjgao wants to merge 4 commits into modelscope:main from pjgao:feat/multi-turn-tokens
