
[WIP] feat(perf): add --max-turn-tokens for per-turn max_tokens override in multi-turn mode #1359

Open: pjgao wants to merge 4 commits into modelscope:main from pjgao:feat/multi-turn-tokens

Conversation

pjgao commented May 22, 2026


Design & Problem Statement

Problem

When evalscope perf --multi-turn is used to simulate agent tool-calling performance, the production model emits structured tool-call outputs of characteristic lengths. Simulating it with an open-source model (e.g., Qwen) introduces a fundamental mismatch:

  • The open-source model cannot produce tool-call outputs → output length differs from the real model
  • The existing --max-tokens is a global parameter → cannot set different values per turn
  • Without per-turn control, context growth behavior is inaccurate, making the stress test results non-representative

Concrete scenario: 10-turn conversation simulating tool calls:

  • Turns 1-9: model should output ~150 tokens (simulating tool calls)
  • Turn 10: model should output ~1000 tokens (final complete answer)
  • System prompt: ~4000 tokens, User question: ~20 tokens
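
To make the context-growth concern concrete, here is a rough sketch of how the prompt size evolves in this scenario. It assumes that every earlier output is fed back into the next request, that each follow-up user message costs about 5 tokens (an assumed figure, not from the scenario above), and it ignores chat-template overhead. The helper name prompt_tokens_per_turn is hypothetical.

```python
from typing import List


def prompt_tokens_per_turn(system_tokens: int, first_user_tokens: int,
                           followup_user_tokens: int, output_tokens: List[int]) -> List[int]:
    """Estimate the prompt size sent at each turn when earlier outputs are fed back."""
    prompts = []
    context = system_tokens + first_user_tokens  # prompt of the first turn
    for out in output_tokens:
        prompts.append(context)
        # The assistant reply and the next user message both join the context.
        context += out + followup_user_tokens
    return prompts


# Per-turn limits from the scenario: nine short tool-call turns, one long answer.
per_turn = prompt_tokens_per_turn(4000, 20, 5, [150] * 9 + [1000])
# A single global limit of 1000 applied to every turn, for comparison.
uniform = prompt_tokens_per_turn(4000, 20, 5, [1000] * 10)

print(per_turn[-1], uniform[-1])  # 5415 13065
```

Under these assumptions the final turn sees roughly 5.4k prompt tokens with per-turn limits versus about 13k with a global limit of 1000, which is the kind of gap that makes the global setting unrepresentative.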

Solution Design

New Parameter: --max-turn-tokens

--max-turn-tokens 150 150 150 150 150 150 150 150 150 1000
  • Accepts a list of integers, one per turn (0-based index)
  • When turn_index < list length: uses max_turn_tokens[turn_index]
  • When turn_index >= list length: reuses the last value (auto-extend)
  • Only effective in --multi-turn mode; ignored otherwise
  • Backward compatible: default is None, existing behavior unchanged
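
The lookup rule in the list above can be sketched as a small standalone function. This is only an illustration of the described semantics, not the PR's code: resolve_max_tokens and the default value 2048 are hypothetical.

```python
from typing import List, Optional


def resolve_max_tokens(max_turn_tokens: Optional[List[int]], turn_index: int, default: int) -> int:
    """Pick the limit for a 0-based turn, reusing the last value past the end of the list."""
    if not max_turn_tokens:
        return default  # option not set: keep the existing behavior
    last = len(max_turn_tokens) - 1
    return max_turn_tokens[min(turn_index, last)]


# Full list: one value per turn.
full = [resolve_max_tokens([150] * 9 + [1000], i, 2048) for i in range(10)]
# Short list: the last value (1000) is reused for turns 3 and 4.
short = [resolve_max_tokens([150, 150, 1000], i, 2048) for i in range(5)]

print(short)  # [150, 150, 1000, 1000, 1000]
```

Note that auto-extension repeats the last value, so a short final-answer limit cannot be "pushed to the end"; the full list is needed for the 9 x 150 + 1000 pattern.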

Architecture

MultiTurnStrategy._worker()
  └── for turn_idx, turn_delta in enumerate(conversation):
        └── api_plugin.build_request(context, turn_index=turn_idx)
              └── OpenaiPlugin.__compose_query_from_parameter(payload, param, turn_index)
                    └── if max_turn_tokens and turn_index is not None:
                          payload["max_tokens"] = max_turn_tokens[min(turn_index, len(max_turn_tokens) - 1)]
                        else:
                          payload["max_tokens"] = sample(max_tokens)  # existing behavior
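
Because turn_index is 0-based, the override branch has to tell turn 0 apart from "no turn index given". A minimal sketch of that branch follows; compose_max_tokens is a hypothetical stand-in for the relevant part of the query composition, and the integer fallback replaces the existing sampling logic purely for illustration.

```python
from typing import Any, Dict, List, Optional


def compose_max_tokens(payload: Dict[str, Any], max_turn_tokens: Optional[List[int]],
                       turn_index: Optional[int], fallback: int) -> Dict[str, Any]:
    """Set payload['max_tokens'] from the per-turn list when a turn index is given."""
    # Compare against None explicitly: turn_index == 0 is falsy but is a valid turn.
    if max_turn_tokens and turn_index is not None:
        last = len(max_turn_tokens) - 1
        payload['max_tokens'] = max_turn_tokens[min(turn_index, last)]
    else:
        payload['max_tokens'] = fallback  # stand-in for the existing behavior
    return payload


limits = [150, 1000]
turn0 = compose_max_tokens({}, limits, 0, fallback=2048)['max_tokens']
turn5 = compose_max_tokens({}, limits, 5, fallback=2048)['max_tokens']
single = compose_max_tokens({}, limits, None, fallback=2048)['max_tokens']

print(turn0, turn5, single)  # 150 1000 2048
```

A plain truthiness test on turn_index would send turn 0 down the fallback branch, so the first value of the list would never be used.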

Usage Example

Simulate 10-turn tool-calling performance:

  1. Prepare JSONL (one 10-turn conversation per line):

    [{"role":"system","content":"<4000 token prompt>"},{"role":"user","content":"<20 token question>"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Continue"},{"role":"assistant","content":"x"},{"role":"user","content":"Please give the complete final answer"}]

    The assistant messages only define the conversation structure; they are replaced by real model outputs at runtime.

  2. Run benchmark:

    evalscope perf \
      --model YOUR_MODEL \
      --url OPENAI_API_COMPAT_URL \
      --api openai \
      --dataset custom_multi_turn \
      --dataset-path tool_call_sim.jsonl \
      --multi-turn \
      --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000 \
      --number 50 \
      --parallel 10 \
      --extra-args '{"ignore_eos": true}'

| Turn | max_tokens | Behavior |
| --- | --- | --- |
| 1 | 150 | Initial tool call simulation |
| 2-9 | 150 | Intermediate tool calls |
| 10 | 1000 | Final complete answer |
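
Writing the dataset line by hand is error-prone, so a small generator may help. The sketch below assumes that each user message starts one turn (an assumption about how custom_multi_turn counts turns) and that assistant messages are placeholders; build_conversation and its default strings are hypothetical.

```python
import json
from typing import Dict, List


def build_conversation(system_prompt: str, question: str, n_turns: int,
                       followup: str = 'Continue',
                       final_request: str = 'Please give the complete final answer') -> List[Dict[str, str]]:
    """Build one conversation with n_turns user messages and placeholder assistant replies."""
    messages = [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': question},  # turn 1
    ]
    for turn in range(1, n_turns):
        # Placeholder reply; the real model output replaces it at runtime.
        messages.append({'role': 'assistant', 'content': 'x'})
        is_last = turn == n_turns - 1
        messages.append({'role': 'user', 'content': final_request if is_last else followup})
    return messages


conversation = build_conversation('<4000 token prompt>', '<20 token question>', n_turns=10)
# One conversation per line; write `line` to the JSONL file passed as --dataset-path.
line = json.dumps(conversation, ensure_ascii=False)
print(sum(m['role'] == 'user' for m in conversation))  # 10
```

Under that assumption a 10-turn conversation has ten user messages and nine placeholder assistant messages, so the tenth value of --max-turn-tokens applies to the final request.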

Changed Files

Core Logic (4 files)

| File | Change |
| --- | --- |
| evalscope/perf/arguments.py | Add max_turn_tokens Pydantic field with validation; add --max-turn-tokens CLI arg |
| evalscope/perf/core/strategies/multi_turn.py | Pass turn_index to api_plugin.build_request() |
| evalscope/perf/plugin/api/openai_api.py | __compose_query_from_parameter: honor max_turn_tokens when turn_index is provided |
| evalscope/perf/plugin/api/openai_responses_api.py | _compose_query_from_parameter: same for Responses API |

API Signature Compatibility (5 files)

| File | Change |
| --- | --- |
| evalscope/perf/plugin/api/base.py | Add turn_index: Optional[int] = None to abstract build_request |
| evalscope/perf/plugin/api/dashscope_api.py | Accept turn_index (passthrough, no logic change) |
| evalscope/perf/plugin/api/custom_api.py | Accept turn_index (passthrough) |
| evalscope/perf/plugin/api/openai_embedding_api.py | Accept turn_index (passthrough) |
| evalscope/perf/plugin/api/openai_rerank_api.py | Accept turn_index (passthrough) |

Documentation (4 files)

| File | Change |
| --- | --- |
| docs/zh/user_guides/stress_test/multi_turn.md | New section: 逐轮控制输出长度 (Per-turn Output Length Control) + example |
| docs/en/user_guides/stress_test/multi_turn.md | New section: Per-turn Output Length Control + example |
| docs/zh/user_guides/stress_test/parameters.md | New parameter row for --max-turn-tokens |
| docs/en/user_guides/stress_test/parameters.md | New parameter row for --max-turn-tokens |

Related Issue

pjgao added 2 commits May 22, 2026 09:04
… multi-turn mode

Support specifying different max_tokens per turn in multi-turn stress test
mode. This is essential for simulating agent tool-calling scenarios where
early turns produce short outputs (e.g., 150 tokens) and the final turn
produces a longer response (e.g., 1000 tokens).

Usage:
  evalscope perf --multi-turn --max-turn-tokens 150 150 1000

If the list is shorter than the actual turn count, the last value is
reused for remaining turns.

Changes:
- Arguments: new max_turn_tokens field with validation and CLI arg
- MultiTurnStrategy: pass turn_index to api_plugin.build_request()
- ApiPluginBase: add turn_index param to build_request signature
- OpenaiPlugin: honor max_turn_tokens when composing request
- OpenAIResponsesPlugin: same for Responses API
- All other API plugins: accept turn_index for signature compat

Update Chinese and English docs for multi_turn and parameters pages:
- multi_turn.md: new section explaining per-turn output length control
  with a concrete tool-call simulation example (150/1000 tokens)
- parameters.md: new row for --max-turn-tokens parameter

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the --max-turn-tokens parameter to enable per-turn control of output lengths during multi-turn performance benchmarks, facilitating more accurate simulations of agent tool-calling behaviors. The changes span documentation, argument validation, and API plugin updates. Review feedback correctly identifies several critical issues: the Optional type hint is used in multiple files without being imported, which will cause runtime errors; the validator for max_turn_tokens is inconsistent with existing token validation logic and lacks type safety for single-integer inputs; and the documentation examples incorrectly describe the logic for extending the token list when it is shorter than the total number of turns.

Comment thread evalscope/perf/plugin/api/openai_api.py
Comment thread evalscope/perf/plugin/api/openai_api.py Outdated
self.tokenizer = None

- def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:

high

The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.

Suggested change
- def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:
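
The failure mode described in this comment can be reproduced in isolation. The sketch below compiles a simplified, hypothetical copy of the signature with and without the import. It relies on the interpreter evaluating annotations when the function is defined, which holds for Python versions before 3.14 unless `from __future__ import annotations` is used; on 3.14 and later, annotations are evaluated lazily and the definition itself succeeds.

```python
import sys

# Simplified, hypothetical copy of the signature under review.
SOURCE = (
    'def build_request(messages, param=None, turn_index: Optional[int] = None):\n'
    '    return {}\n'
)


def defines_cleanly(source: str) -> bool:
    """Return True if executing the definition does not raise NameError."""
    try:
        exec(source, {})  # fresh namespace, so nothing is imported implicitly
        return True
    except NameError:
        return False


without_import = defines_cleanly(SOURCE)
with_import = defines_cleanly('from typing import Optional\n' + SOURCE)
eager_annotations = sys.version_info < (3, 14)

print(without_import, with_import)
```

Adding the import works on every version, which is why it is the safer fix compared with relying on interpreter behavior.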

output.real_cached_tokens = cached

- def _compose_query_from_parameter(self, payload: Dict, param: Arguments) -> Dict:
+ def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Optional[int] = None) -> Dict:

high

The Optional type hint is used here but it is not imported from the typing module. Please use Union[int, None] or add the missing import.

Suggested change
- def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Optional[int] = None) -> Dict:
+ def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Union[int, None] = None) -> Dict:

super().__init__(param)

- def build_request(self, messages: List[Dict], param: Arguments = None) -> Dict:
+ def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:

high

The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Any is already imported, you can use Any as a fallback or add the missing import.

Suggested change
- def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+ def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Any = None) -> Dict:

self.tokenizer = None

- def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:

high

The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.

Suggested change
- def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:

self.tokenizer = None

- def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:

high

The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.

Suggested change
- def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+ def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:

Comment on lines +378 to +386
    def _validate_max_turn_tokens(cls, v):
        if v is None:
            return v
        if isinstance(v, list):
            if not v:
                raise ValueError('--max-turn-tokens must contain at least one value')
            if any(x < 1 for x in v):
                raise ValueError(f'--max-turn-tokens values must be >= 1, got {v}')
        return v

medium

There are two improvements for this validator:

  1. Inconsistency: The max_tokens validator (line 373) allows values >= 0, while this validator requires >= 1. It's better to allow 0 for consistency, as some APIs might use 0 to indicate a default or metadata-only request.
  2. Type Safety: If the user provides a single integer via code (not CLI), v will be an int, and subsequent logic in openai_api.py (which expects a list) will fail. It's safer to coerce single values into a list.
Suggested change
-     def _validate_max_turn_tokens(cls, v):
-         if v is None:
-             return v
-         if isinstance(v, list):
-             if not v:
-                 raise ValueError('--max-turn-tokens must contain at least one value')
-             if any(x < 1 for x in v):
-                 raise ValueError(f'--max-turn-tokens values must be >= 1, got {v}')
-         return v
+     @field_validator('max_turn_tokens', mode='before')
+     @classmethod
+     def _validate_max_turn_tokens(cls, v):
+         if v is None:
+             return v
+         if isinstance(v, (int, float)):
+             v = [int(v)]
+         if isinstance(v, list):
+             if not v:
+                 raise ValueError('--max-turn-tokens must contain at least one value')
+             if any(x < 0 for x in v):
+                 raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
+         return v
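
The suggested validation logic can be exercised without Pydantic by mirroring it in a plain function. This is a sketch for checking the behavior only; normalize_max_turn_tokens is a hypothetical name, and the real change lives in the field validator shown above.

```python
from typing import List, Optional, Union


def normalize_max_turn_tokens(v: Union[None, int, float, List[int]]) -> Optional[List[int]]:
    """Standalone mirror of the suggested validator logic."""
    if v is None:
        return None
    if isinstance(v, (int, float)):
        v = [int(v)]  # a single value set in code becomes a one-element list
    if isinstance(v, list):
        if not v:
            raise ValueError('--max-turn-tokens must contain at least one value')
        if any(x < 0 for x in v):
            raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
    return v


print(normalize_max_turn_tokens(150))       # [150]
print(normalize_max_turn_tokens([150, 0]))  # [150, 0]
```

An empty list or a negative entry raises ValueError, while 0 is accepted to stay consistent with the max_tokens validator.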

| Turns 2-9 | 150 | Simulate intermediate tool calls |
| Turn 10 | 1000 | Final complete answer |

> **Tip**: The list is automatically extended by reusing the last value. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`.

medium

The example in this tip is incorrect based on the implementation logic. The code reuses the last element of the list for all subsequent turns. Therefore, --max-turn-tokens 150 1000 would result in [150, 1000, 1000, ...] (where turn 0 is 150 and all others are 1000), not the sequence shown in the tip. To achieve the sequence in the tip, the user must provide the full list.

Suggested change
- > **Tip**: The list is automatically extended by reusing the last value. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`.
+ > **Tip**: The list is automatically extended by reusing the last value for all subsequent turns. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]` (the first turn is 150, and all subsequent turns are 1000).

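| Turns 2-9 | 150 | Simulate intermediate tool calls |
| Turn 10 | 1000 | Final complete answer |

> **Tip**: When the list is too short, the last value is reused automatically. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`.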

medium

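The example in this tip does not match the implementation. The code reuses the last element of the list for all subsequent turns, so --max-turn-tokens 150 1000 in a 10-turn conversation yields [150, 1000, 1000, ...] (the first turn is 150 and all later turns are 1000), not the sequence shown in the tip. To get the sequence in the tip, the user must provide the full list.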

Suggested change
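- > **Tip**: When the list is too short, the last value is reused automatically. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`.
+ > **Tip**: When the list is too short, the last value is reused automatically for all subsequent turns. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]` (the first turn is 150, and all subsequent turns are 1000).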

…examples

- Add missing Optional import to openai_api, openai_responses_api,
  dashscope_api, custom_api, openai_embedding_api, openai_rerank_api
- Improve max_turn_tokens validator: allow >= 0 for consistency with
  max_tokens, coerce single int to list for programmatic API
- Fix doc tip examples: correct the list extension behavior description
  in both en and zh multi_turn.md

pjgao commented May 22, 2026

Author

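Thanks for the review. All items have been fixed (commit ):

✅ Fixed - 8 missing Optional imports

Added the Optional import to openai_api.py, openai_responses_api.py, dashscope_api.py, custom_api.py, openai_embedding_api.py, and openai_rerank_api.py. base.py already has the import, so no change was needed there.

✅ Fixed - Validator improvements

  1. Changed >= 1 to >= 0, consistent with max_tokens
  2. Added coercion of a single int to a list, to support programmatic API calls

✅ Fixed - Doc tip examples

Corrected the actual expansion of --max-turn-tokens 150 1000 in the Chinese and English docs:

  • The original text wrongly described it as [150, 150, ..., 1000]
  • It now reads [150, 1000, 1000, ..., 1000], which correctly reflects the reuse-last-value logic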

- custom_multi_turn: extract tools from JSON data and embed into first turn
- openai_api: extract embedded tools and inject into request payload
- openai_responses_api: same tools support for Responses API
- Supports JSON format: {"messages": [...], "tools": [...]}
- Backward compatible: works without tools definitions

pjgao changed the title from "feat(perf): add --max-turn-tokens for per-turn max_tokens override in multi-turn mode" to "[WIP] feat(perf): add --max-turn-tokens for per-turn max_tokens override in multi-turn mode" on May 22, 2026

Yunnglin left a comment

Collaborator

Thanks for the updates! A few remaining items before this is ready to merge:

1. Bug: _iter_jsonl signature mismatch

_iter_jsonl is defined without parameters:

def _iter_jsonl(self) -> Iterator[Tuple[List[Message], Optional[List[Dict]]]]:

but called with path in two places:

yield from self._iter_jsonl(path)

This will raise TypeError at runtime. Since the method already reads path from self.query_parameters.dataset_path internally, the call sites should drop the argument:

yield from self._iter_jsonl()
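
The mismatch can be reproduced with a small stand-in class. JsonlSource and its in-memory lines are hypothetical; only the shape of the bug (a zero-argument method called with an argument) follows the description above.

```python
from typing import Iterator, List


class JsonlSource:
    """Hypothetical stand-in for a dataset plugin that reads its own path."""

    def __init__(self, dataset_path: str, lines: List[str]) -> None:
        self.dataset_path = dataset_path
        self._lines = lines  # stands in for the file contents at dataset_path

    def _iter_jsonl(self) -> Iterator[str]:
        # Uses self.dataset_path internally, so it takes no extra argument.
        yield from self._lines

    def broken(self, path: str) -> Iterator[str]:
        # Passing path raises TypeError: one positional argument expected, two given.
        yield from self._iter_jsonl(path)

    def fixed(self) -> Iterator[str]:
        yield from self._iter_jsonl()


source = JsonlSource('tool_call_sim.jsonl', ['{"a": 1}', '{"a": 2}'])
print(list(source.fixed()))

try:
    list(source.broken(source.dataset_path))
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)  # TypeError
```

Because the callers are generators, the TypeError only surfaces once iteration starts, so the bug can pass an import-time smoke test and still fail on the first dataset read.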

2. Pre-commit CI still failing (3 hooks)

  • isort: import order in custom_api.py, openai_api.py, openai_embedding_api.py, openai_rerank_api.py, openai_responses_api.py
  • yapf: formatting issues
  • double-quote-string-fixer: double quotes in openai_responses_api.py, openai_api.py, custom.py

You can run pre-commit run --all-files locally to auto-fix most of these.

3. Title

Don't forget to remove [WIP] from the title when ready.


Development

Successfully merging this pull request may close these issues.

Feature Request: support per-turn max_tokens configuration and custom system prompts in multi-turn stress testing
