[WIP] feat(perf): add --max-turn-tokens for per-turn max_tokens override in multi-turn mode#1359
pjgao wants to merge 4 commits into
… multi-turn mode

Support specifying different max_tokens per turn in multi-turn stress test mode. This is essential for simulating agent tool-calling scenarios where early turns produce short outputs (e.g., 150 tokens) and the final turn produces a longer response (e.g., 1000 tokens).

Usage: evalscope perf --multi-turn --max-turn-tokens 150 150 1000

If the list is shorter than the actual turn count, the last value is reused for the remaining turns.

Changes:
- Arguments: new max_turn_tokens field with validation and CLI arg
- MultiTurnStrategy: pass turn_index to api_plugin.build_request()
- ApiPluginBase: add turn_index param to build_request signature
- OpenaiPlugin: honor max_turn_tokens when composing request
- OpenAIResponsesPlugin: same for Responses API
- All other API plugins: accept turn_index for signature compat
Update Chinese and English docs for multi_turn and parameters pages:
- multi_turn.md: new section explaining per-turn output length control with a concrete tool-call simulation example (150/1000 tokens)
- parameters.md: new row for --max-turn-tokens parameter
Code Review
This pull request introduces the --max-turn-tokens parameter to enable per-turn control of output lengths during multi-turn performance benchmarks, facilitating more accurate simulations of agent tool-calling behaviors. The changes span documentation, argument validation, and API plugin updates. Review feedback correctly identifies several critical issues: the Optional type hint is used in multiple files without being imported, which will cause runtime errors; the validator for max_turn_tokens is inconsistent with existing token validation logic and lacks type safety for single-integer inputs; and the documentation examples incorrectly describe the logic for extending the token list when it is shorter than the total number of turns.
         self.tokenizer = None

-    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.
Suggested change:
-    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:
         output.real_cached_tokens = cached

-    def _compose_query_from_parameter(self, payload: Dict, param: Arguments) -> Dict:
+    def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Optional[int] = None) -> Dict:
The Optional type hint is used here but it is not imported from the typing module. Please use Union[int, None] or add the missing import.
Suggested change:
-    def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Optional[int] = None) -> Dict:
+    def _compose_query_from_parameter(self, payload: Dict, param: Arguments, turn_index: Union[int, None] = None) -> Dict:
         super().__init__(param)

-    def build_request(self, messages: List[Dict], param: Arguments = None) -> Dict:
+    def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Any is already imported, you can use Any as a fallback or add the missing import.
Suggested change:
-    def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+    def build_request(self, messages: List[Dict], param: Arguments = None, turn_index: Any = None) -> Dict:
         self.tokenizer = None

-    def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.
Suggested change:
-    def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, List[str]], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:
         self.tokenizer = None

-    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
The Optional type hint is used here but it is not imported from the typing module in this file. This will cause a NameError at runtime. Since Union is already imported, you can use Union[int, None] as a replacement.
Suggested change:
-    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Optional[int] = None) -> Dict:
+    def build_request(self, messages: Union[List[Dict], str, Dict], param: Arguments = None, turn_index: Union[int, None] = None) -> Dict:
    def _validate_max_turn_tokens(cls, v):
        if v is None:
            return v
        if isinstance(v, list):
            if not v:
                raise ValueError('--max-turn-tokens must contain at least one value')
            if any(x < 1 for x in v):
                raise ValueError(f'--max-turn-tokens values must be >= 1, got {v}')
        return v
There are two improvements for this validator:
- Inconsistency: The `max_tokens` validator (line 373) allows values `>= 0`, while this validator requires `>= 1`. It's better to allow `0` for consistency, as some APIs might use `0` to indicate a default or metadata-only request.
- Type Safety: If the user provides a single integer via code (not CLI), `v` will be an `int`, and subsequent logic in `openai_api.py` (which expects a list) will fail. It's safer to coerce single values into a list.
Suggested change:
-    def _validate_max_turn_tokens(cls, v):
-        if v is None:
-            return v
-        if isinstance(v, list):
-            if not v:
-                raise ValueError('--max-turn-tokens must contain at least one value')
-            if any(x < 1 for x in v):
-                raise ValueError(f'--max-turn-tokens values must be >= 1, got {v}')
-        return v
+    @field_validator('max_turn_tokens', mode='before')
+    @classmethod
+    def _validate_max_turn_tokens(cls, v):
+        if v is None:
+            return v
+        if isinstance(v, (int, float)):
+            v = [int(v)]
+        if isinstance(v, list):
+            if not v:
+                raise ValueError('--max-turn-tokens must contain at least one value')
+            if any(x < 0 for x in v):
+                raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
+        return v
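To make the effect of the two suggested changes easier to see, here is a minimal sketch of the same checks as a plain function, outside Pydantic. The name `normalize_max_turn_tokens` is hypothetical and exists only for illustration; the real validator is the `_validate_max_turn_tokens` class method shown above.

```python
from typing import List, Optional, Union


def normalize_max_turn_tokens(v: Union[None, int, float, List[int]]) -> Optional[List[int]]:
    """Hypothetical stand-in for the suggested validator body."""
    if v is None:
        return None
    # Coerce a single number (programmatic use) into a one-element list.
    if isinstance(v, (int, float)):
        v = [int(v)]
    if isinstance(v, list):
        if not v:
            raise ValueError('--max-turn-tokens must contain at least one value')
        # 0 is accepted, matching the max_tokens validator mentioned in the review.
        if any(x < 0 for x in v):
            raise ValueError(f'--max-turn-tokens values must be >= 0, got {v}')
    return v


print(normalize_max_turn_tokens(150))        # [150]
print(normalize_max_turn_tokens([0, 1000]))  # [0, 1000]
```

With this shape, downstream code can always index into a list, whether the value came from the CLI or from a Python caller.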
| Turns 2-9 | 150 | Simulate intermediate tool calls |
| Turn 10 | 1000 | Final complete answer |

> **Tip**: The list is automatically extended by reusing the last value. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`.
The example in this tip is incorrect based on the implementation logic. The code reuses the last element of the list for all subsequent turns. Therefore, --max-turn-tokens 150 1000 would result in [150, 1000, 1000, ...] (where turn 0 is 150 and all others are 1000), not the sequence shown in the tip. To achieve the sequence in the tip, the user must provide the full list.
Suggested change:
- > **Tip**: The list is automatically extended by reusing the last value. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`.
+ > **Tip**: The list is automatically extended by reusing the last value for all subsequent turns. For example, `--max-turn-tokens 150 1000` in a 10-turn conversation results in `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]` (the first turn is 150, and all subsequent turns are 1000).
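The extension rule described in this comment can be checked with a few lines of Python. This is only a sketch of the stated behavior (index into the list, fall back to the last element), not the code from the PR; `resolve_turn_max_tokens` and `expand_for_turns` are hypothetical helper names.

```python
from typing import List


def resolve_turn_max_tokens(max_turn_tokens: List[int], turn_index: int) -> int:
    """Pick the value for a 0-based turn, reusing the last value when the list runs out."""
    if turn_index < len(max_turn_tokens):
        return max_turn_tokens[turn_index]
    return max_turn_tokens[-1]


def expand_for_turns(max_turn_tokens: List[int], num_turns: int) -> List[int]:
    """Materialize the per-turn values for a conversation with num_turns turns."""
    return [resolve_turn_max_tokens(max_turn_tokens, i) for i in range(num_turns)]


print(expand_for_turns([150, 1000], 10))
# [150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]
```

Getting nine short turns followed by one long turn therefore requires passing all ten values explicitly.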
| 第 2-9 轮 | 150 | 模拟中间轮工具调用 |
| 第 10 轮 | 1000 | 最终完整回答 |

> **提示**:列表长度不足时自动复用最后一个值。例如 `--max-turn-tokens 150 1000` 在 10 轮对话中效果为 `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`。
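The example in this tip does not match the implementation. The code reuses the last element of the list for all subsequent turns, so `--max-turn-tokens 150 1000` in a 10-turn conversation yields `[150, 1000, 1000, ...]` (the first turn is 150 and every later turn is 1000), not the sequence shown in the tip. To get the sequence in the tip, the user has to provide the full list.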
Suggested change:
- > **提示**:列表长度不足时自动复用最后一个值。例如 `--max-turn-tokens 150 1000` 在 10 轮对话中效果为 `[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]`。
+ > **提示**:列表长度不足时自动复用最后一个值给后续所有轮次。例如 `--max-turn-tokens 150 1000` 在 10 轮对话中效果为 `[150, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000]`(第一轮为 150,后续轮次均为 1000)。
…examples
- Add missing Optional import to openai_api, openai_responses_api, dashscope_api, custom_api, openai_embedding_api, openai_rerank_api
- Improve max_turn_tokens validator: allow >= 0 for consistency with max_tokens, coerce single int to list for programmatic API
- Fix doc tip examples: correct the list extension behavior description in both en and zh multi_turn.md
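Thanks for the review. All items have been fixed (commit …):
- ✅ Fixed: 8 occurrences in …
- ✅ Fixed: validator improvements
- ✅ Fixed: doc Tip examples, corrected in both the Chinese and English docs …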
- custom_multi_turn: extract tools from JSON data and embed into first turn
- openai_api: extract embedded tools and inject into request payload
- openai_responses_api: same tools support for Responses API
- Supports JSON format: {"messages": [...], "tools": [...]}
- Backward compatible: works without tools definitions
Yunnglin left a comment
Thanks for the updates! A few remaining items before this is ready to merge:
1. Bug: `_iter_jsonl` signature mismatch
`_iter_jsonl` is defined without parameters:

    def _iter_jsonl(self) -> Iterator[Tuple[List[Message], Optional[List[Dict]]]]:

but called with `path` in two places:

    yield from self._iter_jsonl(path)

This will raise `TypeError` at runtime. Since the method already reads the path from `self.query_parameters.dataset_path` internally, the call sites should drop the argument:

    yield from self._iter_jsonl()

2. Pre-commit CI still failing (3 hooks)
- isort: import order in `custom_api.py`, `openai_api.py`, `openai_embedding_api.py`, `openai_rerank_api.py`, `openai_responses_api.py`
- yapf: formatting issues
- double-quote-string-fixer: double quotes in `openai_responses_api.py`, `openai_api.py`, `custom.py`
You can run pre-commit run --all-files locally to auto-fix most of these.
3. Title
Don't forget to remove [WIP] from the title when ready.
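For item 1 above, the failure mode is easy to reproduce in isolation. The class below is a hypothetical stand-in, not the dataset plugin from this PR: it only mirrors the call shape, a generator method defined without parameters and a call site that passes `path`.

```python
from typing import Iterator, List


class JsonlReaderSketch:
    """Hypothetical stand-in; only the method signatures matter here."""

    def __init__(self, lines: List[str]) -> None:
        # Stands in for data the real method reads from dataset_path.
        self._lines = lines

    def _iter_jsonl(self) -> Iterator[str]:
        yield from self._lines

    def broken_iter(self, path: str) -> Iterator[str]:
        # Passes an argument that _iter_jsonl does not accept.
        yield from self._iter_jsonl(path)

    def fixed_iter(self) -> Iterator[str]:
        # Call site with the extra argument dropped.
        yield from self._iter_jsonl()


reader = JsonlReaderSketch(['{"a": 1}', '{"a": 2}'])

try:
    list(reader.broken_iter('data.jsonl'))
    broken_error = None
except TypeError as exc:
    broken_error = exc

fixed_result = list(reader.fixed_iter())
print(type(broken_error).__name__, fixed_result)
```

Because the call sits inside a generator, the `TypeError` only surfaces once iteration starts, which is why it can slip past a quick import-level smoke test.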
Design & Problem Statement
Problem
When using `evalscope perf --multi-turn` to simulate Agent tool-calling performance, the actual model produces tool-call structured outputs of specific lengths. Using an open-source model (e.g., Qwen) for simulation has a fundamental mismatch:
- `--max-tokens` is a global parameter → cannot set different values per turn

Concrete scenario: 10-turn conversation simulating tool calls:
Solution Design
New Parameter: `--max-turn-tokens`
- `turn_index` < list length: uses `max_turn_tokens[turn_index]`
- `turn_index` >= list length: reuses the last value (auto-extend)
- Applies in `--multi-turn` mode; ignored otherwise
- When `None`, existing behavior is unchanged

Architecture
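As a rough illustration of how the rules listed under Solution Design could come together when a request payload is composed, here is a minimal sketch. The function `apply_max_tokens` and its parameters are assumptions made for this example and are not the plugin code from the PR; only the `max_tokens` payload key and the selection rules are taken from the description.

```python
from typing import Any, Dict, List, Optional


def apply_max_tokens(
    payload: Dict[str, Any],
    max_tokens: Optional[int],
    max_turn_tokens: Optional[List[int]],
    turn_index: Optional[int] = None,
) -> Dict[str, Any]:
    """Return a copy of payload with max_tokens chosen for this turn (hypothetical helper)."""
    result = dict(payload)
    if max_turn_tokens and turn_index is not None:
        # Clamp the index so turns past the end reuse the last value.
        idx = min(turn_index, len(max_turn_tokens) - 1)
        result['max_tokens'] = max_turn_tokens[idx]
    elif max_tokens is not None:
        # No per-turn list or no turn index: keep the global setting.
        result['max_tokens'] = max_tokens
    return result


base = {'model': 'YOUR_MODEL', 'messages': []}
print(apply_max_tokens(base, 2048, [150, 150, 1000], turn_index=0)['max_tokens'])  # 150
print(apply_max_tokens(base, 2048, [150, 150, 1000], turn_index=5)['max_tokens'])  # 1000
print(apply_max_tokens(base, 2048, None, turn_index=5)['max_tokens'])              # 2048
```

Keeping `turn_index` optional with a default of `None` means plugins and call paths that never pass it continue to see the global `max_tokens` value.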
Usage Example
Simulate 10-turn tool-calling performance:
Prepare JSONL (one 10-turn conversation per line):

    [{"role":"system","content":"<4000 token prompt>"},{"role":"user","content":"<20 token question>"},{"role":"assistant","content":"x"},{"role":"user","content":"继续"},{"role":"assistant","content":"x"},{"role":"user","content":"继续"},{"role":"assistant","content":"x"},{"role":"user","content":"继续"},{"role":"assistant","content":"x"},{"role":"user","content":"继续"},{"role":"assistant","content":"x"},{"role":"user","content":"继续"},{"role":"assistant","content":"x"},{"role":"user","content":"继续"},{"role":"assistant","content":"x"},{"role":"user","content":"继续"},{"role":"assistant","content":"x"},{"role":"user","content":"继续"},{"role":"assistant","content":"x"},{"role":"user","content":"请给出完整的最终回答"}]

Run benchmark:

    evalscope perf \
      --model YOUR_MODEL \
      --url OPENAI_API_COMPAT_URL \
      --api openai \
      --dataset custom_multi_turn \
      --dataset-path tool_call_sim.jsonl \
      --multi-turn \
      --max-turn-tokens 150 150 150 150 150 150 150 150 150 1000 \
      --number 50 \
      --parallel 10 \
      --extra-args '{"ignore_eos": true}'

Changed Files
Core Logic (4 files)
- `evalscope/perf/arguments.py`: + `max_turn_tokens` Pydantic field with validation; + `--max-turn-tokens` CLI arg
- `evalscope/perf/core/strategies/multi_turn.py`: pass `turn_index` to `api_plugin.build_request()`
- `evalscope/perf/plugin/api/openai_api.py`: `__compose_query_from_parameter`: honor `max_turn_tokens` when `turn_index` is provided
- `evalscope/perf/plugin/api/openai_responses_api.py`: `_compose_query_from_parameter`: same for Responses API

API Signature Compatibility (5 files)
- `evalscope/perf/plugin/api/base.py`: + `turn_index: Optional[int] = None` to abstract `build_request`
- `evalscope/perf/plugin/api/dashscope_api.py`: + `turn_index` (passthrough, no logic change)
- `evalscope/perf/plugin/api/custom_api.py`: + `turn_index` (passthrough)
- `evalscope/perf/plugin/api/openai_embedding_api.py`: + `turn_index` (passthrough)
- `evalscope/perf/plugin/api/openai_rerank_api.py`: + `turn_index` (passthrough)

Documentation (4 files)
- `docs/zh/user_guides/stress_test/multi_turn.md`
- `docs/en/user_guides/stress_test/multi_turn.md`
- `docs/zh/user_guides/stress_test/parameters.md`: `--max-turn-tokens`
- `docs/en/user_guides/stress_test/parameters.md`: `--max-turn-tokens`

Related Issue