Feature Request: 多轮对话压测支持逐轮配置 max_tokens 和自定义 system prompt

## 场景描述

我正在使用 `evalscope perf --multi-turn` 进行多轮对话压测，目标是模拟实际场景下 Agent 工具调用的性能表现。

## 具体问题

实际场景中的 Agent 模型会根据场景工具训练，输出包含工具调用。但用开源模型（如 Qwen）模拟时，开源模型不会吐工具调用，输出长度与实际模型不同。

为了近似模拟实际模型的效果，需要：

1. **第一轮**：system prompt 约 4000 token，用户问题约 20 token，模型输出 **150 token**（模拟工具调用）
2. **第 2~9 轮**：将上一轮输出拼接到历史继续问，每轮模型输出 **150 token**
3. **第 10 轮**：最终输出，需要 **1000 token**

总共 10 轮，上下文逐轮增长。

## 现有机制的不足

### `custom_multi_turn`
- assistant 消息只定义对话结构，实际上下文中始终使用模型的**真实输出**
- `--max-tokens` 是全局参数，**无法逐轮设不同值**（前 9 轮 150 vs 第 10 轮 1000）
- 即使用 `--extra-args {"ignore_eos": true}`，也无法解决逐轮不同 max_tokens 的需求

### `swe_smith`
- 支持 `first_turn_length` / `subsequent_turn_length` 控制每轮输入 token 长度
- 但数据来源是 SWE-agent 轨迹，**无法使用自定义 system prompt 和用户问题**
- 同样无法逐轮控制输出长度

## 期望功能

希望多轮压测模式支持以下至少一种方式：

1. **逐轮 max_tokens 配置**：允许通过参数指定每轮的 max_tokens，例如：
   ```
   --multi-turn-tokens "[150, 150, 150, 150, 150, 150, 150, 150, 150, 1000]"
   ```

2. **增强 custom_multi_turn**：允许在 JSONL 中为每条 user 消息指定该轮的 max_tokens

3. **自定义 system prompt 支持**：允许通过参数覆盖 system prompt，而不是完全依赖数据集内容

## 当前 workaround

目前已自行编写脚本模拟该场景（逐轮调用 API，手动拼接历史，逐轮设置 max_tokens），但无法享受 EvalScope 的压测指标采集（RPS、P99、TTFT、TPOT、KV cache hit 等）。

## 环境信息

- EvalScope 版本：latest
- Python 版本：3.x
- 使用场景：OpenAI-compatible API 压测


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: 多轮对话压测支持逐轮配置 max_tokens 和自定义 system prompt #1358

场景描述

具体问题

现有机制的不足

`custom_multi_turn`

`swe_smith`

期望功能

当前 workaround

环境信息

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Feature Request: 多轮对话压测支持逐轮配置 max_tokens 和自定义 system prompt #1358

Description

场景描述

具体问题

现有机制的不足

custom_multi_turn

swe_smith

期望功能

当前 workaround

环境信息

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`custom_multi_turn`

`swe_smith`