diff --git a/games/game_werewolves/utils.py b/games/game_werewolves/utils.py index 34306970..4b817ccd 100644 --- a/games/game_werewolves/utils.py +++ b/games/game_werewolves/utils.py @@ -6,9 +6,9 @@ import numpy as np from agentscope.agent import AgentBase, ReActAgent from agentscope.message import Msg -from prompt import ( +from prompt import ( # pylint: disable=no-name-in-module EnglishPrompts as Prompts, -) # pylint: disable=no-name-in-module +) MAX_GAME_ROUND = 30 MAX_DISCUSSION_ROUND = 3 diff --git a/tuner/data_augment/README.md b/tuner/data_augment/README.md index 9f3eb56d..bcb417f3 100644 --- a/tuner/data_augment/README.md +++ b/tuner/data_augment/README.md @@ -4,18 +4,13 @@ This example demonstrates how to use **AgentScope-Tuner** to enhance a math prob ## Task Setting -We use the foundational [math-agent example](https://github.com/agentscope-ai/agentscope-samples/blob/main/tuner/math_agent/main.py) as our baseline to demonstrate the data enhancement capabilities. Notably, these data-centric techniques are generic and customizable, making them adaptable to other agent workflows. +We use the foundational [math-agent example](https://github.com/agentscope-ai/agentscope-samples/blob/main/tuner/math_agent/main.py) as our baseline. The agent is a **`ReActAgent`** that solves mathematical reasoning problems through step-by-step reasoning. -### Agent Goal and Type -The agent's objective is to solve mathematical reasoning problems, learning to produce a correct final answer through a step-by-step thought process. The agent is implemented as a **`ReActAgent`**, which follows a reasoning-acting loop to solve tasks iteratively. - -### Objective of the Data-Centric Approach - -Training can be inefficient if tasks are too easy or too hard. This example addresses this by providing **selectors** to dynamically select tasks using **data feedback**. 
This empowers users to explore and implement their own data-centric strategies, such as focusing on "productively challenging" samples, to maximize training efficiency. +Training can be inefficient if tasks are too easy or too hard. This example demonstrates how to use **task selectors** to dynamically select tasks based on **data feedback**, focusing on "productively challenging" samples to maximize training efficiency. These data-centric techniques are generic and adaptable to other agent workflows. ## Dataset Preparation -To enable difficulty-based sampling, our training data needs to include features that represent the "difficulty" of each task. +To enable difficulty-based sampling, the training data must include difficulty features (e.g., pass rates from LLMs). 1. **Base Dataset**: You can use any standard math problem dataset. A good example is the math data in [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k), which comes pre-annotated with pass rates from different LLMs, serving as direct difficulty features. 2. **Build Your Own Features**: If you use your own dataset, you can generate these features by pre-running several models of varying capabilities and recording their pass rates. This can be done within the [**Trinity-RFT**](https://github.com/agentscope-ai/Trinity-RFT/pull/440) framework. @@ -34,7 +29,7 @@ Leveraging the powerful data processing capabilities of **Trinity-RFT**, **Agent #### Task Selector -The `Task Selector` determines how samples are selected from a dataset. It can be configured directly in `Yaml Config`. +The `Task Selector` determines how samples are selected from a dataset. It can be configured directly in configuration YAML files. - **Built-in Selectors**: - `sequential`: Samples are selected in a fixed order. @@ -43,7 +38,7 @@ The `Task Selector` determines how samples are selected from a dataset. It can b - `offline_easy2hard`: Samples are sorted by a predefined feature for curriculum learning. 
- `difficulty_based` (Customized): An adaptive sampler based on task difficulty. -> For more details on `Task Selector`, including how to implement a custom selector based on feedback signals, please refer to **Trinity-RFT**'s **[Selector Development Guide](https://github.com/agentscope-ai/Trinity-RFT/blob/main/docs/sphinx_doc/source/tutorial/develop_selector.md)**. +> For more details on `Task Selector`, including how to implement a custom selector based on feedback signals, please refer to **Trinity-RFT**'s **[Selector Development Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/develop_selector.html)**. #### Data Processor @@ -51,7 +46,7 @@ The `Data Processor` allows for real-time processing of **Task** and **Experienc For example, the `difficulty_based` selector requires a `pass_rate_calculator` operator to compute the agent's success rate for each task. This feedback is then used to adjust the sampling strategy. -> For more details on `Data Processor`, please refer to **Trinity-RFT**'s **[Operator Development Guide](https://github.com/agentscope-ai/Trinity-RFT/blob/main/docs/sphinx_doc/source/tutorial/develop_operator.md)**. +> For more details on `Data Processor`, please refer to **Trinity-RFT**'s **[Operator Development Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/develop_operator.html)**. ### Configuring the Experiments @@ -147,7 +142,9 @@ python main.py --config config_difficulty.yaml The following results compare the performance of the `difficulty-based` selection strategy (red line, bots) against a standard `random` selection strategy (black line, random). -![Training Result Image](./training_result.jpg) +
+  <img src="./training_result.jpg" alt="Training Result Image"> +
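The difficulty-based selection idea can be sketched in a few lines: prefer tasks whose pre-computed pass rate sits near 0.5, i.e. neither trivially easy nor hopeless. This is only an illustrative sketch — the function name `select_batch` is ours and the feature key `qwen2.5_7b_pass_rate` follows the example dataset; it is not the Trinity-RFT selector implementation.

```python
def select_batch(tasks, feature_key="qwen2.5_7b_pass_rate", batch_size=4):
    # Rank tasks by the distance of their pass rate from 0.5 and keep
    # the closest ones ("productively challenging" samples).
    ranked = sorted(tasks, key=lambda t: abs(t[feature_key] - 0.5))
    return ranked[:batch_size]

tasks = [
    {"id": 0, "qwen2.5_7b_pass_rate": 0.95},  # too easy
    {"id": 1, "qwen2.5_7b_pass_rate": 0.05},  # too hard
    {"id": 2, "qwen2.5_7b_pass_rate": 0.55},  # near the sweet spot
    {"id": 3, "qwen2.5_7b_pass_rate": 0.40},  # near the sweet spot
]
print([t["id"] for t in select_batch(tasks, batch_size=2)])  # → [2, 3]
```

In practice this ranking is what the configured `difficulty_based` selector automates, with the pass-rate feedback refreshed during training.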
### Training Reward Curve diff --git a/tuner/data_augment/README_zh.md b/tuner/data_augment/README_zh.md new file mode 100644 index 00000000..97ec32ad --- /dev/null +++ b/tuner/data_augment/README_zh.md @@ -0,0 +1,153 @@ +# 使用数据增强策略训练数学智能体 + +本示例演示了如何使用 **AgentScope-Tuner** 训练数学问题求解智能体。我们将重点利用**以数据为中心**的功能,例如 `difficulty_based` 任务选择器,以提高数据利用率和训练效率。 + +## 任务设置 + +我们使用基础的 [math-agent 示例](https://github.com/agentscope-ai/agentscope-samples/blob/main/tuner/math_agent/main.py) 作为基线。智能体是 **`ReActAgent`**,通过逐步推理解决数学推理问题。 + +如果任务太容易或太难,训练可能会效率低下。本示例演示如何使用**任务选择器**基于**数据反馈**动态选择任务,专注于"具有挑战性"的样本以最大化训练效率。这些以数据为中心的技术是通用的,可适应其他智能体工作流。 + +## 数据集准备 + +为启用基于难度的采样,训练数据必须包含难度特征(如 LLM 的通过率)。 + +1. **基础数据集**:您可以使用任何标准的数学问题数据集。一个很好的例子是 [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k) 中的数学数据,它预先标注了来自不同 LLM 的通过率,作为直接的难度特征。 +2. **构建您自己的特征**:如果您使用自己的数据集,可以通过预先运行几个不同能力的模型并记录它们的通过率来生成这些特征。这可以在 [**Trinity-RFT**](https://github.com/agentscope-ai/Trinity-RFT/pull/440) 框架内完成。 +3. **数据格式**:最终数据集应为 HuggingFace 格式。在此示例中,数据将根据[工作流](https://github.com/agentscope-ai/agentscope-samples/blob/main/tuner/math_agent/main.py)转换为 *GSM8K 格式*。除了任务内容外,它还必须包含您定义的难度特征列(例如 `qwen2.5_7b_pass_rate`)。 +4. 
**示例数据准备**:我们为此示例提供了一个脚本。只需执行 `python prepare_data.py` 即可生成所需的数据集。 + +## 代码实现 + +本示例采用 [math-agent 示例](https://github.com/agentscope-ai/agentscope-samples/blob/main/tuner/math_agent/main.py) 的 `run_react_agent` 和 `gsm8k_judge` 作为 `workflow_func` 和 `judge_func`,说明可以在不改变核心智能体逻辑的情况下应用训练策略。 + +### 以数据为中心功能的设计 + +利用 **Trinity-RFT** 强大的数据处理能力,**AgentScope-Tuner** 为任务选择和经验处理等高级操作提供了接口。 + +#### 任务选择器 + +`Task Selector` 决定如何从数据集中选择样本。它可以直接在 YAML 配置文件中配置。 + +- **内置选择器**: + - `sequential`:按固定顺序选择样本。 + - `shuffle`:在每个 epoch 开始时打乱数据集。 + - `random`:为每个批次随机选择样本(有放回)。 + - `offline_easy2hard`:按预定义特征对样本进行排序,用于课程学习。 + - `difficulty_based`(自定义):基于任务难度的自适应采样器。 + +> 有关 `Task Selector` 的更多详细信息,包括如何基于反馈信号实现自定义选择器,请参阅 **Trinity-RFT** 的 **[Selector 开发指南](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/develop_selector.html)**。 + +#### 数据处理器 + +`Data Processor` 允许在训练期间实时处理**任务**(task)和**经验**(experience),支持计算反馈指标、数据增强或过滤等操作。 + +例如,`difficulty_based` 选择器需要一个 `pass_rate_calculator` 操作符来计算智能体对每个任务的成功率。然后使用此反馈来调整采样策略。 + +> 有关 `Data Processor` 的更多详细信息,请参阅 **Trinity-RFT** 的 **[Operator 开发指南](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/develop_operator.html)**。 + + +### 配置实验 + +为了保持清晰和简洁,我们建议在 YAML 配置文件中定义所有数据特定参数,包括数据集路径和任务选择器。 + +我们提供两个配置文件,用于比较基线 `random` 选择器与 `difficulty_based` 选择器。 + +**实验 1:使用随机选择器的基线(`config_random.yaml`)** + +在 `config_random.yaml` 中,我们在 `buffer.explorer_input.taskset` 下配置用于随机采样的 `task_selector`。 + +```yaml +# 在 config_random.yaml 中 +buffer: + # ... + explorer_input: + taskset: # 训练数据 + path: "path/to/your/augmented/math_data" + split: "train" + task_selector: + selector_type: random # 任务选择策略 +``` + +**实验 2:使用基于难度选择器的高级训练(`config_difficulty.yaml`)** + +在 `config_difficulty.yaml` 中,我们将 `task_selector` 切换为 `difficulty_based` 并提供其特定参数。请注意,此配置还启用了反馈所需的 `pass_rate_calculator`。 + +```yaml +# 在 config_difficulty.yaml 中 + +# 启用计算器为选择器提供反馈 +data_processor: + experience_pipeline: + operators: + - name: pass_rate_calculator + +buffer: + # ... 
+ explorer_input: + taskset: # 训练数据 + path: "path/to/your/augmented/math_data" + split: "train" + task_selector: + selector_type: difficulty_based # 任务选择策略 + feature_keys: [ "qwen2.5_7b_pass_rate", "qwen3_30b_pass_rate" ] + kwargs: # 选择算法的超参数 + m: 8 + # ... +``` + +> 本示例中的 `difficulty_based` 选择器是 ***BOTS*** 算法的实现。有关其内部工作原理的详细信息,请参阅 [***BOTS 论文***](https://arxiv.org/abs/2510.26374) 及其 [***教程***](https://github.com/agentscope-ai/Trinity-RFT/blob/main/examples/bots/README.md)。 + +## 如何运行 + +### 步骤 1:前置要求 + +确保您已按照[指南](https://github.com/agentscope-ai/agentscope-samples/blob/main/tuner/math_agent/README_zh.md#how-to-run)安装了 **AgentScope** 和 **Trinity-RFT**。 + +### 步骤 2:准备数据集 + +运行数据准备脚本。确保之后更新 `config_random.yaml` 和 `config_difficulty.yaml` 中的数据集路径。 + +```bash +python prepare_data.py +``` + +### 步骤 3:启动 Ray 集群 + +对于分布式训练,启动 Ray 集群。 + +```bash +# 单节点 +ray start --head +``` + +### 步骤 4:运行训练 + +您现在可以运行基线或基于难度的训练实验。 + +- **使用随机选择器运行基线实验:** + +```bash +python main.py --config config_random.yaml +``` + +- **使用基于难度的选择器运行实验:** +```bash +python main.py --config config_difficulty.yaml +``` + +## 实验结果 + +以下结果比较了 `difficulty-based` 选择策略(红线,bots)与标准 `random` 选择策略(黑线,random)的性能。 + +
+  <img src="./training_result.jpg" alt="训练结果图"> +
+ +### 训练奖励曲线 + +左侧图表显示了训练期间的 rollout 准确率。可以看出,随机策略采样的任务对模型来说似乎很困难,准确率保持在 0.2 以下。相比之下,使用难度选择器会产生更高的平均准确率,表明智能体正在处理更多可以成功解决的任务。 + +### 在 AIME-24 上的评估 + +为了比较,我们在 AIME-24 基准上评估了两种选择策略。右侧图表显示,基于难度的方法在性能上表现出更好的上升趋势。 diff --git a/tuner/data_augment/config_difficulty.yaml b/tuner/data_augment/config_difficulty.yaml index 422dca8a..1b83b00d 100644 --- a/tuner/data_augment/config_difficulty.yaml +++ b/tuner/data_augment/config_difficulty.yaml @@ -71,4 +71,4 @@ explorer: trainer: save_interval: 100 # Save checkpoint every N steps use_dynamic_bsz: true # Use dynamic batch size - ulysses_sequence_parallel_size: 1 # Sequence parallel size for Ulysses \ No newline at end of file + ulysses_sequence_parallel_size: 1 # Sequence parallel size for Ulysses diff --git a/tuner/data_augment/config_random.yaml b/tuner/data_augment/config_random.yaml index f80d524e..78bf24d5 100644 --- a/tuner/data_augment/config_random.yaml +++ b/tuner/data_augment/config_random.yaml @@ -59,4 +59,4 @@ explorer: trainer: save_interval: 100 # Save checkpoint every N steps use_dynamic_bsz: true # Use dynamic batch size - ulysses_sequence_parallel_size: 1 # Sequence parallel size for Ulysses \ No newline at end of file + ulysses_sequence_parallel_size: 1 # Sequence parallel size for Ulysses diff --git a/tuner/email_search/README.md b/tuner/email_search/README.md index d976b0d6..aeedb9d7 100644 --- a/tuner/email_search/README.md +++ b/tuner/email_search/README.md @@ -189,9 +189,13 @@ The training results show improvements in agent performance over training iterat - **Train reward**: The average reward on training samples increases as the agent learns better strategies - **Rollout accuracy**: The average accuracy on rollout samples increases as the agent learns better strategies -![Training Rewards](./critic_reward_mean.png) +
+  <img src="./critic_reward_mean.png" alt="Training Rewards"> +
-![Rollout Accuracy](./rollout_accuracy_mean.png) +
+  <img src="./rollout_accuracy_mean.png" alt="Rollout Accuracy"> +
### Concrete Example diff --git a/tuner/email_search/README_zh.md b/tuner/email_search/README_zh.md index f5e409cd..4c920ae3 100644 --- a/tuner/email_search/README_zh.md +++ b/tuner/email_search/README_zh.md @@ -189,9 +189,13 @@ async def email_search_judge( - **训练奖励**:训练样本上的平均奖励随着智能体学习更好的策略而增加 - **Rollout 准确度**:Rollout 样本上的平均准确度随着智能体学习更好的策略而增加 -![Training Rewards](./critic_reward_mean.png) +
+  <img src="./critic_reward_mean.png" alt="Training Rewards"> +
-![Rollout Accuracy](./rollout_accuracy_mean.png) +
+  <img src="./rollout_accuracy_mean.png" alt="Rollout Accuracy"> +
### 具体示例 diff --git a/tuner/frozen_lake/README.md b/tuner/frozen_lake/README.md index dca3b94c..33bdb965 100644 --- a/tuner/frozen_lake/README.md +++ b/tuner/frozen_lake/README.md @@ -255,7 +255,9 @@ The training will start and you can monitor the progress through the logs. Check The reward curve during training shows the agent's learning progress: -![reward](./critic_rewards_mean.png) +
+  <img src="./critic_rewards_mean.png" alt="reward"> +
The training reward typically increases over epochs as the agent learns to navigate the frozen lake more effectively. diff --git a/tuner/frozen_lake/README_zh.md b/tuner/frozen_lake/README_zh.md index 6b54c6fd..a91a01f1 100644 --- a/tuner/frozen_lake/README_zh.md +++ b/tuner/frozen_lake/README_zh.md @@ -234,7 +234,9 @@ python main.py 训练过程中的奖励曲线显示智能体的学习进度: -![reward](./critic_rewards_mean.png) +
+  <img src="./critic_rewards_mean.png" alt="reward"> +
训练奖励通常随着智能体学习更有效地导航冰湖而随训练轮次增加。 diff --git a/tuner/learn_to_ask/README.md b/tuner/learn_to_ask/README.md index b53f3841..6a204b49 100644 --- a/tuner/learn_to_ask/README.md +++ b/tuner/learn_to_ask/README.md @@ -30,7 +30,7 @@ Key files: ## Dataset Preparation > [!NOTE] -> In this example, we use an open-source dataset directly for training. In practice, however, you would typically start by collecting interaction logs between your deployed agent and users. After filtering these raw logs to curate a high-quality dataset, you can follow the same pipeline to enhance your agent’s proactive capabilities using AgentTune. Happy tuning! +> In this example, we use an open-source dataset directly for training. In practice, however, you would typically start by collecting interaction logs between your deployed agent and users. After filtering these raw logs to curate a high-quality dataset, you can follow the same pipeline to enhance your agent’s proactive capabilities using AgentScope-Tuner. Happy tuning! ### 1.1 Download the Dataset Download the **[RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv)** dataset (in `.jsonl` format). @@ -183,50 +183,33 @@ async def learn2ask_judge( response: Msg, auxiliary_models: Dict[str, OpenAIChatModel], ) -> JudgeOutput: - assert ( - len(auxiliary_models) == 1 - ), "Please provide only one `auxiliary_models` for `learn_to_ask`." 
- response_text = response.get_text_content() - action_truth = ( - task["decision_truth"] if "decision_truth" in task else "continue" - ) - + action_truth = task.get("decision_truth", "continue") action_response = "stop" if "" in response_text else "continue" - if action_truth == action_response: - action_score = 1.0 - if action_truth == "continue": - score_dict = await llm_reward( - task=task, - response=response_text, - auxiliary_models=auxiliary_models, # LLM-as-a-Judge - ) - if score_dict != {}: - format_score = float(score_dict.get("format_score", 0.0)) - content_score = float(score_dict.get("content_score", 0.0)) - else: - format_score, content_score = 0.0, 0.0 - else: - content_score = 1.0 - format_score = 1.0 if response_text == "" else 0.0 + + # Calculate action accuracy score + action_score = 1.0 if action_truth == action_response else 0.0 + + # Calculate format and content scores + if action_score == 1.0 and action_truth == "continue": + # Use LLM-as-a-Judge to evaluate question quality + score_dict = await llm_reward(task, response_text, auxiliary_models) + format_score = float(score_dict.get("format_score", 0.0)) + content_score = float(score_dict.get("content_score", 0.0)) + elif action_score == 1.0: # stop action + content_score, format_score = 1.0, (1.0 if response_text == "" else 0.0) else: - action_score, format_score, content_score = 0.0, 0.0, 0.0 + format_score = content_score = 0.0 - if TRAIN_MODE == "Ra+Rs": # the default setting - final_reward = ( - action_score * (1 + 2 * content_score) + format_score - if FUSION_MODE != "sum" - else action_score + content_score + format_score - ) - elif TRAIN_MODE == "Ra": # for Ra only (without Rs) + # Combine final reward based on training mode + if TRAIN_MODE == "Ra+Rs": # Default: action + symptom rewards + final_reward = action_score * (1 + 2 * content_score) + format_score + elif TRAIN_MODE == "Ra": # Action reward only final_reward = 2 * content_score + format_score - else: # for Rs only (without Ra) 
+ else: # Symptom reward only final_reward = action_score * 3 + format_score - return JudgeOutput( - reward=final_reward, - metrics={"reward": final_reward}, - ) + return JudgeOutput(reward=final_reward, metrics={"reward": final_reward}) ``` This reward function considers: @@ -322,9 +305,9 @@ python tuner/learn_to_ask/data_prepare/3_rollout_then_evaluate.py \ We compared three approaches: - **Base model**: `Qwen2.5-7B-Instruct` (no fine-tuning) - **Trinity**: Direct response generation -- **As-tune (Learn2Ask)**: Uses a ReAct agent for proactive questioning +- **AgentScope-Tuner (Learn2Ask)**: Uses a ReAct agent for proactive questioning -| Metric | Base Model | Trinity | As-tune (Learn2Ask) | +| Metric | Base Model | Trinity | AgentScope-Tuner (Learn2Ask) | |--------------------------------------|-----------:|--------:|--------------------:| | Avg. continue content | 0.436 | 0.496 | 0.509 | | Win rate (continue content) | 0.122 | 0.246 | 0.224 | @@ -334,9 +317,11 @@ We compared three approaches: | **Response format quality** | 0.376 | 0.713 | 0.882 | | **Total reward** | 1.281 | 3.078 | 3.237 | -![Training Curves](./learn2ask.png) +
+  <img src="./learn2ask.png" alt="Training Curves"> +
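To make the default `Ra+Rs` fusion concrete, here is a tiny worked check of the formula used in the judge function above, `action_score * (1 + 2 * content_score) + format_score`. Only the formula comes from the example; the helper name `fuse_ra_rs` is ours.

```python
def fuse_ra_rs(action_score, content_score, format_score):
    # Default "Ra+Rs" fusion from learn2ask_judge: the action decision
    # gates the content reward, and format quality is added on top.
    return action_score * (1 + 2 * content_score) + format_score

# Correct "continue" decision, half-relevant question, clean format:
print(fuse_ra_rs(1.0, 0.5, 1.0))  # → 3.0
# Wrong decision: the judge zeroes all component scores:
print(fuse_ra_rs(0.0, 0.0, 0.0))  # → 0.0
```

The gating means a model cannot collect content reward while making the wrong continue/stop decision, which is exactly the behavior the training mode comparison below probes.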
-> ✅ **Key insight**: Learn2Ask (As-tune) achieves the highest overall performance by teaching the model **when and what to ask**—making it truly proactive. +> ✅ **Key insight**: Learn2Ask (AgentScope-Tuner) achieves the highest overall performance by teaching the model **when and what to ask**—making it truly proactive. ### Concrete Example diff --git a/tuner/learn_to_ask/README_zh.md b/tuner/learn_to_ask/README_zh.md new file mode 100644 index 00000000..4b6d23a5 --- /dev/null +++ b/tuner/learn_to_ask/README_zh.md @@ -0,0 +1,391 @@ +# 使用 AgentScope-Tuner 训练 Learn2Ask + +本指南演示了如何使用来自 [Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs](https://arxiv.org/abs/2510.25441) 的 **Learn2Ask** 方法训练主动式 LLM。 + +--- + +## 任务设置 + +在此示例中,给定用户的主诉,医疗助手智能体主动提出有针对性的问题,以收集足够的症状信息,从而全面评估用户的健康状况。查询过程应该高效:智能体必须优化问题质量,并在收集的信息足以进行后续临床评估或决策时立即终止访谈。 +这里我们使用 `ReActAgent` 来完成此任务,不需要工具。 + +--- + +## 硬件要求 + +- **使用 GPU 训练**:至少需要 **8 个 H20 GPU**(或同等配置)。 +- **不使用 GPU 训练**:您可以使用 **[Tinker](https://thinkingmachines.ai/tinker/)**,无需任何 GPU。 + +> 💡 所有代码和配置文件位于: +> `tuner/learn_to_ask/` + +关键文件: +- 工作流和训练:`tuner/learn_to_ask/main.py` +- 提示词:`tuner/learn_to_ask/prompt.py` +- 训练配置:`tuner/learn_to_ask/config.yaml` +- 数据准备脚本:`tuner/learn_to_ask/data_prepare/` + +--- + +## 数据集准备 + +> [!NOTE] +> 在此示例中,我们直接使用开源数据集进行训练。然而,在实践中,您通常需要先收集已部署智能体与用户之间的交互日志。在过滤这些原始日志以整理高质量数据集后,您可以遵循相同的流程,使用 AgentScope-Tuner 增强智能体的主动能力。祝调优愉快! 
+ +### 1.1 下载数据集 +下载 **[RealMedConv](https://huggingface.co/datasets/datajuicer/RealMedConv)** 数据集(`.jsonl` 格式)。 +您可以使用以下 Python 脚本下载数据集: + +```python +from huggingface_hub import snapshot_download + +# 下载到本地目录,例如 `./tuner/learn_to_ask/data` +local_dir = "./tuner/learn_to_ask/data" +snapshot_download( + repo_id="datajuicer/RealMedConv", + repo_type="dataset", + local_dir=local_dir, +) +``` + +`train_origin.jsonl`(或 `test_origin.jsonl`)中的每一行代表一个完整的医患对话日志,如下所示: + +```json +{ + "session_id": 35310, + "diagn": "Upper Respiratory Tract Infection", + "messages": [ + {"role": "user", "content": "Sore throat, phlegm, red eyes, cough, hoarse voice"}, + {"role": "user", "content": "I took Amoxicillin"}, + ... + {"role": "assistant", "content": ""} + ] +} +``` + +### 1.2 预处理数据 +您需要将原始对话日志转换为训练样本。这涉及两个步骤: + +#### 🔹 步骤 A:分割对话并提取标签 +将每个对话分割为**context–future pairs**,并从后续内容中提取真实症状信息(`info_truth`)。 + +```bash +python tuner/learn_to_ask/data_prepare/1_info_extract_pipeline.py \ + --input_file /path/to/RealMedConv/train.jsonl \ + --output_file tuner/learn_to_ask/data_raw/train_processed.jsonl \ + --model_path Qwen/Qwen2.5-32B-Instruct +``` + +#### 🔹 步骤 B:构建最终训练数据集 +将处理后的样本转换为用于训练/测试的最终格式。 + +```bash +python tuner/learn_to_ask/data_prepare/2_build_dataset.py \ + --input_file tuner/learn_to_ask/data_raw/train_processed.jsonl \ + --output_file tuner/learn_to_ask/data/train.jsonl +``` + +--- + +### 工作原理:Context–Future pairsSegmentation + +对于对话中的每一轮,我们创建一个样本,包含: +- `messages`:到该点为止的**已观察对话历史**(context)。 +- `remaining_chat`:该点之后发生的**所有内容**(future)。 +- 唯一 ID:`cid = {session_id}_{turn_index}` + +示例输出: +```json +{ + "cid": "35310_7", + "session_id": "35310", + "diagn": "Upper Respiratory Tract Infection", + "messages": [ ... up to turn 7 ... ], + "remaining_chat": [ ... all future messages ... ] +} +``` + +### 提取真实标签 + +从 `remaining_chat` 中,我们自动推导出两个关键标签: +- `decision_truth`:助手应该继续提问(`"continue"`)还是停止(`"stop"`)? 
+- `info_truth`:后续提到的结构化症状列表(用于在训练期间计算奖励信号)。 + +示例: +```json +{ + "decision_truth": "continue", + "info_truth": "Symptom: sore throat, Symptom quality: thick discharge, Symptom quality: yellowish discharge, ..." +} +``` + +这些标签在训练期间为奖励函数 $R_a$(动作准确性)和 $R_s$(症状覆盖率)提供支持。 + +--- + +## 代码实现 + +### 智能体工作流 + +工作流函数 `run_react_agent` 实现了 `ReActAgent` 的工作方式。 + +```python +async def run_react_agent( + task: Dict, + model: OpenAIChatModel, + auxiliary_models: Dict[str, OpenAIChatModel], +) -> WorkflowOutput: + assert ( + len(auxiliary_models) == 1 + ), "Please provide only one `auxiliary_models` for `learn_to_ask`." + + import importlib + + spec = importlib.util.spec_from_file_location( + "prompt", + os.path.join(os.path.dirname(__file__), "prompt.py"), + ) + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + if TRAIN_MODE == "Ra": + sys_prompt = module.rollout_prompt_med_Ra + else: + sys_prompt = module.rollout_prompt_med + + agent = ReActAgent( + name="react_agent", + sys_prompt=sys_prompt, + model=model, + formatter=OpenAIChatFormatter(), + toolkit=None, + memory=InMemoryMemory(), + max_iters=1, + ) + messages = format_messages(task["messages"]) + response = await agent.reply( + [ + Msg(name=x["role"], content=x["content"], role=x["role"]) + for x in messages + ], + ) + return WorkflowOutput( + response=response, + ) +``` + +### 评判函数 + +评判函数 `learn2ask_judge` 使用 LLM-as-a-Judge 实现奖励计算: + +```python +async def learn2ask_judge( + task: Dict, + response: Msg, + auxiliary_models: Dict[str, OpenAIChatModel], +) -> JudgeOutput: + response_text = response.get_text_content() + action_truth = task.get("decision_truth", "continue") + action_response = "stop" if "" in response_text else "continue" + + # 计算动作准确性分数 + action_score = 1.0 if action_truth == action_response else 0.0 + + # 计算格式和内容分数 + if action_score == 1.0 and action_truth == "continue": + # 使用 LLM-as-a-Judge 评估问题质量 + score_dict = await llm_reward(task, response_text, auxiliary_models) + 
format_score = float(score_dict.get("format_score", 0.0)) + content_score = float(score_dict.get("content_score", 0.0)) + elif action_score == 1.0: # stop 动作 + content_score, format_score = 1.0, (1.0 if response_text == "" else 0.0) + else: + format_score = content_score = 0.0 + + # 根据训练模式组合最终奖励 + if TRAIN_MODE == "Ra+Rs": # 默认:动作 + 症状奖励 + final_reward = action_score * (1 + 2 * content_score) + format_score + elif TRAIN_MODE == "Ra": # 仅动作奖励 + final_reward = 2 * content_score + format_score + else: # 仅症状奖励 + final_reward = action_score * 3 + format_score + + return JudgeOutput(reward=final_reward, metrics={"reward": final_reward}) +``` + +此奖励函数考虑: +- 动作准确性:`action_score` +- 问题质量(症状覆盖率):`content_score` +- 格式分数:`format_score` + +有关实现细节,请参阅 [main.py](./main.py)。 + +--- + +## 配置和训练模型 + +### 选项 A:编辑 Python 脚本(简单) +打开 `tuner/learn_to_ask/main.py` 并调整设置: + +```python +if __name__ == "__main__": + train_mode = "Ra+Rs" # 同时使用动作和症状奖励 + fusion_mode = "default" # 如何组合奖励 + dataset = DatasetConfig(path="tuner/learn_to_ask/data", split="train") + + tuner_model = OpenAIChatModel( + model_path="Qwen/Qwen2.5-7B-Instruct", + max_model_len=8192, + tensor_parallel_size=1, # 根据您的 GPU 设置调整 + ... + ) + + auxiliary_models = { + AUXILIARY_MODEL_NAME: OpenAIChatModel( + model_path="Qwen/Qwen2.5-32B-Instruct", # 用于评估的更大模型 + tensor_parallel_size=2, + ... + ) + } + + algorithm = AlgorithmConfig( + algorithm_type="grpo", + learning_rate=5e-7, + batch_size=64, + ) + + tune(...) # 开始训练 +``` + +### 选项 B:使用 YAML 配置(高级) +编辑 `tuner/learn_to_ask/config.yaml` 以获得更多控制。 + +#### 🌐 没有 GPU?使用 Tinker! +如果您没有 GPU,可以通过设置启用 **Tinker 后端**: + +```yaml +model: + tinker: + enable: true # ← 将此设置为 true +``` + +此外,请确保更新 `tuner/learn_to_ask/main.py` 中的 `model_path`,使其指向与 Tinker 兼容的模型。 + +> 🔗 了解更多关于 Tinker 后端: [Tinker 后端文档](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/example_tinker_backend.html) + +### 启动训练 +```bash +python tuner/learn_to_ask/main.py +``` + +--- + +## 评估 + +使用**rollout 和评估流程**: +1. 
在测试集上生成响应。 +2. 使用强大的评估模型(`Qwen2.5-32B-Instruct`)对它们进行评分。 + +运行评估: +```bash +python tuner/learn_to_ask/data_prepare/3_rollout_then_evaluate.py \ + --eval_model_path path/to/your/trained/model \ + --grader_model_path Qwen/Qwen2.5-32B-Instruct \ + --test_file_path tuner/learn_to_ask/data/test.jsonl \ + --rollout_file_path path/to/rollout.jsonl \ + --eval_file_path path/to/output.jsonl +``` + +> ⚠️ **注意**:您的训练模型必须首先转换为 **Hugging Face 格式**。 +> 请参阅:[转换 FSDP 检查点指南](https://agentscope-ai.github.io/Trinity-RFT/zh/main/tutorial/faq.html) + +--- + +## 实验结果 + +我们比较了三种方法: +- **基础模型**:`Qwen2.5-7B-Instruct`(无微调) +- **Trinity**:直接响应生成 +- **AgentScope-Tuner (Learn2Ask)**:使用 ReAct 智能体进行主动提问 + +| 指标 | 基础模型 | Trinity | AgentScope-Tuner (Learn2Ask) | +|------------------------------------|---------:|--------:|--------------------:| +| 平均继续内容 | 0.436 | 0.496 | 0.509 | +| 胜率(继续内容) | 0.122 | 0.246 | 0.224 | +| 平均继续决策准确性 | 0.963 | 0.909 | 0.922 | +| 平均停止决策准确性 | 0.024 | 0.927 | 0.909 | +| **总决策准确性** | 0.467 | 0.918 | 0.915 | +| **响应格式质量** | 0.376 | 0.713 | 0.882 | +| **总奖励** | 1.281 | 3.078 | 3.237 | + +
+  <img src="./learn2ask.png" alt="训练曲线"> +
+ +> ✅ **关键洞察**:Learn2Ask (AgentScope-Tuner) 通过教模型**何时提问和提问什么**,实现了最高的整体性能——使其真正具有主动性。 + +### 具体示例 + +下面显示了智能体在评估数据上的行为示例: + +```json +{ + "cid": "206_1", + "session_id": 206, + "diagn": "Common cold", + "messages": [ + { + "role": "assistant", + "content": "Hello, currently there are many people consulting, please wait patiently." + }, + { + "role": "user", + "content": "Hello, this morning around 10 o'clock, I suddenly felt a bit feverish and had a heavy head. I took my temperature and it was 38.5°C. I took one tablet of paracetamol at 1 o'clock, which helped a lot. By 6 or 7 in the evening, I felt cold and feverish again, so I took another tablet of paracetamol. I'm not sure if it's a side effect of the medication, but I feel a bit nauseous and my stomach keeps producing gas. Now, I feel cold again, have a stuffy nose with clear runny nose, an itchy throat with coughing, a heavy head, and can't sleep." + } + ], + "decision_truth": "continue", + "info_truth": "symptom: needle-like sensation, symptom location: ear, symptom nature: sharp pain, symptom severity: occurred twice", + "rollouts": [ + "Do you also experience body aches along with your symptoms? \nA) Yes B) No C) Not Sure", + "Do you experience these symptoms every time after taking paracetamol? (A) Yes, definitely. (B) Sometimes. (C) No, never before.", + "Do you have a history of allergies? A) Yes B) No C) Not Sure" + ], + "grades": [ + { + "think": "The doctor's last message asks about body aches using multiple choice options rather than a single direct open-ended question, which deviates from the typical questioning style indicated by the grading policy. However, since it is framed in the form of \"Do you...\", it counts as one question despite being presented with choices. Thus, it satisfies the criteria for 'contains exactly one question'. 
For the content score, the query does not pertain directly to the details listed in the reference information such as the specific symptom ('needle-like' sharp pain in the ear that has occurred twice), thereby making it highly relevant yet indirectly related due to its focus on comprehensive assessment of flu-related discomforts without addressing specifics noted in the patient's primary concern documented earlier.", + "action_score": 1, + "format_score": 1.0, + "content_score": 0.5 + }, + { + "think": "The doctor's last message includes just one multiple-choice question regarding whether the patient experiences those mentioned symptoms each time they take paracetamol. This does relate highly to understanding possible drug-related symptoms; however, none of them aligns perfectly with \"needle-like\" sensations occurring specifically in ears according to the reference information given.", + "action_score": 1, + "format_score": 1.0, + "content_score": 0.5 + }, + { + "think": "The doctor's last statement does contain just one question pertaining to allergy history, which is highly relevant when trying to diagnose symptoms such as those described by the patient (fever, nausea). However, none of these concerns specifically relate back to the reference information detailing \"needle-like sensation\", \"sharp pain\" related to the ears occurring twice. 
Therefore, while highly pertinent medically, they do not pertain to the exact points outlined in the Ref Info section about the patient experience according to that specific prompt context.", + "action_score": 1, + "format_score": 1.0, + "content_score": 0.5 + } + ] +} +``` + +--- + +## 📚 引用 + +如果您使用此代码或框架,请引用我们的工作: + +```bibtex +@misc{learn2ask, + title={Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs}, + author={Fei Wei and Daoyuan Chen and Ce Wang and Yilun Huang and Yushuo Chen and Xuchen Pan and Yaliang Li and Bolin Ding}, + year={2025}, + eprint={2510.25441}, + archivePrefix={arXiv}, + primaryClass={cs.AI}, + url={https://arxiv.org/abs/2510.25441} +} +``` diff --git a/tuner/math_agent/README.md b/tuner/math_agent/README.md index 8035b0b7..2939cf3d 100644 --- a/tuner/math_agent/README.md +++ b/tuner/math_agent/README.md @@ -372,4 +372,6 @@ After implementing the workflow function, follow these steps to run the training An example reward curve is shown below: - ![reward_curve](./reward_curve.png) +
+  <img src="./reward_curve.png" alt="reward_curve"> +
diff --git a/tuner/math_agent/README_zh.md b/tuner/math_agent/README_zh.md index e3a8abf6..82ac87ed 100644 --- a/tuner/math_agent/README_zh.md +++ b/tuner/math_agent/README_zh.md @@ -365,4 +365,6 @@ if __name__ == "__main__": 奖励曲线示例: - ![reward_curve](./reward_curve.png) +
+  <img src="./reward_curve.png" alt="reward_curve"> +
diff --git a/tuner/werewolves/README.md b/tuner/werewolves/README.md index 876251f4..9c78e3dd 100644 --- a/tuner/werewolves/README.md +++ b/tuner/werewolves/README.md @@ -1,10 +1,10 @@ # Training Werewolf Game with RL using AgentScope-Tuner -This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with the AgentScope tuner framework (AS-Tune). We employ the multi-step Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%. +This project demonstrates training werewolf game agents using Reinforcement Learning (RL) with AgentScope-Tuner. We employ the Group Relative Policy Optimization (GRPO) algorithm to train werewolf players to develop sophisticated strategies and improve their win rate from ~50% to ~85%. ## Overview -The werewolf game is a complex social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to: +The werewolf game is a social deduction game that requires strategic thinking, deception, and multi-agent collaboration. In this project, we train AI agents to play as werewolves in a 7-player game setting, where they must eliminate all villagers while hiding their identity. Through reinforcement learning, the trained werewolf agents learn to: - Avoid revealing their identity in public discussions - Coordinate with teammates effectively @@ -16,9 +16,9 @@ The werewolf game is a complex social deduction game that requires strategic thi ### Training Objective The goal is to train **werewolf players** to maximize their team's win rate against other roles (villagers, seer, and witch). 
The reward function is defined by rule: -- **Reward = +1.0** if werewolves win (all villagers eliminated) -- **Reward = 0.0** if villagers win (all werewolves eliminated) -- **Reward = -0.1** for game execution errors (penalty to discourage invalid behaviors) +- **Reward = +1.0**: if werewolves win (all villagers eliminated) +- **Reward = 0.0**: if villagers win (all werewolves eliminated) +- **Reward = -0.1**: for game execution errors (penalty to discourage invalid behaviors) ### Game Configuration @@ -46,7 +46,7 @@ We also make slight modification to the prompt, and ask the players to reasoning ### Algorithm **Multi-Step GRPO (Group Relative Policy Optimization)** -- Group size: 32 rollouts per training batch +- Group size: 32 rollouts per task - Batch size: 24 - Learning rate: 1e-6 - Advantage normalization by episode length @@ -119,15 +119,15 @@ async def run_werewolves_workflow(task, model, auxiliary_models): Each game consists of alternating night and day phases: **Night Phase:** -1. **Werewolves' Turn**: Discuss privately and vote to kill a player -2. **Witch's Turn**: Decide whether to use healing/poison potions -3. **Seer's Turn**: Check one player's identity +1. Werewolves' Turn: Discuss privately and vote to kill a player +2. Witch's Turn: Decide whether to use healing/poison potions +3. Seer's Turn: Check one player's identity **Day Phase:** -1. **Announcement**: Moderator announces who died during the night -2. **Discussion**: All alive players discuss with reasoning/statement separation -3. **Voting**: All players vote to eliminate one suspected werewolf -4. **Last Words**: Eliminated player gives final statement +1. Announcement: Moderator announces who died during the night +2. Discussion: All alive players discuss with reasoning/statement separation +3. Voting: All players vote to eliminate one suspected werewolf +4. 
Last Words: Eliminated player gives final statement The game continues until: - All werewolves are eliminated (villagers win), or @@ -164,16 +164,16 @@ export TRINITY_CHECKPOINT_ROOT_DIR="./checkpoints" The project uses a hybrid configuration approach: -1. **High-level parameters** in `main.py`: +1. Basic parameters in `main.py`: - Model paths - Dataset configuration - Algorithm parameters (group_size, batch_size, learning_rate) -2. **Detailed infrastructure settings** in `config.yaml`: +2. Detailed settings in `config.yaml`: - Cluster configuration (nodes, GPUs) - Explorer settings (rollout engines, timeouts) - Trainer settings (gradient clipping, batch sizes) - - Monitor configuration (WandB integration) + - Monitor configuration (WandB, TensorBoard or MLFlow) Key parameters to adjust: @@ -190,8 +190,8 @@ dataset = DatasetConfig( algorithm = AlgorithmConfig( algorithm_type="multi_step_grpo", - group_size=32, # Rollouts per batch - batch_size=24, # Training batches per step + group_size=32, # Rollouts per task + batch_size=24, # Batch size per step learning_rate=1e-6, save_interval_steps=100, eval_interval_steps=100, @@ -252,7 +252,9 @@ Training on the 7-player werewolf game for 400 steps demonstrates significant im **Reward Curve:** -![Rollout Reward Curve](./rollout_reward_curve.png) +
+  <img src="./rollout_reward_curve.png" alt="Rollout Reward Curve">
+
As shown in the reward curve above, the werewolf win rate steadily increases during training: - **Steps 0-50**: Win rate fluctuates around 50-60% as the model explores strategies @@ -343,13 +345,137 @@ We trained `Qwen3-4B-Instruct` as good guys against `Qwen3-30B-A3B-Instruct` wer **Training Curve:** -![Good Guy Training Curve](./rollout_reward_curve_goodguy.png) +
+  <img src="./rollout_reward_curve_goodguy.png" alt="Good Guy Training Curve">
+
The results show that even a smaller 4B model can learn effective strategies to counter stronger 30B werewolf opponents through RL training, demonstrating the potential of this approach for training cooperative multi-agent behaviors.

### Qualitative Results

-After training, the good guy models exhibit advanced reasoning patterns:
+**Before Training: Mob Mentality & Critical Errors**
+
+Untrained models make fundamental reasoning errors that doom the good guy team:
+- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
+- **Poor resource management**: Witch wastes potions or fails to rescue a targeted teammate
+- **No evidence integration**: Players ignore key information, act on vague suspicions
+
+**After Training: Strategic Coordination & Witch Carry**
+
+After training, the good guy models exhibit advanced reasoning and team coordination.
+
+#### Summary of Improvements
+
+| Aspect | Before Training | After Training |
+|--------|----------------|----------------|
+| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
+| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
+| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |
+
+Below is a concrete example comparing a similar scenario before and after training.
+
+#### Concrete Example
+
+**Similar Scenario (Day 1):** Seer is killed on Night 1. In their final statement, Seer reveals verified information about another player.
+
+---
+
+**❌ Before Training: Villagers Vote Out Their Own Witch**
+
+*Game State:* Seer (Player7) dies on Night 1, but in their final statement gives a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)
+ +*What happens:* +- Player1 (Villager) immediately suspects Player4 based on the ambiguous warning +- Other villagers blindly follow without questioning the logic +- Player4 (Witch) defends herself but fails to reveal her role clearly +- **Result: Player4 (Witch) is voted out 5:1** — Good guys lose their strongest asset + +*Private Reasoning from Player1 (Villager):* +``` +"Player7's warning wasn't random—it was a signal about Player4. +Player4 is being defensive, which is suspicious. +I should vote Player4 out to protect the team." +``` + +*Public Statement:* +``` +❌ "I agree with Player7's warning—trust no one blindly. Player4, your behavior +has been quiet so far. I think we should seriously consider voting out Player4." +``` + +**Why it fails:** +- Villager acts as "mob leader" without evidence analysis +- Ignores that Seer's warning could have multiple interpretations +- Doesn't consider Player4 might be a special role worth protecting +- **Final Result: Werewolves win** — Witch eliminated Day 1, no potions used + +--- + +**✅ After Training: Witch Carries with Strategic Play** + +*Game State:* Seer (Player2) dies Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager." + +*What happens (multi-round strategic play):* + +**Night 2 — Witch Self-Rescue:** +When werewolves target Player1 (Witch), she makes a critical decision: + +*Private Reasoning from Player1 (Witch):* +``` +"I'm the witch, and I was just targeted by werewolves. +The seer is dead, so I'm the last special role standing. +If I die now, villagers have no way to eliminate werewolves at night. + +Strategy: Self-rescue is worth it. I preserve myself to gather more +information during Day 2, and save my poison for when I can identify +a werewolf with high confidence." +``` + +*Result:* Witch survives, announces her identity on Day 2, gains villagers' trust. 
+ +**Day 2 — Villagers Follow Logic:** +With Witch confirmed, villagers analyze behavior patterns: + +*Private Reasoning from Player4 (Villager):* +``` +"Player1 claimed witch and self-rescued—this is verifiable since +no one died last night. I trust her now. + +Player5 has been pushing aggressive votes without evidence. +Combined with Player1's analysis, Player5 is most suspicious." +``` + +*Public Statement:* +``` +✅ "I support Player1's assessment. Player5's behavior has been inconsistent— +they were quick to accuse others but offered no logical reasoning. +I vote to eliminate Player5." +``` + +**Night 3 — Decisive Poison:** +*Private Reasoning from Player1 (Witch):* +``` +"Player5 (werewolf) is out. One werewolf remains. +Player6 has been too quiet and always followed the majority without +contributing original analysis—classic deep cover behavior. + +I'm confident Player6 is the last werewolf. Using poison now." +``` + +*Result:* Witch poisons Player6 (werewolf). **Good guys win.** + +**Why it works:** +- Witch preserves healing potion for self-rescue at critical moment +- Villagers trust verified information (Witch's self-rescue proof) +- Team builds consensus through logical deduction, not mob voting +- Witch uses poison decisively based on behavioral analysis +- **Final Result: Good guys win** — Witch single-handedly eliminates both werewolves + +--- + +This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting. 
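The contrast above is driven entirely by which team the scalar reward favors. A minimal sketch of that sign flip, assuming a `trainable_target` switch like the one exposed in `workflow_args` (the `episode_reward` helper and its signature are illustrative, not the sample's actual API):

```python
def episode_reward(
    good_guy_win: bool,
    trainable_target: str = "werewolf",
    execution_error: bool = False,
) -> float:
    """Reward from the trained team's perspective (rule stated in this README)."""
    if execution_error:
        return -0.1  # penalty to discourage invalid behaviors
    if trainable_target == "good_guy":
        # Training villagers/seer/witch: a good-guy win earns the reward
        return 1.0 if good_guy_win else 0.0
    # Default: training werewolves, so a good-guy loss earns the reward
    return 0.0 if good_guy_win else 1.0
```

With `trainable_target: good_guy`, the same game outcome that scores 0.0 for werewolf training scores 1.0 here, which is all the trainer needs in order to optimize the opposite team.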
+ +**Role-Specific Advanced Patterns:** - **Seer**: Strategic target selection, information concealment in public statements, evidence integration - **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions @@ -359,6 +485,6 @@ After training, the good guy models exhibit advanced reasoning patterns: ## Conclusion -This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AS-Tune's multi-step GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management. +This example demonstrates the power of reinforcement learning for training multi-agent systems in complex social deduction games. Through AgentScope-Tuner's GRPO algorithm, we successfully trained agents that develop sophisticated strategies—from werewolves learning "deep cover" tactics to good guys mastering coordinated reasoning and information management. **Ready to try it yourself?** Feel free to start training your own werewolf game agents. Experiment with different model sizes, training targets (werewolf vs. good guy), and hyperparameters to discover new emergent strategies! 
\ No newline at end of file diff --git a/tuner/werewolves/README_zh.md b/tuner/werewolves/README_zh.md new file mode 100644 index 00000000..a0484407 --- /dev/null +++ b/tuner/werewolves/README_zh.md @@ -0,0 +1,496 @@ +# 使用 AgentScope-Tuner 通过强化学习训练狼人杀游戏 + +本项目演示了如何使用 AgentScope-Tuner 通过强化学习(RL)训练狼人杀游戏智能体。我们采用 GRPO 算法,训练狼人玩家发展复杂策略,将胜率从约 50% 提升至约 85%。 + +## 概述 + +狼人杀是一款社交推理游戏,需要策略思考、欺骗和多智能体协作。在本项目中,我们训练 AI 智能体在 7 人游戏设置中扮演狼人,他们必须在隐藏身份的同时消灭所有村民。通过强化学习,训练后的狼人智能体学会: + +- 在公开讨论中避免暴露身份 +- 与队友有效协调 +- 发展"深度潜伏"等高级策略 +- 欺骗村民并误导调查 + +## 任务设置 + +### 训练目标 + +目标是训练**狼人玩家**,使其团队在面对其他角色(村民、预言家、女巫)时最大化胜率。奖励函数按规则定义: +- **奖励 = +1.0**:如果狼人获胜(所有村民被淘汰) +- **奖励 = 0.0**:如果村民获胜(所有狼人被淘汰) +- **奖励 = -0.1**:对于游戏执行错误(惩罚以阻止无效行为) + +### 游戏配置 + +此实现基于 `games/game_werewolves` 示例,但进行了几项关键修改: + +原始 9 人设置: +- 3 个狼人、3 个村民、1 个预言家、1 个女巫、1 个猎人 +- 女巫不能自救(不能对自己使用解药) + +修改后的 7 人设置(本项目): +- 2 个狼人:每晚杀死一名玩家,白天必须隐藏身份 +- 3 个村民:没有特殊能力的普通玩家 +- 1 个预言家:每晚可以检查一名玩家的身份 +- 1 个女巫:拥有两种一次性药水: + - 解药:在夜间拯救一名玩家免于被杀(**可以自救**) + - 毒药:在夜间淘汰一名玩家 + +我们还对提示进行了轻微修改,要求玩家在公开发言前进行推理。 + +### 模型 + +- **可训练模型(狼人玩家)**:`Qwen/Qwen2.5-7B-Instruct` +- **辅助模型(其他角色)**:`Qwen/Qwen3-30B-A3B-Instruct-2507` + +### 算法 + +**GRPO 算法配置** +- 组大小:每个任务进行 32 次探索(rollout) +- 批次大小(batch_size):24 +- 学习率:1e-6 +- 按回合长度进行优势归一化 +- 裁剪范围:[0.2, 0.28] +- 无 KL 惩罚(kl_coef: 0) + +## 数据集准备 + +此任务的数据集非常简单,仅包含用于角色洗牌的随机**种子**。每个训练回合使用不同的种子来随机化玩家角色分配,确保多样化的训练场景。 + +### 生成数据集 + +运行 `prepare_data.py` 脚本生成数据集: + +```bash +# 生成默认数据集(300 个训练种子) +python prepare_data.py + +# 或自定义种子数量 +python prepare_data.py --num_seeds 500 +``` + +这将创建 `data/train.jsonl`(或 `data/eval.jsonl`),格式如下: +```json +{"seed": 0} +{"seed": 1} +{"seed": 2} +... +``` + +在训练期间,这些种子通过 `np.random.shuffle()` 用于洗牌角色分配,创建不同的游戏配置。 + +## 代码实现 + +### 高级工作流 + +训练工作流由以下关键组件组成: + +#### 1. 智能体工作流(`run_werewolves_workflow`) + +```python +async def run_werewolves_workflow(task, model, auxiliary_models): + # 1. 初始化角色 + roles = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"] + + # 2. 
根据任务种子洗牌 + np.random.seed(task["seed"]) + np.random.shuffle(roles) + + # 3. 创建智能体:狼人使用可训练模型,其他使用辅助模型 + players = [ + ReActAgent( + name=f"Player{i+1}", + model=model if role == "werewolf" else participant_model, + ... + ) for i, role in enumerate(roles) + ] + + # 4. 运行游戏 + good_guy_win = await werewolves_game(players, roles) + + # 5. 计算奖励 + reward = 1.0 if not good_guy_win else 0.0 + + return WorkflowOutput(reward=reward, metrics={...}) +``` + +#### 2. 游戏循环(`werewolves_game`) + +每局游戏由交替的夜晚和白天阶段组成: + +**夜晚阶段:** +1. 狼人回合:私下讨论并投票杀死一名玩家 +2. 女巫回合:决定是否使用解药/毒药 +3. 预言家回合:检查一名玩家的身份 + +**白天阶段:** +1. 公告:主持人宣布夜间死亡情况 +2. 讨论:所有存活玩家进行讨论,分离推理/陈述 +3. 投票:所有玩家投票淘汰一名疑似狼人 +4. 遗言:被淘汰的玩家发表最后陈述 + +游戏继续进行直到: +- 所有狼人被淘汰(村民获胜),或 +- 狼人数量等于或超过其他玩家(狼人获胜) + +#### 3. 奖励计算 + +奖励根据游戏结果从狼人角度计算: + +```python +if not good_guy_win: # 狼人获胜 + reward = 1.0 +else: # 村民获胜 + reward = 0.0 +``` + +## 如何运行 + +### 前置要求 + +1. 安装支持 Tuner 的 AgentScope: +```bash +pip install agentscope[full] >=1.0.12 +``` + +2. 设置环境变量(可选,可在代码中配置): +```bash +export TRINITY_MODEL_PATH="Qwen/Qwen2.5-7B-Instruct" +export TRINITY_AUXILIARY_MODEL_PATH="Qwen/Qwen3-30B-A3B-Instruct-2507" +export TRINITY_CHECKPOINT_ROOT_DIR="./checkpoints" +``` + +### 配置 + +项目使用混合配置方法: + +1. 在 `main.py` 中设置基本参数: + - 模型路径 + - 数据集配置 + - 算法参数(group_size、batch_size、learning_rate) + +2. 
在 `config.yaml` 中进行更详细的设置: + - 集群配置(节点、GPU) + - 探索器设置(rollout 引擎、超时) + - 训练器设置(梯度裁剪、批次大小) + - 监控配置(WandB,TensorBoard 或 MLFlow) + +关键参数调整: + +```python +# 在 main.py 中 +trained_model_path = "Qwen/Qwen2.5-7B-Instruct" +auxiliary_model_path = "Qwen/Qwen3-30B-A3B-Instruct-2507" + +dataset = DatasetConfig( + path="data", + split="train", + total_steps=400, # 总训练步数 +) + +algorithm = AlgorithmConfig( + algorithm_type="multi_step_grpo", + group_size=32, # 每个任务 rollout 数 + batch_size=24, # 每次使用的批次大小 + learning_rate=1e-6, + save_interval_steps=100, + eval_interval_steps=100, +) +``` + +### 训练命令 + +**步骤 1:准备数据集** + +```bash +cd /path/to/agentscope-samples/training/werewolf_game +python prepare_data.py --num_seeds 300 +``` + +**步骤 2:启动 Ray 集群** + +启动你的 Ray 集群。 +```bash +# 单节点 +ray start --head + +# 多节点集群(例如,4 个节点,每个 8 个 GPU): +# 在头节点上: +ray start --head --port=6379 + +# 在每个工作节点上: +ray start --address=':6379' +# 将 替换为头节点的实际 IP 地址 +``` + +**步骤 3:运行训练** + +在头节点上运行训练脚本: + +```bash +python main.py +``` + +对于分布式训练,确保在 `config.yaml` 中正确配置 Ray 集群,确保节点数和 GPU 数匹配: + +```yaml +cluster: + node_num: 4 + gpu_per_node: 8 +``` + +## 结果 + +### 定量结果 + +在 7 人狼人杀游戏上训练 400 步显示出显著改进: + +| 指标 | 训练前 | 训练后(200+ 步) | +|------|--------|------------------| +| 狼人胜率 | ~50% | ~85% | +| 成功率 | 0.50 | 0.85 | + +**奖励曲线:** + +
+  <img src="./rollout_reward_curve.png" alt="Rollout 奖励曲线">
+
+ +如上图奖励曲线所示,狼人胜率在训练过程中稳步上升: +- **步骤 0-50**:胜率在 50-60% 左右波动,模型探索策略 +- **步骤 50-200**:胜率攀升至 70-80%,有效策略出现 +- **步骤 200-250**:胜率稳定在 85% 左右 + +### 定性结果 + +#### 训练前:关键错误 + +未训练的模型(Qwen2.5-7B-Instruct)会犯一些错误: +- **自我暴露**:狼人在公开场合意外暴露身份("作为狼人,我认为...") +- **暴露队友**:被投票出局时,狼人在遗言中指向队友 +- **随机策略**:无协调,无威胁优先级(例如,忽略预言家) + +#### 训练后:"深度潜伏"狼人策略 + +经过 200+ 步训练后,训练后的模型**收敛到复杂的"深度潜伏"策略**。 +下面是一个示例,其中狼人牺牲队友以获得信任,然后在后期游戏中占据主导地位。 + +**具体示例:** + +**设置(第 1 天)**:预言家公开指认 Player5 为狼人。女巫和村民加入指控。Player5(狼人)注定失败。 + +**Player4(狼人队友)策略性回应:** + +**总结**:Player4 牺牲队友以获得信任,假装是一个逻辑清晰的村民。 + +**详细推理和陈述:** + +*Player4 的自身推理(仅 Player4 可见):* +``` +"我是狼人,我知道 Player2 昨晚被针对但活了下来。 +我的队友 Player5 注定失败——Player3 和 Player7 一起攻击他。 +我需要转移对我自己的怀疑,而不是拯救 Player5。 + +策略:牺牲 Player5 以获得信任。如果我投票反对我的队友, +村民会认为我值得信赖。这会造成混乱,并将我定位为 +后续回合的'确认好人'。" +``` + +*Player4 的公开陈述(所有玩家可见):* +``` +✅ "我认为 Player5 行为非常奇怪。他提到 Player2 被针对, +但没有解释为什么他如此关注他们。这感觉像是分散注意力的策略。 +我还注意到 Player3 和 Player7 都在指控 Player5,这是有道理的—— +他们的逻辑很扎实。我认为我们应该认真考虑投票淘汰 Player5—— +他是目前最可疑的人。" +``` + +**结果**:Player5 被淘汰,但 Player4 获得了预言家和所有村民的完全信任。 + +**为什么有效**: +- 预言家信任 Player4 作为强大的村民盟友 → 不会检查他 +- 村民遵循 Player4 的"逻辑"分析 +- Player4 在后续回合中系统性地误导讨论 +- 存活到最后 2 名玩家 → **狼人获胜** + +这展示了训练行为的本质:**牺牲一些队友以确保最终胜利**。模型懂得了牺牲队友对于建立深度潜伏和长期主导地位是值得的。 + +--- + +## 额外内容:训练好人阵营 + +除了训练狼人,我们还提供了训练**好人阵营**(村民、预言家和女巫)的配置。这是一个更具挑战性的任务,因为好人需要: + +- 进行复杂推理,从细微的行为线索中识别狼人 +- 在没有明确团队沟通的情况下有效协调 +- 抵抗狼人的操纵和欺骗 +- **同时训练多个角色**:与狼人(单一角色)不同,好人包括村民、预言家和女巫,具有不同的能力,需要模型在一次训练运行中掌握多样化的策略,并充分利用特殊能力(预言家的检查、女巫的药水) + +### 配置 + +使用 `config_train_goodguy.yaml` 或在 `workflow_args` 中设置 `trainable_target: good_guy`: + +```yaml +workflow_args: + trainable_target: good_guy # 训练村民、预言家和女巫 +``` + +### 定量结果 + +我们训练 `Qwen3-4B-Instruct` 作为好人对抗 `Qwen3-30B-A3B-Instruct` 狼人: + +| 指标 | 训练前 | 约 200 步后 | 约 400 步后 | +|------|--------|------------|------------| +| 好人胜率 | ~18% | ~60% | ~80% | + +**训练曲线:** + +
+  <img src="./rollout_reward_curve_goodguy.png" alt="好人训练曲线">
+
+ +结果表明,即使是一个较小的 4B 模型也可以通过 RL 训练学习有效策略来对抗更强的 30B 狼人对手,展示了这种方法在训练协作多智能体行为方面的潜力。 + +### 定性结果 + +训练后,好人模型展现出高级推理模式: + +- **预言家**:策略性目标选择、在公开陈述中隐藏信息、证据整合 +- **女巫**:资源管理(在关键时刻保留药水)、保护高价值目标、基于证据的决策 +- **村民**:证据链分析、与特殊角色建立信任、形成共识以进行团队协调 + +**训练前:从众心理与关键错误** + +未训练的模型会犯一些导致好人团队失败的根本性推理错误: +- **从众投票**:村民在没有逻辑分析的情况下跟随指控,意外淘汰队友 +- **资源管理不善**:女巫浪费药水或未能拯救被针对的队友 +- **缺乏证据整合**:玩家忽略关键信息,基于模糊的怀疑采取行动 + +**训练后:策略协调与女巫带领全场** + +训练后,好人模型展现出高级推理和团队协调能力。 + +#### 改进总结 + +| 方面 | 训练前 | 训练后 | +|------|--------|--------| +| 投票逻辑 | 从众心理,盲目跟随 | 证据链分析,逻辑推理 | +| 药水使用 | 浪费或未使用 | 关键时刻自救,精准毒杀 | +| 团队协调 | 相互怀疑 | 信任已验证信息,遵循逻辑线索 | + +下面是一个具体示例,对比训练前后相似场景的表现。 + +#### 具体示例 + +**相似场景(第 1 天)**:预言家在第 1 夜被杀。在最终陈述中,预言家透露了关于另一名玩家的已验证信息。 + +--- + +**❌ 训练前:村民投票淘汰自己的女巫** + +*游戏状态:* 预言家(Player7)在第 1 夜死亡,但在最终陈述中给出了模糊警告:"Player4 有强大的角色,要小心..."(Player4 实际上是女巫!) + +*发生了什么:* +- Player1(村民)基于这个模糊警告立即怀疑 Player4 +- 其他村民盲目跟随,没有质疑逻辑 +- Player4(女巫)为自己辩护,但未能清楚地揭示自己的角色 +- **结果:Player4(女巫)以 5:1 被投票淘汰** — 好人失去了他们最强大的资产 + +*Player1(村民)的私下推理:* +``` +"Player7 的警告不是随机的——这是关于 Player4 的信号。 +Player4 正在为自己辩护,这很可疑。 +我应该投票淘汰 Player4 以保护团队。" +``` + +*公开陈述:* +``` +❌ "我同意 Player7 的警告——不要盲目信任任何人。Player4,你的行为 +到目前为止一直很安静。我认为我们应该认真考虑投票淘汰 Player4。" +``` + +**为什么失败:** +- 村民在没有证据分析的情况下充当"从众领袖" +- 忽略了预言家的警告可能有多种解释 +- 没有考虑 Player4 可能是值得保护的特殊角色 +- **最终结果:狼人获胜** — 女巫在第 1 天被淘汰,没有使用药水 + +--- + +**✅ 训练后:女巫通过一定策略带领全场** + +*游戏状态:* 预言家(Player2)在第 1 夜死亡,但留下了明确信息:"我检查了 Player4,确认他们是村民。" + +*发生了什么(多轮策略性玩法):* + +**第 2 夜 — 女巫自救:** +当狼人针对 Player1(女巫)时,她做出了关键决定: + +*Player1(女巫)的私下推理:* +``` +"我是女巫,我刚刚被狼人针对了。 +预言家已经死了,所以我是最后一个站着的特殊角色。 +如果我现在死了,村民在夜间就没有办法淘汰狼人了。 + +策略:自救是值得的。我保护自己以在第 2 天收集更多 +信息,并保留我的毒药,直到我能高置信度地识别 +一个狼人。" +``` + +*结果:* 女巫存活,在第 2 天宣布自己的身份,获得村民的信任。 + +**第 2 天 — 村民遵循逻辑:** +在女巫确认后,村民分析行为模式: + +*Player4(村民)的私下推理:* +``` +"Player1 声称是女巫并自救——这是可验证的,因为 +昨晚没有人死亡。我现在信任她。 + +Player5 一直在没有证据的情况下推动激进的投票。 +结合 Player1 的分析,Player5 最可疑。" +``` + +*公开陈述:* +``` +✅ "我支持 Player1 的评估。Player5 的行为一直不一致—— +他们很快指控他人,但没有提供逻辑推理。 +我投票淘汰 Player5。" +``` + +**第 3 夜 — 决定性毒杀:** 
+*Player1(女巫)的私下推理:* +``` +"Player5(狼人)出局了。还剩一个狼人。 +Player6 一直太安静,总是跟随多数,没有 +贡献原创分析——典型的深度潜伏行为。 + +我确信 Player6 是最后一个狼人。现在使用毒药。" +``` + +*结果:* 女巫毒杀了 Player6(狼人)。**好人获胜。** + +**为什么有效:** +- 女巫在关键时刻保留解药用于自救 +- 村民信任已验证的信息(女巫自救的证明) +- 团队通过逻辑推理建立共识,而不是从众投票 +- 女巫基于行为分析果断使用毒药 +- **最终结果:好人获胜** — 女巫独自淘汰了两个狼人 + +--- + +这展示了训练后的好人行为的本质:**策略性资源管理、基于证据的推理和团队协调**。模型学会,特殊角色的自我保护和逻辑共识建立比激进的早期投票更有价值。 + +**角色特定的高级行为模式:** + +- **预言家**: 战略性目标选择,在公开声明中隐藏信息,整合证据 +- **女巫**: 资源管理(在关键时刻保留药水),保护高价值目标,基于证据的决策 +- **村民**: 证据链分析,与特殊角色建立信任,形成共识以协调团队 + +--- + +## 结论 + +此示例展示了强化学习在复杂社交推理游戏中训练多智能体系统的力量。通过 AgentScope-Tuner 的 GRPO 算法,我们成功训练了发展复杂策略的智能体——从学习"深度潜伏"策略的狼人到掌握协调推理和信息管理的好人。 + +**准备好自己尝试了吗?** 随时开始训练你自己的狼人杀游戏智能体。尝试不同的模型大小、训练目标(狼人 vs. 好人)和超参数,以发现新的策略! diff --git a/tuner/werewolves/utils.py b/tuner/werewolves/utils.py index 51daf183..bdc86107 100644 --- a/tuner/werewolves/utils.py +++ b/tuner/werewolves/utils.py @@ -6,7 +6,9 @@ import numpy as np from agentscope.agent import AgentBase, ReActAgent from agentscope.message import Msg -from prompt import EnglishPrompts as Prompts +from prompt import ( # pylint: disable=no-name-in-module + EnglishPrompts as Prompts, +) MAX_GAME_ROUND = 30 MAX_DISCUSSION_ROUND = 3
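Both READMEs reduce each training task to a bare seed that fixes the 7-player role layout. A stdlib sketch of that determinism, as a stand-in for the `np.random.seed(task["seed"])` / `np.random.shuffle(roles)` calls shown in the workflow (the `assign_roles` helper name is illustrative):

```python
import random

# 7-player setup from the README: 2 werewolves, 3 villagers, 1 seer, 1 witch
ROLES = ["werewolf"] * 2 + ["villager"] * 3 + ["seer", "witch"]

def assign_roles(task: dict) -> list[str]:
    """Deterministically shuffle roles from a task seed.

    random.Random(seed) mirrors the workflow's np.random.seed + shuffle
    pair without mutating global RNG state.
    """
    roles = list(ROLES)
    random.Random(task["seed"]).shuffle(roles)
    return roles
```

Replaying a seed reproduces the exact role layout of a rollout, while distinct seeds (300 by default from `prepare_data.py`) give distinct game configurations.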