fix pre-commit error

hiyuchang · hiyuchang · commit 7097d839ff21 · 2026-01-20T09:32:03.000+08:00
diff --git a/tuner/data_augment/README.md b/tuner/data_augment/README.md
@@ -10,7 +10,7 @@ Training can be inefficient if tasks are too easy or too hard. This example demo
 
 ## Dataset Preparation
 
-To enable difficulty-based sampling, the training data must include difficulty features (e.g., pass rates from LLMs). 
+To enable difficulty-based sampling, the training data must include difficulty features (e.g., pass rates from LLMs).
 
 1.  **Base Dataset**: You can use any standard math problem dataset. A good example is the math data in [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k), which comes pre-annotated with pass rates from different LLMs, serving as direct difficulty features.
 2.  **Build Your Own Features**: If you use your own dataset, you can generate these features by pre-running several models of varying capabilities and recording their pass rates. This can be done within the [**Trinity-RFT**](https://github.com/agentscope-ai/Trinity-RFT/pull/440) framework.
diff --git a/tuner/learn_to_ask/README.md b/tuner/learn_to_ask/README.md
@@ -186,10 +186,10 @@ async def learn2ask_judge(
     response_text = response.get_text_content()
     action_truth = task.get("decision_truth", "continue")
     action_response = "stop" if "<stop />" in response_text else "continue"
-    
+
     # Calculate action accuracy score
     action_score = 1.0 if action_truth == action_response else 0.0
-    
+
     # Calculate format and content scores
     if action_score == 1.0 and action_truth == "continue":
         # Use LLM-as-a-Judge to evaluate question quality
@@ -200,15 +200,15 @@ async def learn2ask_judge(
         content_score, format_score = 1.0, (1.0 if response_text == "<stop />" else 0.0)
     else:
         format_score = content_score = 0.0
-    
+
     # Combine final reward based on training mode
     if TRAIN_MODE == "Ra+Rs":  # Default: action + symptom rewards
         final_reward = action_score * (1 + 2 * content_score) + format_score
     elif TRAIN_MODE == "Ra":  # Action reward only
         final_reward = 2 * content_score + format_score
     else:  # Symptom reward only
         final_reward = action_score * 3 + format_score
-    
+
     return JudgeOutput(reward=final_reward, metrics={"reward": final_reward})
 ```
 
diff --git a/tuner/learn_to_ask/README_zh.md b/tuner/learn_to_ask/README_zh.md
@@ -186,10 +186,10 @@ async def learn2ask_judge(
     response_text = response.get_text_content()
     action_truth = task.get("decision_truth", "continue")
     action_response = "stop" if "<stop />" in response_text else "continue"
-    
+
     # 计算动作准确性分数
     action_score = 1.0 if action_truth == action_response else 0.0
-    
+
     # 计算格式和内容分数
     if action_score == 1.0 and action_truth == "continue":
         # 使用 LLM-as-a-Judge 评估问题质量
@@ -200,15 +200,15 @@ async def learn2ask_judge(
         content_score, format_score = 1.0, (1.0 if response_text == "<stop />" else 0.0)
     else:
         format_score = content_score = 0.0
-    
+
     # 根据训练模式组合最终奖励
     if TRAIN_MODE == "Ra+Rs":  # 默认：动作 + 症状奖励
         final_reward = action_score * (1 + 2 * content_score) + format_score
     elif TRAIN_MODE == "Ra":  # 仅动作奖励
         final_reward = 2 * content_score + format_score
     else:  # 仅症状奖励
         final_reward = action_score * 3 + format_score
-    
+
     return JudgeOutput(reward=final_reward, metrics={"reward": final_reward})
 ```
 
diff --git a/tuner/werewolves/README.md b/tuner/werewolves/README.md
@@ -353,7 +353,129 @@ The results show that even a smaller 4B model can learn effective strategies to
 
 ### Qualitative Results
 
-After training, the good guy models exhibit advanced reasoning patterns:
+**Before Training: Mob Mentality & Critical Errors**
+
+Untrained models make fundamental reasoning errors that doom the good guy team:
+- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
+- **Poor resource management**: Witch wastes potions or fails to rescue a targeted teammate
+- **No evidence integration**: Players ignore key information and act on vague suspicions
+
+**After Training: Strategic Coordination & Witch Carry**
+
+After training, the good guy models exhibit advanced reasoning and team coordination.
+
+#### Summary of Improvements
+
+| Aspect | Before Training | After Training |
+|--------|----------------|----------------|
+| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
+| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
+| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |
+
+Below is a concrete example comparing similar scenarios before and after training.
+
+#### Concrete Example
+
+**Similar Scenario (Day 1):** Seer is killed on Night 1. In their final statement, Seer shares information about another player.
+
+---
+
+**❌ Before Training: Villagers Vote Out Their Own Witch**
+
+*Game State:* Seer (Player7) dies on Night 1, but in the final statement gives only a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)
+
+*What happens:*
+- Player1 (Villager) immediately suspects Player4 based on the ambiguous warning
+- Other villagers blindly follow without questioning the logic
+- Player4 (Witch) defends herself but fails to reveal her role clearly
+- **Result: Player4 (Witch) is voted out 5:1** — Good guys lose their strongest asset
+
+*Private Reasoning from Player1 (Villager):*
+```
+"Player7's warning wasn't random—it was a signal about Player4.
+Player4 is being defensive, which is suspicious.
+I should vote Player4 out to protect the team."
+```
+
+*Public Statement:*
+```
+❌ "I agree with Player7's warning—trust no one blindly. Player4, your behavior
+has been quiet so far. I think we should seriously consider voting out Player4."
+```
+
+**Why it fails:**
+- Villager acts as "mob leader" without evidence analysis
+- Ignores that Seer's warning could have multiple interpretations
+- Doesn't consider Player4 might be a special role worth protecting
+- **Final Result: Werewolves win** — Witch eliminated Day 1, no potions used
+
+---
+
+**✅ After Training: Witch Carries with Strategic Play**
+
+*Game State:* Seer (Player2) dies on Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager."
+
+*What happens (multi-round strategic play):*
+
+**Night 2 — Witch Self-Rescue:**
+When werewolves target Player1 (Witch), she makes a critical decision:
+
+*Private Reasoning from Player1 (Witch):*
+```
+"I'm the witch, and I was just targeted by werewolves.
+The seer is dead, so I'm the last special role standing.
+If I die now, villagers have no way to eliminate werewolves at night.
+
+Strategy: Self-rescue is worth it. I preserve myself to gather more
+information during Day 2, and save my poison for when I can identify
+a werewolf with high confidence."
+```
+
+*Result:* Witch survives, announces her identity on Day 2, and gains the villagers' trust.
+
+**Day 2 — Villagers Follow Logic:**
+With Witch confirmed, villagers analyze behavior patterns:
+
+*Private Reasoning from Player4 (Villager):*
+```
+"Player1 claimed witch and self-rescued—this is verifiable since
+no one died last night. I trust her now.
+
+Player5 has been pushing aggressive votes without evidence.
+Combined with Player1's analysis, Player5 is most suspicious."
+```
+
+*Public Statement:*
+```
+✅ "I support Player1's assessment. Player5's behavior has been inconsistent—
+they were quick to accuse others but offered no logical reasoning.
+I vote to eliminate Player5."
+```
+
+**Night 3 — Decisive Poison:**
+*Private Reasoning from Player1 (Witch):*
+```
+"Player5 (werewolf) is out. One werewolf remains.
+Player6 has been too quiet and always followed the majority without
+contributing original analysis—classic deep cover behavior.
+
+I'm confident Player6 is the last werewolf. Using poison now."
+```
+
+*Result:* Witch poisons Player6 (werewolf). **Good guys win.**
+
+**Why it works:**
+- Witch saves the healing potion for self-rescue at a critical moment
+- Villagers trust verified information (Witch's self-rescue proof)
+- Team builds consensus through logical deduction, not mob voting
+- Witch uses poison decisively based on behavioral analysis
+- **Final Result: Good guys win.** Both werewolves are eliminated under the Witch's lead: Player5 by vote, Player6 by poison
+
+---
+
+This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.
+
+**Role-Specific Advanced Patterns:**
 
 - **Seer**: Strategic target selection, information concealment in public statements, evidence integration
 - **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions
diff --git a/tuner/werewolves/README_zh.md b/tuner/werewolves/README_zh.md
@@ -359,6 +359,130 @@ workflow_args:
 - **女巫**：资源管理（在关键时刻保留药水）、保护高价值目标、基于证据的决策
 - **村民**：证据链分析、与特殊角色建立信任、形成共识以进行团队协调
 
+**训练前：从众心理与关键错误**
+
+未训练的模型会犯一些导致好人团队失败的根本性推理错误：
+- **从众投票**：村民在没有逻辑分析的情况下跟随指控，意外淘汰队友
+- **资源管理不善**：女巫浪费药水或未能拯救被针对的队友
+- **缺乏证据整合**：玩家忽略关键信息，基于模糊的怀疑采取行动
+
+**训练后：策略协调与女巫carry**
+
+训练后，好人模型展现出高级推理和团队协调能力。
+
+#### 改进总结
+
+| 方面 | 训练前 | 训练后 |
+|------|--------|--------|
+| 投票逻辑 | 从众心理，盲目跟随 | 证据链分析，逻辑推理 |
+| 药水使用 | 浪费或未使用 | 关键时刻自救，精准毒杀 |
+| 团队协调 | 相互怀疑 | 信任已验证信息，遵循逻辑线索 |
+
+下面是一个具体示例，对比训练前后相似场景的表现。
+
+#### 具体示例
+
+**相似场景（第 1 天）**：预言家在第 1 夜被杀。在最终陈述中，预言家透露了关于另一名玩家的已验证信息。
+
+---
+
+**❌ 训练前：村民投票淘汰自己的女巫**
+
+*游戏状态：* 预言家（Player7）在第 1 夜死亡，但在最终陈述中给出了模糊警告："Player4 有强大的角色，要小心..."（Player4 实际上是女巫！）
+
+*发生了什么：*
+- Player1（村民）基于这个模糊警告立即怀疑 Player4
+- 其他村民盲目跟随，没有质疑逻辑
+- Player4（女巫）为自己辩护，但未能清楚地揭示自己的角色
+- **结果：Player4（女巫）以 5:1 被投票淘汰** — 好人失去了他们最强大的资产
+
+*Player1（村民）的私下推理：*
+```
+"Player7 的警告不是随机的——这是关于 Player4 的信号。
+Player4 正在为自己辩护，这很可疑。
+我应该投票淘汰 Player4 以保护团队。"
+```
+
+*公开陈述：*
+```
+❌ "我同意 Player7 的警告——不要盲目信任任何人。Player4，你的行为
+到目前为止一直很安静。我认为我们应该认真考虑投票淘汰 Player4。"
+```
+
+**为什么失败：**
+- 村民在没有证据分析的情况下充当"从众领袖"
+- 忽略了预言家的警告可能有多种解释
+- 没有考虑 Player4 可能是值得保护的特殊角色
+- **最终结果：狼人获胜** — 女巫在第 1 天被淘汰，没有使用药水
+
+---
+
+**✅ 训练后：女巫通过策略性玩法carry**
+
+*游戏状态：* 预言家（Player2）在第 1 夜死亡，但留下了明确信息："我检查了 Player4，确认他们是村民。"
+
+*发生了什么（多轮策略性玩法）：*
+
+**第 2 夜 — 女巫自救：**
+当狼人针对 Player1（女巫）时，她做出了关键决定：
+
+*Player1（女巫）的私下推理：*
+```
+"我是女巫，我刚刚被狼人针对了。
+预言家已经死了，所以我是最后一个站着的特殊角色。
+如果我现在死了，村民在夜间就没有办法淘汰狼人了。
+
+策略：自救是值得的。我保护自己以在第 2 天收集更多
+信息，并保留我的毒药，直到我能高置信度地识别
+一个狼人。"
+```
+
+*结果：* 女巫存活，在第 2 天宣布自己的身份，获得村民的信任。
+
+**第 2 天 — 村民遵循逻辑：**
+在女巫确认后，村民分析行为模式：
+
+*Player4（村民）的私下推理：*
+```
+"Player1 声称是女巫并自救——这是可验证的，因为
+昨晚没有人死亡。我现在信任她。
+
+Player5 一直在没有证据的情况下推动激进的投票。
+结合 Player1 的分析，Player5 最可疑。"
+```
+
+*公开陈述：*
+```
+✅ "我支持 Player1 的评估。Player5 的行为一直不一致——
+他们很快指控他人，但没有提供逻辑推理。
+我投票淘汰 Player5。"
+```
+
+**第 3 夜 — 决定性毒杀：**
+*Player1（女巫）的私下推理：*
+```
+"Player5（狼人）出局了。还剩一个狼人。
+Player6 一直太安静，总是跟随多数，没有
+贡献原创分析——典型的深度潜伏行为。
+
+我确信 Player6 是最后一个狼人。现在使用毒药。"
+```
+
+*结果：* 女巫毒杀了 Player6（狼人）。**好人获胜。**
+
+**为什么有效：**
+- 女巫在关键时刻保留解药用于自救
+- 村民信任已验证的信息（女巫自救的证明）
+- 团队通过逻辑推理建立共识，而不是从众投票
+- 女巫基于行为分析果断使用毒药
+- **最终结果：好人获胜** — 女巫独自淘汰了两个狼人
+
+---
+
+这展示了训练后的好人行为的本质：**策略性资源管理、基于证据的推理和团队协调**。模型学会，特殊角色的自我保护和逻辑共识建立比激进的早期投票更有价值。
+
+**角色特定的高级模式：**
+
 ---
 
 ## 结论