Commit 7097d83

fix pre-commit error
1 parent 94b9fd4 commit 7097d83

5 files changed: 256 additions and 10 deletions

tuner/data_augment/README.md

Lines changed: 1 addition & 1 deletion

@@ -10,7 +10,7 @@ Training can be inefficient if tasks are too easy or too hard. This example demo

 ## Dataset Preparation

-To enable difficulty-based sampling, the training data must include difficulty features (e.g., pass rates from LLMs).
+To enable difficulty-based sampling, the training data must include difficulty features (e.g., pass rates from LLMs).

 1. **Base Dataset**: You can use any standard math problem dataset. A good example is the math data in [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k), which comes pre-annotated with pass rates from different LLMs, serving as direct difficulty features.
 2. **Build Your Own Features**: If you use your own dataset, you can generate these features by pre-running several models of varying capabilities and recording their pass rates. This can be done within the [**Trinity-RFT**](https://github.com/agentscope-ai/Trinity-RFT/pull/440) framework.
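
As a rough illustration of the "Build Your Own Features" step described in the hunk above, the sketch below turns recorded attempt outcomes into per-model pass rates. The function name `pass_rate_features`, the `(task_id, model_name, passed)` input layout, and the `pass_rate_<model>` key naming are all hypothetical; they are not taken from Trinity-RFT or from the guru-RL-92k schema.

```python
from collections import defaultdict


def pass_rate_features(attempts):
    """Compute per-task, per-model pass rates.

    `attempts` is an iterable of (task_id, model_name, passed) tuples.
    This layout is an assumption made for the sketch only.
    """
    # (task_id, model_name) -> [number of passes, number of attempts]
    counts = defaultdict(lambda: [0, 0])
    for task_id, model_name, passed in attempts:
        entry = counts[(task_id, model_name)]
        entry[0] += int(bool(passed))
        entry[1] += 1

    # task_id -> {"pass_rate_<model>": fraction of attempts that passed}
    features = defaultdict(dict)
    for (task_id, model_name), (passes, total) in counts.items():
        features[task_id][f"pass_rate_{model_name}"] = passes / total
    return dict(features)
```

A task that a weaker model rarely solves and a stronger model usually solves would get a low rate for the first and a high rate for the second, which is the kind of signal a difficulty-based sampler can use.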

tuner/learn_to_ask/README.md

Lines changed: 4 additions & 4 deletions

@@ -186,10 +186,10 @@ async def learn2ask_judge(
     response_text = response.get_text_content()
     action_truth = task.get("decision_truth", "continue")
     action_response = "stop" if "<stop />" in response_text else "continue"
-
+
     # Calculate action accuracy score
     action_score = 1.0 if action_truth == action_response else 0.0
-
+
     # Calculate format and content scores
     if action_score == 1.0 and action_truth == "continue":
         # Use LLM-as-a-Judge to evaluate question quality
@@ -200,15 +200,15 @@ async def learn2ask_judge(
         content_score, format_score = 1.0, (1.0 if response_text == "<stop />" else 0.0)
     else:
         format_score = content_score = 0.0
-
+
     # Combine final reward based on training mode
     if TRAIN_MODE == "Ra+Rs":  # Default: action + symptom rewards
         final_reward = action_score * (1 + 2 * content_score) + format_score
     elif TRAIN_MODE == "Ra":  # Action reward only
         final_reward = 2 * content_score + format_score
     else:  # Symptom reward only
         final_reward = action_score * 3 + format_score
-
+
     return JudgeOutput(reward=final_reward, metrics={"reward": final_reward})
 ```


tuner/learn_to_ask/README_zh.md

Lines changed: 4 additions & 4 deletions

@@ -186,10 +186,10 @@ async def learn2ask_judge(
     response_text = response.get_text_content()
     action_truth = task.get("decision_truth", "continue")
     action_response = "stop" if "<stop />" in response_text else "continue"
-
+
     # Calculate action accuracy score
     action_score = 1.0 if action_truth == action_response else 0.0
-
+
     # Calculate format and content scores
     if action_score == 1.0 and action_truth == "continue":
         # Use LLM-as-a-Judge to evaluate question quality
@@ -200,15 +200,15 @@ async def learn2ask_judge(
         content_score, format_score = 1.0, (1.0 if response_text == "<stop />" else 0.0)
     else:
         format_score = content_score = 0.0
-
+
     # Combine final reward based on training mode
     if TRAIN_MODE == "Ra+Rs":  # Default: action + symptom rewards
         final_reward = action_score * (1 + 2 * content_score) + format_score
     elif TRAIN_MODE == "Ra":  # Action reward only
         final_reward = 2 * content_score + format_score
     else:  # Symptom reward only
         final_reward = action_score * 3 + format_score
-
+
     return JudgeOutput(reward=final_reward, metrics={"reward": final_reward})
 ```

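
To make the reward arithmetic in the two `learn_to_ask` hunks above easier to check by hand, here is a minimal standalone sketch of the final combination step. The function name `combine_reward` and its signature are assumptions made for illustration; the three branches mirror the formulas exactly as they appear in the hunks, with the branch labels copied from the README comments.

```python
def combine_reward(train_mode, action_score, content_score, format_score):
    """Combine the three partial scores into one scalar reward.

    Hypothetical helper: it only reproduces the branch formulas shown
    in the hunks and is not part of the README's judge function.
    """
    if train_mode == "Ra+Rs":
        # Labeled "Default: action + symptom rewards" in the README.
        return action_score * (1 + 2 * content_score) + format_score
    if train_mode == "Ra":
        # Labeled "Action reward only" in the README.
        return 2 * content_score + format_score
    # Any other mode, labeled "Symptom reward only" in the README.
    return action_score * 3 + format_score


# Worked values for the default mode: a correct action with full
# content and format scores gives 1 * (1 + 2 * 1) + 1 = 4.
best_case = combine_reward("Ra+Rs", 1.0, 1.0, 1.0)
# A wrong action zeroes the first term, leaving only the format score.
wrong_action = combine_reward("Ra+Rs", 0.0, 1.0, 0.0)
```

In the default mode the action score acts as a gate: the content score contributes only when the stop/continue decision was correct.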

tuner/werewolves/README.md

Lines changed: 123 additions & 1 deletion

@@ -353,7 +353,129 @@ The results show that even a smaller 4B model can learn effective strategies to

 ### Qualitative Results

-After training, the good guy models exhibit advanced reasoning patterns:
+**Before Training: Mob Mentality & Critical Errors**
+
+Untrained models make fundamental reasoning errors that doom the good guy team:
+- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
+- **Poor resource management**: Witch wastes potions or fails to rescue a targeted teammate
+- **No evidence integration**: Players ignore key information and act on vague suspicions
+
+**After Training: Strategic Coordination & Witch Carry**
+
+After training, the good guy models exhibit advanced reasoning and team coordination.
+
+#### Summary of Improvements
+
+| Aspect | Before Training | After Training |
+|--------|----------------|----------------|
+| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
+| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
+| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |
+
+Below is a concrete example comparing a similar scenario before and after training.
+
+#### Concrete Example
+
+**Similar Scenario (Day 1):** Seer is killed on Night 1. In their final statement, Seer reveals information about another player.
+
+---
+
+**❌ Before Training: Villagers Vote Out Their Own Witch**
+
+*Game State:* Seer (Player7) dies on Night 1, but in the final statement gives a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)
+
+*What happens:*
+- Player1 (Villager) immediately suspects Player4 based on the ambiguous warning
+- Other villagers blindly follow without questioning the logic
+- Player4 (Witch) defends herself but fails to reveal her role clearly
+- **Result: Player4 (Witch) is voted out 5:1**, so the good guys lose their strongest asset
+
+*Private Reasoning from Player1 (Villager):*
+```
+"Player7's warning wasn't random; it was a signal about Player4.
+Player4 is being defensive, which is suspicious.
+I should vote Player4 out to protect the team."
+```
+
+*Public Statement:*
+```
+❌ "I agree with Player7's warning: trust no one blindly. Player4, your behavior
+has been quiet so far. I think we should seriously consider voting out Player4."
+```
+
+**Why it fails:**
+- Villager acts as "mob leader" without evidence analysis
+- Ignores that Seer's warning could have multiple interpretations
+- Doesn't consider Player4 might be a special role worth protecting
+- **Final Result: Werewolves win.** Witch eliminated on Day 1, no potions used
+
+---
+
+**✅ After Training: Witch Carries with Strategic Play**
+
+*Game State:* Seer (Player2) dies on Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager."
+
+*What happens (multi-round strategic play):*
+
+**Night 2 (Witch Self-Rescue):**
+When werewolves target Player1 (Witch), she makes a critical decision:
+
+*Private Reasoning from Player1 (Witch):*
+```
+"I'm the witch, and I was just targeted by werewolves.
+The seer is dead, so I'm the last special role standing.
+If I die now, villagers have no way to eliminate werewolves at night.
+
+Strategy: Self-rescue is worth it. I preserve myself to gather more
+information during Day 2, and save my poison for when I can identify
+a werewolf with high confidence."
+```
+
+*Result:* Witch survives, announces her identity on Day 2, gains villagers' trust.
+
+**Day 2 (Villagers Follow Logic):**
+With Witch confirmed, villagers analyze behavior patterns:
+
+*Private Reasoning from Player4 (Villager):*
+```
+"Player1 claimed witch and self-rescued. This is verifiable since
+no one died last night. I trust her now.
+
+Player5 has been pushing aggressive votes without evidence.
+Combined with Player1's analysis, Player5 is most suspicious."
+```
+
+*Public Statement:*
+```
+✅ "I support Player1's assessment. Player5's behavior has been inconsistent:
+they were quick to accuse others but offered no logical reasoning.
+I vote to eliminate Player5."
+```
+
+**Night 3 (Decisive Poison):**
+*Private Reasoning from Player1 (Witch):*
+```
+"Player5 (werewolf) is out. One werewolf remains.
+Player6 has been too quiet and always followed the majority without
+contributing original analysis, which is classic deep cover behavior.
+
+I'm confident Player6 is the last werewolf. Using poison now."
+```
+
+*Result:* Witch poisons Player6 (werewolf). **Good guys win.**
+
+**Why it works:**
+- Witch preserves the healing potion for self-rescue at a critical moment
+- Villagers trust verified information (Witch's self-rescue proof)
+- Team builds consensus through logical deduction, not mob voting
+- Witch uses poison decisively based on behavioral analysis
+- **Final Result: Good guys win.** Witch leads the elimination of both werewolves
+
+---
+
+This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.
+
+**Role-Specific Advanced Patterns:**

 - **Seer**: Strategic target selection, information concealment in public statements, evidence integration
 - **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions

tuner/werewolves/README_zh.md

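Lines changed: 124 additions & 0 deletions

@@ -359,6 +359,130 @@ workflow_args:
 - **Witch**: Resource management (preserve potions for critical moments), protect high-value targets, evidence-based decisions
 - **Villager**: Evidence-chain analysis, building trust with special roles, forming consensus for team coordination

+**Before Training: Mob Mentality & Critical Errors**
+
+Untrained models make fundamental reasoning errors that cause the good guy team to lose:
+- **Mob voting**: Villagers follow accusations without logical analysis, accidentally eliminating teammates
+- **Poor resource management**: Witch wastes potions or fails to rescue a targeted teammate
+- **No evidence integration**: Players ignore key information and act on vague suspicions
+
+**After Training: Strategic Coordination & Witch Carry**
+
+After training, the good guy models exhibit advanced reasoning and team coordination.
+
+#### Summary of Improvements
+
+| Aspect | Before Training | After Training |
+|------|--------|--------|
+| Voting Logic | Mob mentality, blind following | Evidence-chain analysis, logical deduction |
+| Potion Usage | Wasted or unused | Self-rescue at critical moments, precise poisoning |
+| Team Coordination | Mutual suspicion | Trust verified information, follow logical leads |
+
+Below is a concrete example comparing a similar scenario before and after training.
+
+#### Concrete Example
+
+**Similar Scenario (Day 1):** Seer is killed on Night 1. In the final statement, Seer reveals information about another player.
+
+---
+
+**❌ Before Training: Villagers Vote Out Their Own Witch**
+
+*Game State:* Seer (Player7) dies on Night 1, but in the final statement gives a vague warning: "Player4 has a powerful role, be careful..." (Player4 is actually the Witch!)
+
+*What happens:*
+- Player1 (Villager) immediately suspects Player4 based on this vague warning
+- Other villagers blindly follow without questioning the logic
+- Player4 (Witch) defends herself but fails to reveal her role clearly
+- **Result: Player4 (Witch) is voted out 5:1**, so the good guys lose their strongest asset
+
+*Private Reasoning from Player1 (Villager):*
+```
+"Player7's warning wasn't random; it was a signal about Player4.
+Player4 is defending herself, which is suspicious.
+I should vote Player4 out to protect the team."
+```
+
+*Public Statement:*
+```
+❌ "I agree with Player7's warning: trust no one blindly. Player4, your behavior
+has been quiet so far. I think we should seriously consider voting out Player4."
+```
+
+**Why it fails:**
+- Villager acts as "mob leader" without evidence analysis
+- Ignores that Seer's warning could have multiple interpretations
+- Doesn't consider that Player4 might be a special role worth protecting
+- **Final Result: Werewolves win.** Witch eliminated on Day 1, no potions used
+
+---
+
+**✅ After Training: Witch Carries with Strategic Play**
+
+*Game State:* Seer (Player2) dies on Night 1, but leaves clear information: "I checked Player4 and confirmed they are a villager."
+
+*What happens (multi-round strategic play):*
+
+**Night 2 (Witch Self-Rescue):**
+When werewolves target Player1 (Witch), she makes a critical decision:
+
+*Private Reasoning from Player1 (Witch):*
+```
+"I'm the witch, and I was just targeted by werewolves.
+The seer is dead, so I'm the last special role standing.
+If I die now, villagers have no way to eliminate werewolves at night.
+
+Strategy: Self-rescue is worth it. I protect myself to gather more
+information during Day 2, and keep my poison until I can identify
+a werewolf with high confidence."
+```
+
+*Result:* Witch survives, announces her identity on Day 2, gains villagers' trust.
+
+**Day 2 (Villagers Follow Logic):**
+With Witch confirmed, villagers analyze behavior patterns:
+
+*Private Reasoning from Player4 (Villager):*
+```
+"Player1 claimed witch and self-rescued. This is verifiable since
+no one died last night. I trust her now.
+
+Player5 has been pushing aggressive votes without evidence.
+Combined with Player1's analysis, Player5 is most suspicious."
+```
+
+*Public Statement:*
+```
+✅ "I support Player1's assessment. Player5's behavior has been inconsistent:
+they were quick to accuse others but offered no logical reasoning.
+I vote to eliminate Player5."
+```
+
+**Night 3 (Decisive Poison):**
+*Private Reasoning from Player1 (Witch):*
+```
+"Player5 (werewolf) is out. One werewolf remains.
+Player6 has been too quiet and always followed the majority without
+contributing original analysis, which is classic deep cover behavior.
+
+I'm confident Player6 is the last werewolf. Using poison now."
+```
+
+*Result:* Witch poisons Player6 (werewolf). **Good guys win.**
+
+**Why it works:**
+- Witch preserves the antidote for self-rescue at a critical moment
+- Villagers trust verified information (Witch's self-rescue proof)
+- Team builds consensus through logical deduction, not mob voting
+- Witch uses poison decisively based on behavioral analysis
+- **Final Result: Good guys win.** Witch leads the elimination of both werewolves
+
+---
+
+This demonstrates the essence of trained good guy behavior: **strategic resource management, evidence-based reasoning, and team coordination**. The model learns that self-preservation of special roles and logical consensus-building are more valuable than aggressive early voting.
+
+**Role-Specific Advanced Patterns:**
+
 ---

 ## Conclusion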
