# Training Email Search Agent with RL using AgentScope-Tuner

This example demonstrates how to implement reinforcement fine-tuning for the Email Search task (inspired by [ART](https://openpipe.ai/blog/art-e-mail-agent)) using AgentScope-Tuner, whose RFT functionality is backed by [Trinity-RFT](https://github.com/modelscope/Trinity-RFT).

## Task Setting

The agent's goal is to answer user queries by searching through an email inbox. The agent needs to:

- Understand the user's question
- Search for relevant emails using keywords
- Read email contents to extract information
- Provide accurate answers with proper source citations

**Agent Type**: The agent (`EmailSearchAgent`) extends `ReActAgent`, which follows a reasoning-acting loop to solve tasks iteratively.

**Environment**: The environment is a SQLite database containing emails from the Enron Email dataset. Each task provides:

- `question`: The user's email search query
- `inbox_address`: The email inbox to search
- `query_date`: The date context for the query
- `answer`: The expected answer (ground truth), used only for reward calculation
- `message_ids`: IDs of the relevant emails containing the answer, used only for reward calculation

**Available Tools**:

- `search_emails`: Find emails by keywords, inbox address, and date range. Returns a list of email summaries (message_id and snippet).
- `read_email`: Read the full content of a specific email by message_id.
- `generate_response`: Provide the final structured answer with sources (inherited from `ReActAgent`).
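
To make the tool interface concrete, here is a minimal self-contained sketch of the two retrieval tools over an in-memory SQLite database. The table schema, function signatures, and sample row are illustrative assumptions; the actual implementations live in `email_search_agent.py`.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    # Hypothetical schema; the real database is built by prepare_data.py.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE emails (message_id TEXT, inbox_address TEXT,"
        " date TEXT, subject TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT INTO emails VALUES (?, ?, ?, ?, ?)",
        ("<id-1>", "pete.davis@enron.com", "2001-03-09",
         "Schedule variances", "Variances detected for hour 6 ..."),
    )
    return conn

def search_emails(conn, keywords, inbox_address, before_date):
    """Return (message_id, snippet) summaries matching all keywords."""
    clause = " AND ".join("body LIKE ?" for _ in keywords)
    params = [f"%{k}%" for k in keywords] + [inbox_address, before_date]
    rows = conn.execute(
        f"SELECT message_id, substr(body, 1, 80) FROM emails"
        f" WHERE {clause} AND inbox_address = ? AND date <= ?",
        params,
    ).fetchall()
    return [{"message_id": m, "snippet": s} for m, s in rows]

def read_email(conn, message_id):
    """Return the full content of a single email, or None if not found."""
    row = conn.execute(
        "SELECT subject, body FROM emails WHERE message_id = ?",
        (message_id,),
    ).fetchone()
    return {"subject": row[0], "body": row[1]} if row else None
```

Note that `search_emails` deliberately returns only short snippets, so the agent must follow up with `read_email` to see full contents.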
## Dataset Preparation

The dataset contains email queries based on the [Enron Email dataset](https://huggingface.co/datasets/corbt/enron-emails). Run the data preparation script to generate the email database and datasets:

```bash
python prepare_data.py
```

If you want to use a different database path, modify `DEFAULT_DB_PATH` in [`prepare_data.py`](./prepare_data.py). Also remember to set the environment variable `DEFAULT_EMAIL_DB_PATH` to point to the database before moving to the next step:

```bash
export DEFAULT_EMAIL_DB_PATH=/path/to/enron_emails_dataset/data/enron_emails.db
```

This will create a SQLite database and datasets:

```
/path/to/enron_emails_dataset/
├── data
│   └── enron_emails.db   # Email database
├── train.parquet         # Training samples
└── test.parquet          # Test samples
```

Each sample looks like:

```json
{
  "id": 0,
  "question": "Were there any variances detected for hour 6 on 3/9/01?",
  "answer": "Yes, variances were detected in both Generation and Energy Import/Export schedules for hour 6 on 3/9/01.",
  "message_ids": ["<17407857.1075840601283.JavaMail.evans@thyme>"],
  "how_realistic": 0.800000011920929,
  "inbox_address": "pete.davis@enron.com",
  "query_date": "2001-03-16"
}
```
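
To make the split between rollout-time inputs and reward-only fields explicit, here is a minimal sketch (field names taken from the sample above; the variable names are illustrative):

```python
import json

# A sample record, as produced by prepare_data.py (answer text abbreviated)
sample = json.loads("""{
  "id": 0,
  "question": "Were there any variances detected for hour 6 on 3/9/01?",
  "answer": "Yes, variances were detected ...",
  "message_ids": ["<17407857.1075840601283.JavaMail.evans@thyme>"],
  "inbox_address": "pete.davis@enron.com",
  "query_date": "2001-03-16"
}""")

# Fields the agent sees at rollout time
rollout_inputs = {k: sample[k] for k in ("question", "inbox_address", "query_date")}

# Fields reserved for reward calculation, never shown to the agent
reward_targets = {k: sample[k] for k in ("answer", "message_ids")}
```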
## Code Implementation

This section provides a high-level overview of the code. For implementation details, please refer to the source files.

### Agent Workflow

The workflow function `run_email_search_agent` implements the agent-environment interaction loop:

```python
async def run_email_search_agent(
    task: Dict,
    model: TunerChatModel,
    auxiliary_models: Dict[str, TunerChatModel],
) -> WorkflowOutput:
    # Parse the task and create the agent
    agent = EmailSearchAgent(
        name="email_search_agent",
        sys_prompt=system_prompt,
        model=model,
        max_iters=max_turns,
    )

    # Run the agent with structured output
    response = await agent.reply(
        msg=Msg("user", question, role="user"),
        structured_model=AnswerModel,
    )

    return WorkflowOutput(response=response)
```
The agent follows a ReAct pattern: it reasons about the task, calls tools to search and read emails, and finally generates a structured response containing the answer and source message IDs.
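
Schematically, that reasoning-acting loop can be sketched as follows. This is an illustrative stand-in, not AgentScope's actual `ReActAgent` internals; `llm` and `tools` here are hypothetical callables.

```python
# Schematic ReAct loop. AgentScope's ReActAgent handles this internally,
# including tool schemas, message formatting, and structured output.
def react_loop(llm, tools, question, max_iters=8):
    history = [{"role": "user", "content": question}]
    for _ in range(max_iters):
        step = llm(history)                      # reason + choose an action
        if step["name"] == "generate_response":  # final structured answer
            return step["input"]
        result = tools[step["name"]](**step["input"])  # act: call the tool
        history.append({"role": "tool", "content": str(result)})
    return None  # gave up within the turn budget
```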
97+
98+
### Judge Function
99+
100+
The judge function `email_search_judge` implements reward calculation using LLM-as-a-Judge:
101+
102+
```python
103+
async def email_search_judge(
104+
task: Dict,
105+
response: Msg,
106+
auxiliary_models: Dict[str, TunerChatModel],
107+
) -> JudgeOutput:
108+
# Extract answer and sources from response
109+
answer = answer_and_sources.get("answer")
110+
sources = answer_and_sources.get("sources", [])
111+
112+
# Judge correctness using LLM-as-a-Judge
113+
judge_model = auxiliary_models.get('judge') or list(auxiliary_models.values())[0]
114+
judge_response = await judge_correctness(
115+
answer, query, judge_model
116+
)
117+
118+
# Calculate reward based on:
119+
# - Answer correctness (accuracy: -1.0 to 1.0)
120+
# - Source correctness (format: partial rewards)
121+
# - Efficiency (bonus for fewer turns, correct sources)
122+
result = {"accuracy": ..., "format": ...} # calculated based on judge_response
123+
124+
return JudgeOutput(
125+
reward=sum(result.values()),
126+
metrics=metrics,
127+
)
128+
```
129+
130+
The reward function considers:
131+
- **Answer correctness**: Evaluated by LLM-as-a-Judge comparing the agent's answer with the ground truth
132+
- **Source correctness**: Whether the agent cited the correct email message IDs
133+
- **Efficiency**: Bonus rewards for finding/reading the correct email and taking fewer turns
134+
135+
See [`main.py`](./main.py) and [`email_search_agent.py`](./email_search_agent.py) for implementation details.
136+
137+
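
For illustration, the three reward components might be combined as below. The weights, the bonus formula, and the helper name `combine_reward` are assumptions for this sketch, not the repository's actual values:

```python
def combine_reward(correct: bool, cited: set, gold: set,
                   turns: int, max_turns: int) -> dict:
    # Answer correctness: +1 if the judge accepts the answer, -1 otherwise
    accuracy = 1.0 if correct else -1.0
    # Source correctness: fraction of ground-truth message ids that were cited
    fmt = len(cited & gold) / len(gold) if gold else 0.0
    # Efficiency: small bonus for finishing early, only when sources are right
    bonus = 0.2 * (1 - turns / max_turns) if cited >= gold else 0.0
    return {"accuracy": accuracy, "format": fmt, "efficiency": bonus}
```

The total reward is then the sum of the components, mirroring `reward=sum(result.values())` in the judge function above.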
## How to Run

### Prerequisites

- At least 4 NVIDIA GPUs with CUDA 12.8 or newer
  - Note: serving the 30B judge model requires substantial GPU memory; you can shard it across multiple GPUs by setting `tensor_parallel_size > 1` to reduce per-GPU memory usage (by default, `tensor_parallel_size=2`).
- Follow the Trinity-RFT [installation guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_installation.html) to install the latest version from source code
- Download the model checkpoints (example):

```bash
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507
huggingface-cli download Qwen/Qwen3-30B-A3B-Instruct-2507  # judge model
```
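
As a rough back-of-envelope (weights only, assuming bf16 and ignoring KV cache and activation overhead), the judge model's size motivates the default `tensor_parallel_size=2`:

```python
# Weight memory for a ~30B-parameter model served in bf16 (2 bytes/param)
params = 30e9
weights_gb = params * 2 / 1e9   # ≈ 60 GB for the weights alone
per_gpu_gb = weights_gb / 2     # ≈ 30 GB per GPU with tensor_parallel_size=2
```

In practice the serving engine also needs headroom for the KV cache, so budget noticeably more than the weight footprint per GPU.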
### Configuration

Adjust the configuration file ([`config.yaml`](./config.yaml)) based on your hardware. Key configuration sections include:

- **TunerChatModel**: Set `model_path` to your model checkpoint path
- **Algorithm**: Configure RL algorithm parameters (e.g., `multi_step_grpo`, learning rate, policy loss function)
- **Dataset**: The dataset path is specified in `main.py` when creating the `Dataset` object
- **Auxiliary Models**: Configure judge model settings for LLM-as-a-Judge

For full configuration details, see the [Trinity-RFT Configuration Guide](https://modelscope.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).
### Start-Up Commands

1. Prepare the dataset:

   ```bash
   python prepare_data.py
   export DEFAULT_EMAIL_DB_PATH=/path/to/enron_emails_dataset/data/enron_emails.db
   ```

2. Set up a [Ray](https://github.com/ray-project/ray) cluster:

   ```bash
   ray start --head
   ```

3. Run the training script:

   ```bash
   python main.py
   ```
## Experimental Results

### Quantitative Results

The training results show improvements in agent performance over training iterations. Key metrics include:

- **Train reward**: The average reward on training samples increases as the agent learns better search strategies
- **Rollout accuracy**: The average answer accuracy on rollout samples rises accordingly

![Training Rewards](./critic_reward_mean.png)

![Rollout Accuracy](./rollout_accuracy_mean.png)

### Concrete Example

An example of the agent's behavior is shown below:

**Query:** "What do the color codes mean in the curve assessment?"

We show the last several turns of the agent's trajectory.

The agent performs multiple search attempts to find relevant emails. After some unsuccessful searches, the agent tries:

**Tool call:**
```json
{
  "type": "tool_use",
  "name": "search_emails",
  "input": {
    "inbox_address": "steven.kean@enron.com",
    "query_date": "2001-04-03",
    "keywords": ["curve", "assessment"]
  }
}
```

**Tool result:**
```json
{
  "type": "tool_result",
  "name": "search_emails",
  "output": [
    {
      "message_id": "<2654330.1075846153519.JavaMail.evans@thyme>",
      "snippet": "... curves and Senior Management as necessary.\\n\\n<b>Curve Assessment</b>\\n Green: \\n - curve is reasonable\\n - small P&L ..."
    },
    {
      "message_id": "<12499967.1075846153495.JavaMail.evans@thyme>",
      "snippet": "... curves and Senior Management as necessary.\\n\\n<b>Curve Assessment</b>\\n Green: \\n - curve is reasonable\\n - small P&L ..."
    }
  ]
}
```

After finding relevant emails, the agent uses the `read_email` tool to read the full content of both emails:

**Tool call:**
```json
{
  "type": "tool_use",
  "name": "read_email",
  "input": {
    "message_id": "<2654330.1075846153519.JavaMail.evans@thyme>"
  }
}
```

**Tool call:**
```json
{
  "type": "tool_use",
  "name": "read_email",
  "input": {
    "message_id": "<12499967.1075846153495.JavaMail.evans@thyme>"
  }
}
```

After reading the emails, the agent generates the final structured response:

**Tool call:**
```json
{
  "type": "tool_use",
  "name": "generate_response",
  "input": {
    "answer": "In the curve assessment, the color codes are used as follows:\n- Green: The curve is considered reasonable, and the P&L (profit and loss) is small.",
    "sources": [
      "<2654330.1075846153519.JavaMail.evans@thyme>",
      "<12499967.1075846153495.JavaMail.evans@thyme>"
    ]
  }
}
```

The judge evaluates the answer as correct based on the ground truth.