
Commit 0615352

Authored by HJSang (Hejian Sang) and gemini-code-assist[bot]
[recipe] feat: Add example for gpt-oss training using agent loop (verl-project#3774)
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

TODO: run training test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---

Co-authored-by: Hejian Sang <hsang@linkedin.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 55f651c commit 0615352

File tree

2 files changed, +190 −7 lines


recipe/langgraph_agent/chat_model.py

Lines changed: 47 additions & 7 deletions
```diff
@@ -43,6 +43,20 @@
 logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
 
 
+def format_tool_response_manually(tool_message: dict, tool_call_name: str) -> str:
+    """Manually format tool response without using tokenizer template.
+
+    Args:
+        tool_message: Tool message dictionary with 'content' field
+        tool_call_name: Name of the tool that was called
+
+    Returns:
+        Formatted tool response string
+    """
+    content = tool_message["content"]
+    return f"<|start|>functions.{tool_call_name} to=assistant<|channel|>commentary<|message|>{content}<|end|>"
+
+
 class MaxTokenExceededError(Exception):
     """Indicate that history chat messages + tool message exceeds LLM max_tokens."""
```
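For reference, the helper's output can be sketched standalone (the function body mirrors the diff above; the tool name and payload below are invented for illustration):

```python
def format_tool_response_manually(tool_message: dict, tool_call_name: str) -> str:
    """Wrap a tool result in the gpt-oss special-token format (mirrors the diff)."""
    content = tool_message["content"]
    return f"<|start|>functions.{tool_call_name} to=assistant<|channel|>commentary<|message|>{content}<|end|>"

# Hypothetical tool message, for illustration only
msg = {"role": "tool", "name": "calculator", "content": '{"result": 4}'}
print(format_tool_response_manually(msg, msg["name"]))
# -> <|start|>functions.calculator to=assistant<|channel|>commentary<|message|>{"result": 4}<|end|>
```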

```diff
@@ -202,13 +216,39 @@ async def _preprocess(self, messages: list[BaseMessage], **kwargs: Any) -> tuple
 
             # encode tool response
             tool_responses = convert_to_openai_messages(messages[i + 1 :])
-            tool_response_ids = await loop.run_in_executor(
-                None,
-                lambda messages=tool_responses: self.tokenizer.apply_chat_template(
-                    messages, add_generation_prompt=True, tokenize=True
-                ),
-            )
-            tool_response_ids = tool_response_ids[len(kwargs["system_prompt"]) :]
+            if self.tool_parser == "hermes":
+                tool_response_ids = await loop.run_in_executor(
+                    None,
+                    lambda messages=tool_responses: self.tokenizer.apply_chat_template(
+                        messages, add_generation_prompt=True, tokenize=True
+                    ),
+                )
+                tool_response_ids = tool_response_ids[len(kwargs["system_prompt"]) :]
+            elif self.tool_parser == "gpt-oss":
+                # Format tool responses manually: the gpt-oss chat template requires
+                # tool call messages in order to parse tool response messages, so we
+                # cannot run apply_chat_template on the tool responses alone.
+                tool_response_texts = []
+                for tool_msg in tool_responses:
+                    if tool_msg["role"] == "tool":
+                        # Use the tool message's name if available (for multiple tool calls)
+                        actual_tool_name = tool_msg.get("name", "unknown")
+                        if actual_tool_name == "unknown":
+                            logger.error(f"actual_tool_name: {actual_tool_name}")
+                        formatted = format_tool_response_manually(tool_msg, actual_tool_name)
+                        tool_response_texts.append(formatted)
+                # Append the generation prompt manually, since this path encodes raw
+                # text and cannot rely on add_generation_prompt=True.
+                tool_response_texts.append("<|start|>assistant")
+
+                # Tokenize the manually formatted tool responses
+                tool_response_text = "".join(tool_response_texts)
+                logger.debug(f"tool_response_text: {tool_response_text}")
+
+                tool_response_ids = await loop.run_in_executor(
+                    None, lambda: self.tokenizer.encode(tool_response_text, add_special_tokens=False)
+                )
+            else:
+                raise ValueError(f"Unsupported tool parser: {self.tool_parser}")
 
             # stop generation if response length exceeds max response length
             if len(messages[i].response_metadata["response_mask"]) + len(tool_response_ids) >= self.max_tokens:
```
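The gpt-oss branch can be sketched in isolation as follows (the helper is redefined so the snippet is self-contained; the tool names and contents are invented):

```python
def format_tool_response_manually(tool_message: dict, tool_call_name: str) -> str:
    """Wrap a tool result in gpt-oss special tokens (same shape as the diff's helper)."""
    content = tool_message["content"]
    return f"<|start|>functions.{tool_call_name} to=assistant<|channel|>commentary<|message|>{content}<|end|>"

# Invented tool responses standing in for convert_to_openai_messages(...) output
tool_responses = [
    {"role": "tool", "name": "search", "content": "result A"},
    {"role": "tool", "name": "search", "content": "result B"},
]

texts = [
    format_tool_response_manually(m, m.get("name", "unknown"))
    for m in tool_responses
    if m["role"] == "tool"
]
# Stand-in for add_generation_prompt=True: open the next assistant turn manually
texts.append("<|start|>assistant")
tool_response_text = "".join(texts)

print(tool_response_text.endswith("<|start|>assistant"))  # -> True
```

The resulting string is what gets tokenized with `add_special_tokens=False`, since the special tokens are already spelled out in the text.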
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
```bash
#!/usr/bin/env bash
#SBATCH --job-name=rl-langgraph-3B
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --gres=gpu:4
#SBATCH --mem=0
#SBATCH --time=10:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

set -xeuo pipefail

# ================= cluster topology =================
export GPUS_PER_NODE=${SLURM_GPUS_ON_NODE:-${GPUS_PER_NODE:-2}}  # GPUs on this node
NNODES=${SLURM_JOB_NUM_NODES:-${NNODES:-1}}
export NNODES
export RAY_NUM_NODES=$NNODES

# Require at least 2 GPUs
TOTAL_GPUS=$((GPUS_PER_NODE * NNODES))
if [ "$TOTAL_GPUS" -lt 2 ]; then
    echo "Error: at least 2 GPUs are required, detected $TOTAL_GPUS." >&2
    exit 1
fi

echo "Using $NNODES nodes and $GPUS_PER_NODE GPUs per node..."

# ================= data/model/tool =================
HDFS_ROOT=${HDFS_ROOT:-$PWD}
DATA_ROOT=${DATA_ROOT:-$PWD}

# Prefer local model if present, otherwise fall back to HF hub path
model_path="lmsys/gpt-oss-20b-bf16"

# Use the default output directory produced by create_dataset.py
train_files=$DATA_ROOT/data/math_expression_tool/train.parquet
test_files=$DATA_ROOT/data/math_expression_tool/test.parquet

# Agent config
agent_loop_config_path=recipe/langgraph_agent/example/agent.yaml

# =================== wandb ===================
project_name=math_expression_tool
experiment_name=gpt-oss-20b-bf16
default_local_dir=$DATA_ROOT/checkpoint/$experiment_name

# ================= algorithm =================
adv_estimator=grpo

use_kl_in_reward=false
kl_coef=0.0
use_kl_loss=false
kl_loss_coef=0.0

clip_ratio_low=0.2
clip_ratio_high=0.28

max_turns=8
max_prompt_length=1024
max_response_length=8192
actor_lr=1e-6

train_batch_size=128
ppo_mini_batch_size=16
n_resp_per_prompt=8
n_resp_per_prompt_val=1

# =================== logging ===================
export RAY_LOGGING_LEVEL=DEBUG
export HYDRA_FULL_ERROR=1

# ================= performance =================
export NCCL_IBEXT_DISABLE=1
export NCCL_NVLS_ENABLE=1
export NCCL_IB_HCA=mlx5
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export VLLM_USE_V1=1
export VLLM_ATTENTION_BACKEND=FLASH_ATTN

infer_tp=2   # vLLM tensor parallel size
train_sp=4   # Ulysses sequence parallel size for actor
offload=true

actor_max_token_len_per_gpu=$(( (max_prompt_length + max_response_length) * 4 ))
log_prob_max_token_len_per_gpu=$(( actor_max_token_len_per_gpu * 2 ))

train_files="['$train_files']"
test_files="['$test_files']"

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=$adv_estimator \
    algorithm.use_kl_in_reward=$use_kl_in_reward \
    algorithm.kl_ctrl.kl_coef=$kl_coef \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.return_raw_chat=true \
    data.train_batch_size=$train_batch_size \
    data.max_prompt_length=$max_prompt_length \
    data.max_response_length=$max_response_length \
    data.filter_overlong_prompts=true \
    data.truncation='error' \
    actor_rollout_ref.model.path="$model_path" \
    actor_rollout_ref.model.use_remove_padding=true \
    actor_rollout_ref.model.enable_gradient_checkpointing=true \
    actor_rollout_ref.actor.use_kl_loss=$use_kl_loss \
    actor_rollout_ref.actor.kl_loss_coef=$kl_loss_coef \
    actor_rollout_ref.actor.clip_ratio_low=$clip_ratio_low \
    actor_rollout_ref.actor.clip_ratio_high=$clip_ratio_high \
    actor_rollout_ref.actor.clip_ratio_c=10.0 \
    actor_rollout_ref.actor.optim.lr=$actor_lr \
    actor_rollout_ref.actor.use_dynamic_bsz=true \
    actor_rollout_ref.actor.ppo_mini_batch_size=$ppo_mini_batch_size \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$actor_max_token_len_per_gpu \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=$train_sp \
    actor_rollout_ref.actor.fsdp_config.param_offload=$offload \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=$offload \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.mode=async \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$infer_tp \
    actor_rollout_ref.rollout.multi_turn.max_user_turns=$max_turns \
    actor_rollout_ref.rollout.multi_turn.max_assistant_turns=$max_turns \
    actor_rollout_ref.rollout.multi_turn.format=gpt-oss \
    actor_rollout_ref.rollout.agent.tool_parser=gpt-oss \
    actor_rollout_ref.rollout.agent.agent_loop_config_path=$agent_loop_config_path \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.n=$n_resp_per_prompt \
    actor_rollout_ref.rollout.val_kwargs.top_p=1.0 \
    actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \
    actor_rollout_ref.rollout.val_kwargs.n=$n_resp_per_prompt_val \
    trainer.logger='["console","wandb"]' \
    trainer.project_name=$project_name \
    trainer.experiment_name=$experiment_name \
    trainer.n_gpus_per_node="$GPUS_PER_NODE" \
    trainer.val_before_train=true \
    trainer.log_val_generations=50 \
    trainer.nnodes="$NNODES" \
    trainer.save_freq=-1 \
    trainer.default_local_dir="$default_local_dir" \
    trainer.test_freq=5 \
    trainer.total_epochs=1 "$@"
```
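As a quick sanity check on the dynamic-batching budgets in the script, the arithmetic works out as follows (plain reproduction of the shell arithmetic, not part of the recipe):

```python
# Reproduce the script's token-budget arithmetic.
max_prompt_length = 1024
max_response_length = 8192

# ppo_max_token_len_per_gpu: 4x the max sequence length per GPU
actor_max_token_len_per_gpu = (max_prompt_length + max_response_length) * 4
# log-prob recomputation gets twice the actor budget
log_prob_max_token_len_per_gpu = actor_max_token_len_per_gpu * 2

print(actor_max_token_len_per_gpu)     # -> 36864
print(log_prob_max_token_len_per_gpu)  # -> 73728
```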
