This guide shows how to quickly run training and how to customize everything from basic runtime knobs to fine‑grained agent, environment, and PPO settings.
- Make sure dependencies are installed and you can access your target model (e.g.,
Qwen/Qwen2.5-0.5B-Instruct). - From the project root, run the training script:
./quick_train_qwen_halfb.sh.
That’s it for the first run. Next, tailor the run by editing a few variables directly inside the script.
Open quick_train_qwen_halfb.sh and adjust the variables at the top of the file. These are the most commonly tuned parameters:
- CUDA_VISIBLE_DEVICES: Which GPUs to use (e.g.,
"0","0,1"). - AGENT_GROUP_NUM: Number of training groups. Groups run in parallel and each group shares a single random seed.
- AGENT_GROUP_SIZE: Number of agents per group. Total training instances per rollout step =
AGENT_GROUP_NUM × AGENT_GROUP_SIZE. - TRAINING_TASKS: Comma‑separated list of task names defined in
configs/agents.yaml. - VALIDATION_TASKS: Comma‑separated list of validation task names (must match the lengths of validation group lists below).
- VALIDATION_AGENT_GROUP_NUM: Comma‑separated per‑validation‑task group counts.
- VALIDATION_AGENT_GROUP_SIZE: Comma‑separated per‑validation‑task group sizes.
- N_GPUS_PER_NODE: Number of GPUs per node used by the trainer.
- MODEL_PATH: Model identifier or local path (e.g.,
Qwen/Qwen2.5-0.5B-Instruct). - Optional labels: PROJECT_NAME, EXPERIMENT_NAME.
Examples (edit values directly inside the script):
- Train with 8×16: set
AGENT_GROUP_NUM=8,AGENT_GROUP_SIZE=16. - Validate with 64×1: set
VALIDATION_AGENT_GROUP_NUM="64",VALIDATION_AGENT_GROUP_SIZE="1". - Multiple validation tasks: e.g.,
VALIDATION_TASKS="simpleSokobanAgent,largeSokobanAgent", then setVALIDATION_AGENT_GROUP_NUM="64,64"andVALIDATION_AGENT_GROUP_SIZE="1,1"(same length and order as tasks). - Switch model: set
MODEL_PATH="Qwen/Qwen2.5-0.5B-Instruct"(or any compatible model you have access to). - Choose GPUs: set
CUDA_VISIBLE_DEVICES="0,1"(ensureN_GPUS_PER_NODEis consistent with how many GPUs you intend to use).
Notes:
- Random seeds: the same group shares the same seed; different groups get different seeds.
- Validation group arrays must align with
VALIDATION_TASKSone‑to‑one.
Tasks are defined by a pair of sections:
- agent_config: Defines the task lifecycle and agent behavior (system/prompt, turn limits, formatting, action separators, etc.).
- env_config: Defines environment hyperparameters (difficulty, sizes, time limits, vocab lookups, dataset paths, render mode, etc.).
Create your own task by adding a new entry keyed by a unique name. Example (trimmed):
mySokobanTiny:
agent_type: "sokobanAgent"
agent_config:
system_prompt: "You are a helpful AI assistant that solves Sokoban puzzles step by step."
prompt: "You are solving the Sokoban puzzle... <answer>Right || Up</answer>"
enable_think: true
max_turns: 6
max_actions_per_turn: 5
max_actions_all_turns: 10
action_separator: "||"
env_config:
dim_room: [6, 6]
num_boxes: 1
max_steps: 100
render_mode: "text"Then reference mySokobanTiny inside TRAINING_TASKS or VALIDATION_TASKS in your script. You can mix different agents (e.g., sokobanAgent, tetrisAgent, blocksworldAgent, gsm8kAgent, webshopAgent, birdAgent) and provide distinct env_config for each.
To add a new agent/env implementation:
- Implement your logic following the base interfaces in
grl/agents/base_agent.pyandgrl/agents/base_env.py. - Add a new keyed entry in
configs/agents.yamlwith youragent_type,agent_config, andenv_config. - Use that key in
TRAINING_TASKS/VALIDATION_TASKSinside the script.
This design gives you high freedom to customize both the agent’s lifecycle and the environment’s dynamics without changing the training loop.
For deeper control beyond the script’s convenience knobs, edit configs/base.yaml:
- rollout: truncation method, rollout filtering (
rollout_filter_ratio,rollout_filter_type), reward normalization (grouping,method), validation settings. - data:
max_prompt_length,max_response_length,train_batch_size. - algorithm: PPO/GAE settings (
gamma,lam,adv_estimator, KL penalty and coefficient). - actor_rollout_ref: model path, whether to use reference model, entropy coefficient, KL settings, clipping ranges, micro‑batch sizes, rollout engine options.
- critic: critic model path and optimizer settings.
- trainer: total steps, validation cadence (
validation_steps,val_before_train), logging, test frequency, save frequency, resume behavior, GPUs per node. - gpu_memory_utilization and max_model_len under rollout engine if needed.
- model_path here acts as a project default; the script can override it at runtime.
Tip: Keep defaults in base.yaml for reproducibility, and only override per‑run values in the script when experimenting.
Shortcuts for visualization and CPU parallelism:
- Set
rollout.show_tqdm: Trueto see progress bars for concurrent agent steps (prompts/tokenize/env/agents). - Optional knobs (0 = auto):
rollout.num_prompt_threads,rollout.num_env_threads,rollout.num_init_threads. - Example:
rollout: show_tqdm: True num_prompt_threads: 0 num_env_threads: 0 num_init_threads: 0
- Edit
quick_train_qwen_halfb.shvariables for tasks, model, GPUs, and group sizes. - Optionally add/modify tasks in
configs/agents.yamlto change agent behavior or environment difficulty. - Optionally tune training/rollout details in
configs/base.yamlfor PPO and system behavior. - Run
./quick_train_qwen_halfb.sh.
You now have a sharp, easy workflow: quick edits in the script for day‑to‑day runs, and high‑freedom customization via agents.yaml and base.yaml when you need finer control.