43 changes: 43 additions & 0 deletions skyrl/examples/train/README.md
@@ -0,0 +1,43 @@
# SkyRL-Train Examples
Welcome to the SkyRL-Train examples! This folder contains the following examples, grouped by topic.

## Algorithms

- `algorithms/`: Examples showing how to configure and run RL with various algorithms and policy-loss variants (e.g., DAPO, SAPO, GRPO, CISPO, GSPO, or your own custom advantage estimators and policy losses); a minimal override sketch follows this list.
- `ppo/`: Vanilla PPO training (with a critic, ref, and policy model)
- `on_policy_distillation/`: [On-policy distillation recipe](https://novasky-ai.notion.site/on-policy-distillation) that uses a teacher model to provide dense token-level rewards during training, reproducing results from the [Thinking Machines blog](https://thinkingmachines.ai/blog/on-policy-distillation/).
- `tis_correction/`: Applying [Flash-RL TIS](https://fengyao.notion.site/off-policy-rl) correction to improve off-policy stability.
- `turn_level_rewards/`: GSM8K multi-turn environment illustrating turn-level rewards and custom advantage estimators.
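
Each variant is selected through Hydra-style overrides of the algorithm config. A minimal sketch of the pattern (the override keys are taken from the example scripts in this directory; most other fields are left at their defaults or omitted here):

```bash
# Pick a policy-loss variant by overriding the algorithm section, e.g. CISPO:
uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  trainer.algorithm.policy_loss_type="cispo" \
  data.train_data="['$HOME/data/gsm8k/train.parquet']" \
  environment.env_class=gsm8k
```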

## Async RL

- `async/`: One-step off-policy GRPO with an asynchronous generator–trainer loop.
- `fully_async/`: Fully asynchronous (PipelineRL/AReal-style) GRPO training with in-flight weight updates. [See docs for full design + details](https://docs.skyrl.ai/docs/tutorials/one_step_off_async).

## Tasks

- `gsm8k/`: Basic GSM8K math word-problem dataset utilities and baseline training/generation scripts.
- `llm_as_a_judge/`: GSM8K training with an external LLM as a judge to produce rewards instead of strict exact-match grading.
- `multiply/`: Toy arithmetic environment for multiplying numbers, useful for quick sanity checks and debugging.
- `livecodebench/`: LiveCodeBench code-generation task setup and training scripts.
- `text_to_sql/`: [Text-to-SQL (SkyRL-SQL)](https://docs.skyrl.ai/docs/examples/multi_turn_text2sql) environment and training scripts for mapping natural language questions to SQL queries.
- `step_wise/`: Step-wise, chat-template-agnostic multi-turn RL training.
- `search/`: Multi-turn search agent training with the SearchR1 dataset, backed by a FAISS-based retriever server.

## Integrations

- `flash_rl/`: Integration with [FlashRL’s](https://fengyao.notion.site/flash-rl) patched vLLM inference engine for high-throughput RL training.
- `harbor/`: Custom [Harbor](https://harborframework.com/) Generator for training agents to solve TerminalBench tasks.
- `mini_swe_agent/`: Integration with [Mini-SWE-Agent](https://github.com/SWE-agent/mini-swe-agent) to train coding agents on SWE-Bench via SkyRL.
- `../integrations/verifiers/`: Integration with PrimeIntellect's [Verifiers Library](https://github.com/PrimeIntellect-ai/verifiers) + [Environments Hub](https://app.primeintellect.ai/dashboard/environments)
- `../integrations/openenv/`: Integration with HuggingFace/Meta [OpenEnv](https://github.com/meta-pytorch/OpenEnv)

## Large Scale Model Training

- `megatron/`: Examples for running SkyRL with the Megatron backend for 5D parallelism.
- `moe/`: Work-in-progress MoE training example used for developing and testing large-scale, multi-node Mixture-of-Experts support.
- `gptoss/`: Training example for the GPT-OSS-20B model using patched attention to support attention sinks.

## Features and More

- `lora/`: LoRA RL fine-tuning recipes.
- `remote_inference_engine/`: Scripts for running remote vLLM/sglang inference servers and connecting them to SkyRL.
- `training_backends/`: Runner scripts demonstrating how to use different training backends on SkyRL.
61 changes: 61 additions & 0 deletions skyrl/examples/train/algorithms/cispo/run_cispo_gsm8k.sh
@@ -0,0 +1,61 @@
#!/bin/bash
set -x

# Example of CISPO policy loss training.
# Clipped Importance Sampling Weight Policy Optimization (CISPO) clips the
# importance-sampling weight instead of the PPO ratio for more efficient RL
# training; a standalone sketch of the loss follows this script.

# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/cispo/run_cispo_gsm8k.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb" # change to "console" to print to stdout

# Configure CISPO parameters
POLICY_LOSS="cispo"
CISPO_EPS_CLIP_LOW=0
CISPO_EPS_CLIP_HIGH=5
USE_KL_LOSS=false

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
data.train_data="['$DATA_DIR/train.parquet']" \
data.val_data="['$DATA_DIR/validation.parquet']" \
trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
trainer.algorithm.cispo.cispo_eps_clip_low=$CISPO_EPS_CLIP_LOW \
trainer.algorithm.cispo.cispo_eps_clip_high=$CISPO_EPS_CLIP_HIGH \
trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
trainer.placement.colocate_all=true \
trainer.strategy=fsdp2 \
trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
generator.num_inference_engines=$NUM_GPUS \
generator.inference_engine_tensor_parallel_size=1 \
trainer.epochs=20 \
trainer.eval_batch_size=1024 \
trainer.eval_before_train=true \
trainer.eval_interval=5 \
trainer.update_epochs_per_batch=1 \
trainer.train_batch_size=1024 \
trainer.policy_mini_batch_size=256 \
trainer.micro_forward_batch_size_per_gpu=64 \
trainer.micro_train_batch_size_per_gpu=64 \
trainer.ckpt_interval=10 \
trainer.max_prompt_length=512 \
generator.sampling_params.max_generate_length=1024 \
trainer.policy.optimizer_config.lr=1.0e-6 \
trainer.algorithm.use_kl_loss=$USE_KL_LOSS \
generator.backend=vllm \
generator.run_engines_locally=true \
generator.weight_sync_backend=nccl \
generator.async_engine=true \
generator.batched=true \
environment.env_class=gsm8k \
generator.n_samples_per_prompt=5 \
generator.gpu_memory_utilization=0.8 \
trainer.logger="$LOGGER" \
trainer.project_name="cispo_gsm8k" \
trainer.run_name="cispo_gsm8k_test" \
trainer.resume_mode=null \
trainer.ckpt_path="$HOME/ckpts/cispo_gsm8k_1.5B_ckpt" \
"$@"
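
For intuition, here is a minimal sketch of a CISPO-style policy loss. It is illustrative only: how `cispo_eps_clip_low` / `cispo_eps_clip_high` map onto the clamp bounds in SkyRL's actual implementation is an assumption, so treat the bounds below as placeholders.

```python
import torch

def cispo_loss_sketch(log_probs, old_log_probs, advantages, response_mask,
                      clip_low=0.0, clip_high=5.0):
    # Token-level importance-sampling ratio between the current policy and the
    # rollout (behaviour) policy.
    ratio = torch.exp(log_probs - old_log_probs)
    # CISPO clips the IS weight itself and detaches it, so every token keeps a
    # REINFORCE-style gradient through log_probs instead of being dropped by
    # PPO-style ratio clipping.
    clipped_weight = torch.clamp(ratio, min=clip_low, max=clip_high).detach()
    per_token_loss = -clipped_weight * advantages * log_probs
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1.0)
```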
74 changes: 74 additions & 0 deletions skyrl/examples/train/algorithms/clip_cov_kl_cov/README.md
@@ -0,0 +1,74 @@
# Clip-Cov and KL-Cov Policy Loss Examples

This directory contains examples for using **Clip-Cov** and **KL-Cov** policy loss functions, based on the implementation from [PRIME-RL/Entropy-Mechanism-of-RL](https://github.com/PRIME-RL/Entropy-Mechanism-of-RL).

## Overview

Both methods improve training stability by using covariance-based token selection (a minimal sketch follows the list below):

- **Clip-Cov**: Combines standard PPO clipping with covariance-based correction masking
- **KL-Cov**: Applies KL regularization to tokens selected based on covariance values
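
Both rules hinge on a per-token "covariance" between the token's log-probability and its advantage. The snippet below is a minimal sketch of that computation and of KL-Cov's top-fraction selection; the function names and tensor layout are illustrative, not SkyRL's internal API. Clip-Cov instead randomly detaches the gradient of a small `clip_ratio` fraction of tokens whose covariance falls inside `[clip_cov_lb, clip_cov_ub]`.

```python
import torch

def token_covariance(log_probs, advantages, response_mask):
    # Per-token centered product of log-prob and advantage, computed over the
    # valid response tokens in the batch.
    n = response_mask.sum().clamp(min=1.0)
    lp_mean = (log_probs * response_mask).sum() / n
    adv_mean = (advantages * response_mask).sum() / n
    return (log_probs - lp_mean) * (advantages - adv_mean) * response_mask

def kl_cov_token_mask(cov, response_mask, kl_cov_frac=0.2):
    # KL-Cov: apply the extra KL penalty only to the top `kl_cov_frac`
    # fraction of valid tokens, ranked by covariance.
    valid_cov = cov[response_mask.bool()]
    k = max(1, int(kl_cov_frac * valid_cov.numel()))
    threshold = torch.topk(valid_cov, k).values.min()
    return (cov >= threshold) & response_mask.bool()
```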

## Usage

### Prerequisites

1. Prepare GSM8K data:
```bash
uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
```

2. Set up Weights & Biases (optional):
```bash
export WANDB_API_KEY=<your_key_here>
```

### Running Clip-Cov

```bash
bash examples/algorithms/clip_cov_kl_cov/run_clip_cov.sh
```

**Key parameters:**
- `trainer.algorithm.policy_loss_type="clip_cov"`
- `trainer.algorithm.clip_cov.clip_ratio=0.0002` - fraction of tokens to clip based on covariance
- `trainer.algorithm.clip_cov.clip_cov_lb=1.0` - lower bound for covariance clipping
- `trainer.algorithm.clip_cov.clip_cov_ub=5.0` - upper bound for covariance clipping

### Running KL-Cov

```bash
bash examples/algorithms/clip_cov_kl_cov/run_kl_cov.sh
```

**Key parameters:**
- `trainer.algorithm.policy_loss_type="kl_cov"`
- `trainer.algorithm.kl_cov.kl_cov_frac=0.2` - fraction of tokens to apply KL regularization to (here 20%)
- `trainer.algorithm.kl_cov.ppo_kl_coef=1.0` - coefficient for KL regularization term

## Configuration

Both methods are configured through the algorithm section of your config:

```yaml
trainer:
algorithm:
policy_loss_type: "clip_cov" # or "kl_cov"

# Clip-Cov specific parameters
clip_cov:
clip_ratio: 0.0002
clip_cov_lb: 1.0
clip_cov_ub: 5.0

# KL-Cov specific parameters
kl_cov:
kl_cov_frac: 0.2
ppo_kl_coef: 1.0
```


## Reference

- Paper: https://arxiv.org/abs/2505.22617
- Code: https://github.com/PRIME-RL/Entropy-Mechanism-of-RL
64 changes: 64 additions & 0 deletions skyrl/examples/train/algorithms/clip_cov_kl_cov/run_clip_cov.sh
@@ -0,0 +1,64 @@
#!/bin/bash
set -x

# Example of Clip-Cov policy loss training
# Covariance-based clipping for improved training stability on GSM8K.
#
# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/clip_cov_kl_cov/run_clip_cov.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb" # change to "console" to print to stdout

# Configure Clip-Cov parameters
POLICY_LOSS="clip_cov"
CLIP_COV_RATIO=0.0002
CLIP_COV_LB=1.0
CLIP_COV_UB=5.0

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
data.train_data="['$DATA_DIR/train.parquet']" \
data.val_data="['$DATA_DIR/validation.parquet']" \
trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
trainer.algorithm.clip_cov.clip_ratio=$CLIP_COV_RATIO \
trainer.algorithm.clip_cov.clip_cov_lb=$CLIP_COV_LB \
trainer.algorithm.clip_cov.clip_cov_ub=$CLIP_COV_UB \
trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
trainer.placement.colocate_all=true \
trainer.strategy=fsdp2 \
trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
generator.num_inference_engines=$NUM_GPUS \
generator.inference_engine_tensor_parallel_size=1 \
trainer.epochs=20 \
trainer.eval_batch_size=1024 \
trainer.eval_before_train=true \
trainer.eval_interval=5 \
trainer.update_epochs_per_batch=1 \
trainer.train_batch_size=1024 \
trainer.policy_mini_batch_size=256 \
trainer.micro_forward_batch_size_per_gpu=64 \
trainer.micro_train_batch_size_per_gpu=64 \
trainer.ckpt_interval=10 \
trainer.max_prompt_length=512 \
generator.sampling_params.max_generate_length=1024 \
trainer.policy.optimizer_config.lr=1.0e-6 \
trainer.algorithm.use_kl_loss=true \
trainer.algorithm.kl_loss_coef=0.001 \
generator.backend=vllm \
generator.run_engines_locally=true \
generator.weight_sync_backend=nccl \
generator.async_engine=true \
generator.batched=true \
environment.env_class=gsm8k \
generator.n_samples_per_prompt=5 \
generator.gpu_memory_utilization=0.8 \
trainer.logger="$LOGGER" \
trainer.project_name="clip_cov_gsm8k" \
trainer.run_name="clip_cov_gsm8k_test" \
trainer.resume_mode=null \
trainer.ckpt_path="$HOME/ckpts/clip_cov_gsm8k_1.5B_ckpt" \
"$@"
63 changes: 63 additions & 0 deletions skyrl/examples/train/algorithms/clip_cov_kl_cov/run_kl_cov.sh
@@ -0,0 +1,63 @@
#!/bin/bash
set -x

# Example of KL-Cov policy loss training
# Uses covariance-based selection to apply KL regularization to a subset of tokens
# for improved training stability on GSM8K.
#
# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/clip_cov_kl_cov/run_kl_cov.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb" # change to "console" to print to stdout

# Configure KL-Cov parameters
POLICY_LOSS="kl_cov"
KL_COV_FRAC=0.2
PPO_KL_COEF=1.0

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
data.train_data="['$DATA_DIR/train.parquet']" \
data.val_data="['$DATA_DIR/validation.parquet']" \
trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
trainer.algorithm.kl_cov.kl_cov_frac=$KL_COV_FRAC \
trainer.algorithm.kl_cov.ppo_kl_coef=$PPO_KL_COEF \
trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
trainer.placement.colocate_all=true \
trainer.strategy=fsdp2 \
trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
generator.num_inference_engines=$NUM_GPUS \
generator.inference_engine_tensor_parallel_size=1 \
trainer.epochs=20 \
trainer.eval_batch_size=1024 \
trainer.eval_before_train=true \
trainer.eval_interval=5 \
trainer.update_epochs_per_batch=1 \
trainer.train_batch_size=1024 \
trainer.policy_mini_batch_size=256 \
trainer.micro_forward_batch_size_per_gpu=64 \
trainer.micro_train_batch_size_per_gpu=64 \
trainer.ckpt_interval=10 \
trainer.max_prompt_length=512 \
generator.sampling_params.max_generate_length=1024 \
trainer.policy.optimizer_config.lr=1.0e-6 \
trainer.algorithm.use_kl_loss=true \
trainer.algorithm.kl_loss_coef=0.001 \
generator.backend=vllm \
generator.run_engines_locally=true \
generator.weight_sync_backend=nccl \
generator.async_engine=true \
generator.batched=true \
environment.env_class=gsm8k \
generator.n_samples_per_prompt=5 \
generator.gpu_memory_utilization=0.8 \
trainer.logger="$LOGGER" \
trainer.project_name="kl_cov_gsm8k" \
trainer.run_name="kl_cov_gsm8k_test" \
trainer.resume_mode=null \
trainer.ckpt_path="$HOME/ckpts/kl_cov_gsm8k_1.5B_ckpt" \
"$@"
@@ -0,0 +1,56 @@
"""
uv run --isolated --extra vllm -m examples.algorithm.custom_advantage_estimator.main_custom_adv_est
"""

import ray
import hydra
import torch
import numpy as np
from omegaconf import DictConfig
from skyrl_train.utils import initialize_ray
from skyrl_train.entrypoints.main_base import BasePPOExp, config_dir, validate_cfg
from skyrl_train.utils.ppo_utils import AdvantageEstimatorRegistry


# Example of custom advantage estimator: "simple_baseline"
def compute_simple_baseline_advantage(
token_level_rewards: torch.Tensor, response_mask: torch.Tensor, index: np.ndarray, **kwargs
):
"""
A simple custom advantage estimator that uses response-level rewards
and computes advantages against a simple baseline.

This is just an example - replace with your own logic.
"""
with torch.no_grad():
response_rewards = (token_level_rewards * response_mask).sum(dim=-1, keepdim=True)

# Simple baseline: use the mean reward across the batch
baseline = response_rewards.mean()
advantages = (response_rewards - baseline) * response_mask
returns = advantages.clone()

return advantages, returns


# Register the custom advantage estimator
AdvantageEstimatorRegistry.register("simple_baseline", compute_simple_baseline_advantage)


@ray.remote(num_cpus=1)
def skyrl_entrypoint(cfg: DictConfig):
exp = BasePPOExp(cfg)
exp.run()


@hydra.main(config_path=config_dir, config_name="ppo_base_config", version_base=None)
def main(cfg: DictConfig) -> None:
# validate the arguments
validate_cfg(cfg)

initialize_ray(cfg)
ray.get(skyrl_entrypoint.remote(cfg))


if __name__ == "__main__":
main()
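
To use the registered estimator in an actual run, select it by name through a config override when launching. A minimal sketch of the invocation (the exact override key for the advantage estimator is an assumption here; check `ppo_base_config` for the real field name):

```bash
# Hypothetical launch selecting the "simple_baseline" estimator registered above.
uv run --isolated --extra vllm -m examples.algorithm.custom_advantage_estimator.main_custom_adv_est \
  trainer.algorithm.advantage_estimator="simple_baseline" \
  data.train_data="['$HOME/data/gsm8k/train.parquet']" \
  environment.env_class=gsm8k
```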