
Commit 8c3bd9d

[migration] copy old docs, examples, integrations, scripts (#1133)
Copy over old docs, examples, integrations, and scripts (WIP) to make sure all of these run against the new refactored code; import paths will still need to be changed and tested. cc: @CharlieFRuan
1 parent b4355c1 commit 8c3bd9d

205 files changed

Lines changed: 17486 additions & 0 deletions


skyrl/examples/train/README.md

Lines changed: 43 additions & 0 deletions
# SkyRL-Train Examples

Welcome to the SkyRL-Train examples! This folder contains the following examples.

## Algorithms

- `algorithms/`: Examples for how to configure and run RL with various algorithms and policy-loss variants (e.g., DAPO, SAPO, GRPO, CISPO, GSPO, or your own custom advantage estimators and custom policy losses).
- `ppo/`: Vanilla PPO training (with a critic, ref, and policy model).
- `on_policy_distillation/`: [On-policy distillation recipe](https://novasky-ai.notion.site/on-policy-distillation) that uses a teacher model to provide dense token-level rewards during training, reproducing results from the [Thinking Machines blog](https://thinkingmachines.ai/blog/on-policy-distillation/).
- `tis_correction/`: Applying [Flash-RL TIS](https://fengyao.notion.site/off-policy-rl) correction to improve off-policy stability.
- `turn_level_rewards/`: GSM8K multi-turn environment illustrating turn-level rewards and custom advantage estimators.

## Async RL

- `async/`: One-step off-policy GRPO with an asynchronous generator–trainer loop.
- `fully_async/`: Fully asynchronous (PipelineRL/AReal-style) GRPO training with in-flight weight updates. [See the docs for the full design and details](https://docs.skyrl.ai/docs/tutorials/one_step_off_async).

## Tasks

- `gsm8k/`: Basic GSM8K math word-problem dataset utilities and baseline training/generation scripts.
- `llm_as_a_judge/`: GSM8K training with an external LLM as a judge to produce rewards instead of strict exact-match grading.
- `multiply/`: Toy arithmetic environment for multiplying numbers, useful for quick sanity checks and debugging.
- `livecodebench/`: LiveCodeBench code-generation task setup and training scripts.
- `text_to_sql/`: [Text-to-SQL (SkyRL-SQL)](https://docs.skyrl.ai/docs/examples/multi_turn_text2sql) environment and training scripts for mapping natural-language questions to SQL queries.
- `step_wise/`: Step-wise training for chat-template-agnostic multi-turn RL training.
- `search/`: Multi-turn search agent training with the SearchR1 dataset, backed by a FAISS-based retriever server.

## Integrations

- `flash_rl/`: Integration with [FlashRL's](https://fengyao.notion.site/flash-rl) patched vLLM inference engine for high-throughput RL training.
- `harbor/`: Custom [Harbor](https://harborframework.com/) Generator for training agents to solve TerminalBench tasks.
- `mini_swe_agent/`: Integration with [Mini-SWE-Agent](https://github.com/SWE-agent/mini-swe-agent) to train coding agents on SWE-Bench via SkyRL.
- `../integrations/verifiers/`: Integration with PrimeIntellect's [Verifiers library](https://github.com/PrimeIntellect-ai/verifiers) and [Environments Hub](https://app.primeintellect.ai/dashboard/environments).
- `../integrations/openenv/`: Integration with the HuggingFace/Meta [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment framework.

## Large Scale Model Training

- `megatron/`: Examples for running SkyRL with the Megatron backend for 5D parallelism.
- `moe/`: Work-in-progress MoE training example used for developing and testing large-scale multi-node Mixture-of-Experts support.
- `gptoss/`: Training example for the GPT-OSS-20B model using patched attention to support attention sinks.

## Features and More

- `lora/`: LoRA RL fine-tuning recipes.
- `remote_inference_engine/`: Scripts for running remote vLLM/SGLang inference servers and connecting them to SkyRL.
- `training_backends/`: Runner scripts demonstrating how to use different training backends with SkyRL.
Lines changed: 61 additions & 0 deletions
#!/bin/bash
set -x

# Example of CISPO policy loss training
# Clipped Importance Sampling Weight Policy Optimization (CISPO) for better RL efficiency

# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/cispo/run_cispo_gsm8k.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb"  # change to "console" to print to stdout

# Configure CISPO parameters
POLICY_LOSS="cispo"
CISPO_EPS_CLIP_LOW=0
CISPO_EPS_CLIP_HIGH=5
USE_KL_LOSS=false

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  data.train_data="['$DATA_DIR/train.parquet']" \
  data.val_data="['$DATA_DIR/validation.parquet']" \
  trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
  trainer.algorithm.cispo.cispo_eps_clip_low=$CISPO_EPS_CLIP_LOW \
  trainer.algorithm.cispo.cispo_eps_clip_high=$CISPO_EPS_CLIP_HIGH \
  trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
  trainer.placement.colocate_all=true \
  trainer.strategy=fsdp2 \
  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
  generator.num_inference_engines=$NUM_GPUS \
  generator.inference_engine_tensor_parallel_size=1 \
  trainer.epochs=20 \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=true \
  trainer.eval_interval=5 \
  trainer.update_epochs_per_batch=1 \
  trainer.train_batch_size=1024 \
  trainer.policy_mini_batch_size=256 \
  trainer.micro_forward_batch_size_per_gpu=64 \
  trainer.micro_train_batch_size_per_gpu=64 \
  trainer.ckpt_interval=10 \
  trainer.max_prompt_length=512 \
  generator.sampling_params.max_generate_length=1024 \
  trainer.policy.optimizer_config.lr=1.0e-6 \
  trainer.algorithm.use_kl_loss=$USE_KL_LOSS \
  generator.backend=vllm \
  generator.run_engines_locally=true \
  generator.weight_sync_backend=nccl \
  generator.async_engine=true \
  generator.batched=true \
  environment.env_class=gsm8k \
  generator.n_samples_per_prompt=5 \
  generator.gpu_memory_utilization=0.8 \
  trainer.logger="$LOGGER" \
  trainer.project_name="cispo_gsm8k" \
  trainer.run_name="cispo_gsm8k_test" \
  trainer.resume_mode=null \
  trainer.ckpt_path="$HOME/ckpts/cispo_gsm8k_1.5B_ckpt" \
$@
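
For reference, a minimal sketch of a CISPO-style objective follows: the importance-sampling weight is clipped and detached, so every token (including clipped ones) still contributes a policy-gradient signal through its log-probability. The function name, the clip-bound convention for `cispo_eps_clip_low`/`cispo_eps_clip_high`, and the tensor shapes are assumptions for illustration, not the exact skyrl_train implementation.

```python
import torch

def cispo_policy_loss(log_probs, old_log_probs, advantages, response_mask,
                      eps_clip_low=0.0, eps_clip_high=5.0):
    """CISPO-style loss sketch: clip and detach the IS weight, keep the
    gradient flowing through log_probs for every response token."""
    ratio = torch.exp(log_probs - old_log_probs)
    # Clipped, gradient-stopped importance-sampling weight.
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_clip_low, 1.0 + eps_clip_high).detach()
    per_token_loss = -clipped_ratio * advantages * log_probs
    # Average over valid response tokens.
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```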
Lines changed: 74 additions & 0 deletions
# Clip-Cov and KL-Cov Policy Loss Examples

This directory contains examples for using **Clip-Cov** and **KL-Cov** policy loss functions, based on the implementation from [PRIME-RL/Entropy-Mechanism-of-RL](https://github.com/PRIME-RL/Entropy-Mechanism-of-RL).

## Overview

Both methods improve training stability through covariance-based token selection (a rough sketch of this selection appears after the configuration example below):

- **Clip-Cov**: Combines standard PPO clipping with covariance-based correction masking
- **KL-Cov**: Applies KL regularization to tokens selected based on covariance values

## Usage

### Prerequisites

1. Prepare GSM8K data:
```bash
uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
```

2. Set up Weights & Biases (optional):
```bash
export WANDB_API_KEY=<your_key_here>
```

### Running Clip-Cov

```bash
bash examples/algorithms/clip_cov_kl_cov/run_clip_cov.sh
```

**Key parameters:**
- `trainer.algorithm.policy_loss_type="clip_cov"`
- `trainer.algorithm.clip_cov.clip_ratio=0.0002` - fraction of tokens to clip based on covariance
- `trainer.algorithm.clip_cov.clip_cov_lb=1.0` - lower bound for covariance clipping
- `trainer.algorithm.clip_cov.clip_cov_ub=5.0` - upper bound for covariance clipping

### Running KL-Cov

```bash
bash examples/algorithms/clip_cov_kl_cov/run_kl_cov.sh
```

**Key parameters:**
- `trainer.algorithm.policy_loss_type="kl_cov"`
- `trainer.algorithm.kl_cov.kl_cov_frac=0.2` - fraction of tokens to apply KL regularization to (0.2 = 20%)
- `trainer.algorithm.kl_cov.ppo_kl_coef=1.0` - coefficient for the KL regularization term

## Configuration

Both methods are configured through the algorithm section of your config:

```yaml
trainer:
  algorithm:
    policy_loss_type: "clip_cov"  # or "kl_cov"

    # Clip-Cov specific parameters
    clip_cov:
      clip_ratio: 0.0002
      clip_cov_lb: 1.0
      clip_cov_ub: 5.0

    # KL-Cov specific parameters
    kl_cov:
      kl_cov_frac: 0.2
      ppo_kl_coef: 1.0
```
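
As a rough illustration of the covariance-based selection both methods share, the sketch below picks the top fraction of response tokens by the per-token covariance term between log-probability and advantage. The helper name, shapes, and the exact centering convention are assumptions for illustration, not the skyrl_train or PRIME-RL implementation. KL-Cov would then apply its KL penalty only on the `selected` tokens, while Clip-Cov would stop the gradient on them.

```python
import torch

def select_high_cov_tokens(log_probs, advantages, response_mask, frac=0.2):
    """Pick the top `frac` of valid response tokens by covariance between
    log-probability and advantage (hypothetical helper, for illustration)."""
    n_valid = response_mask.sum().clamp(min=1)
    # Center log-probs and advantages over valid tokens only.
    lp_c = log_probs - (log_probs * response_mask).sum() / n_valid
    adv_c = advantages - (advantages * response_mask).sum() / n_valid
    cov = (lp_c * adv_c).masked_fill(response_mask == 0, float("-inf"))
    k = max(1, int(frac * n_valid.item()))
    # Boolean mask marking the k highest-covariance tokens across the batch.
    _, top_idx = torch.topk(cov.flatten(), k)
    selected = torch.zeros(cov.numel(), dtype=torch.bool, device=cov.device)
    selected[top_idx] = True
    return selected.view_as(cov)
```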

## Reference

- Paper: https://arxiv.org/abs/2505.22617
- Code: https://github.com/PRIME-RL/Entropy-Mechanism-of-RL
Lines changed: 64 additions & 0 deletions
#!/bin/bash
set -x

# Example of Clip-Cov policy loss training
# Covariance-based clipping for improved training stability on GSM8K.
#
# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/clip_cov_kl_cov/run_clip_cov.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb"  # change to "console" to print to stdout

# Configure Clip-Cov parameters
POLICY_LOSS="clip_cov"
CLIP_COV_RATIO=0.0002
CLIP_COV_LB=1.0
CLIP_COV_UB=5.0

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  data.train_data="['$DATA_DIR/train.parquet']" \
  data.val_data="['$DATA_DIR/validation.parquet']" \
  trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
  trainer.algorithm.clip_cov.clip_ratio=$CLIP_COV_RATIO \
  trainer.algorithm.clip_cov.clip_cov_lb=$CLIP_COV_LB \
  trainer.algorithm.clip_cov.clip_cov_ub=$CLIP_COV_UB \
  trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
  trainer.placement.colocate_all=true \
  trainer.strategy=fsdp2 \
  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
  trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
  generator.num_inference_engines=$NUM_GPUS \
  generator.inference_engine_tensor_parallel_size=1 \
  trainer.epochs=20 \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=true \
  trainer.eval_interval=5 \
  trainer.update_epochs_per_batch=1 \
  trainer.train_batch_size=1024 \
  trainer.policy_mini_batch_size=256 \
  trainer.micro_forward_batch_size_per_gpu=64 \
  trainer.micro_train_batch_size_per_gpu=64 \
  trainer.ckpt_interval=10 \
  trainer.max_prompt_length=512 \
  generator.sampling_params.max_generate_length=1024 \
  trainer.policy.optimizer_config.lr=1.0e-6 \
  trainer.algorithm.use_kl_loss=true \
  trainer.algorithm.kl_loss_coef=0.001 \
  generator.backend=vllm \
  generator.run_engines_locally=true \
  generator.weight_sync_backend=nccl \
  generator.async_engine=true \
  generator.batched=true \
  environment.env_class=gsm8k \
  generator.n_samples_per_prompt=5 \
  generator.gpu_memory_utilization=0.8 \
  trainer.logger="$LOGGER" \
  trainer.project_name="clip_cov_gsm8k" \
  trainer.run_name="clip_cov_gsm8k_test" \
  trainer.resume_mode=null \
  trainer.ckpt_path="$HOME/ckpts/clip_cov_gsm8k_1.5B_ckpt" \
$@
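
A minimal sketch of a Clip-Cov-style loss is shown below, assuming the `selected` mask comes from a covariance-based top-fraction pick like the one sketched in the README above. Here `eps` is the usual PPO clip range (left at an assumed default, since this script only sets the covariance knobs); the function name and shapes are illustrative, not the skyrl_train implementation.

```python
import torch

def clip_cov_policy_loss(log_probs, old_log_probs, advantages, response_mask,
                         selected, eps=0.2):
    """Clip-Cov-style loss sketch: standard PPO clipped objective, with the
    gradient additionally stopped on covariance-selected outlier tokens."""
    ratio = torch.exp(log_probs - old_log_probs)
    pg = -torch.min(ratio * advantages,
                    torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    # Detach the selected high-covariance tokens so they no longer push updates.
    pg = torch.where(selected, pg.detach(), pg)
    return (pg * response_mask).sum() / response_mask.sum().clamp(min=1)
```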
Lines changed: 63 additions & 0 deletions
#!/bin/bash
set -x

# Example of KL-Cov policy loss training
# Uses covariance-based selection to apply KL regularization to a subset of tokens
# for improved training stability on GSM8K.
#
# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/clip_cov_kl_cov/run_kl_cov.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb"  # change to "console" to print to stdout

# Configure KL-Cov parameters
POLICY_LOSS="kl_cov"
KL_COV_FRAC=0.2
PPO_KL_COEF=1.0

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  data.train_data="['$DATA_DIR/train.parquet']" \
  data.val_data="['$DATA_DIR/validation.parquet']" \
  trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
  trainer.algorithm.kl_cov.kl_cov_frac=$KL_COV_FRAC \
  trainer.algorithm.kl_cov.ppo_kl_coef=$PPO_KL_COEF \
  trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
  trainer.placement.colocate_all=true \
  trainer.strategy=fsdp2 \
  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
  trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
  generator.num_inference_engines=$NUM_GPUS \
  generator.inference_engine_tensor_parallel_size=1 \
  trainer.epochs=20 \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=true \
  trainer.eval_interval=5 \
  trainer.update_epochs_per_batch=1 \
  trainer.train_batch_size=1024 \
  trainer.policy_mini_batch_size=256 \
  trainer.micro_forward_batch_size_per_gpu=64 \
  trainer.micro_train_batch_size_per_gpu=64 \
  trainer.ckpt_interval=10 \
  trainer.max_prompt_length=512 \
  generator.sampling_params.max_generate_length=1024 \
  trainer.policy.optimizer_config.lr=1.0e-6 \
  trainer.algorithm.use_kl_loss=true \
  trainer.algorithm.kl_loss_coef=0.001 \
  generator.backend=vllm \
  generator.run_engines_locally=true \
  generator.weight_sync_backend=nccl \
  generator.async_engine=true \
  generator.batched=true \
  environment.env_class=gsm8k \
  generator.n_samples_per_prompt=5 \
  generator.gpu_memory_utilization=0.8 \
  trainer.logger="$LOGGER" \
  trainer.project_name="kl_cov_gsm8k" \
  trainer.run_name="kl_cov_gsm8k_test" \
  trainer.resume_mode=null \
  trainer.ckpt_path="$HOME/ckpts/kl_cov_gsm8k_1.5B_ckpt" \
$@
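
For intuition, a minimal KL-Cov-style loss sketch follows: a plain policy-gradient term everywhere, plus a KL penalty weighted by `ppo_kl_coef` on the covariance-selected tokens only. The quadratic (k2-style) KL estimate and the way the penalty combines with the policy-gradient term are assumptions for illustration; the PRIME-RL reference implementation and skyrl_train may differ in these details.

```python
import torch

def kl_cov_policy_loss(log_probs, old_log_probs, advantages, response_mask,
                       selected, ppo_kl_coef=1.0):
    """KL-Cov-style loss sketch: plain policy-gradient term everywhere, plus a
    KL penalty on the covariance-selected tokens only."""
    ratio = torch.exp(log_probs - old_log_probs)
    pg = -ratio * advantages
    # Simple quadratic (k2-style) per-token KL estimate between new and old policy.
    kl = 0.5 * (log_probs - old_log_probs) ** 2
    loss = torch.where(selected, pg + ppo_kl_coef * kl, pg)
    return (loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```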
Lines changed: 56 additions & 0 deletions
"""
uv run --isolated --extra vllm -m examples.algorithm.custom_advantage_estimator.main_custom_adv_est
"""

import ray
import hydra
import torch
import numpy as np
from omegaconf import DictConfig
from skyrl_train.utils import initialize_ray
from skyrl_train.entrypoints.main_base import BasePPOExp, config_dir, validate_cfg
from skyrl_train.utils.ppo_utils import AdvantageEstimatorRegistry


# Example of custom advantage estimator: "simple_baseline"
def compute_simple_baseline_advantage(
    token_level_rewards: torch.Tensor, response_mask: torch.Tensor, index: np.ndarray, **kwargs
):
    """
    A simple custom advantage estimator that uses response-level rewards
    and computes advantages against a simple baseline.

    This is just an example - replace with your own logic.
    """
    with torch.no_grad():
        response_rewards = (token_level_rewards * response_mask).sum(dim=-1, keepdim=True)

        # Simple baseline: use the mean reward across the batch
        baseline = response_rewards.mean()
        advantages = (response_rewards - baseline) * response_mask
        returns = advantages.clone()

    return advantages, returns


# Register the custom advantage estimator
AdvantageEstimatorRegistry.register("simple_baseline", compute_simple_baseline_advantage)


@ray.remote(num_cpus=1)
def skyrl_entrypoint(cfg: DictConfig):
    exp = BasePPOExp(cfg)
    exp.run()


@hydra.main(config_path=config_dir, config_name="ppo_base_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # validate the arguments
    validate_cfg(cfg)

    initialize_ray(cfg)
    ray.get(skyrl_entrypoint.remote(cfg))


if __name__ == "__main__":
    main()

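Since the registry above is the extension point, a hedged variant is sketched below: a per-prompt-group baseline that uses the `index` array (which groups samples generated from the same prompt) instead of a single batch-wide mean. The grouping logic and the `"group_baseline"` name are illustrative assumptions, not part of the example file; selecting the registered estimator at launch time would then go through the usual config override (the exact key may differ from what is assumed here).

```python
import numpy as np
import torch

# Hypothetical variant of compute_simple_baseline_advantage: baseline each
# response against the mean reward of all samples from the same prompt group.
def compute_group_baseline_advantage(token_level_rewards, response_mask, index, **kwargs):
    with torch.no_grad():
        response_rewards = (token_level_rewards * response_mask).sum(dim=-1, keepdim=True)
        baselines = torch.zeros_like(response_rewards)
        for group in np.unique(index):
            rows = torch.as_tensor(index == group, device=response_rewards.device)
            baselines[rows] = response_rewards[rows].mean()
        advantages = (response_rewards - baselines) * response_mask
    return advantages, advantages.clone()

# Registered the same way as the example above:
# AdvantageEstimatorRegistry.register("group_baseline", compute_group_baseline_advantage)
```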