
Commit 8c3bd9d

[migration] copy old docs, examples, integrations, scripts (#1133)
Copy over old docs, examples, integrations, and scripts (WIP) to make sure all of these run against the new refactored code; import paths will still need to be changed and tested. cc: @CharlieFRuan
1 parent b4355c1 commit 8c3bd9d

205 files changed

Lines changed: 17486 additions & 0 deletions


skyrl/examples/train/README.md

Lines changed: 43 additions & 0 deletions
# SkyRL-Train Examples

Welcome to the SkyRL-Train examples! This folder contains the following examples.

## Algorithms

- `algorithms/`: Examples for how to configure and run RL with various algorithms and policy-loss variants (e.g., DAPO, SAPO, GRPO, CISPO, GSPO, or your own custom advantage estimators and custom policy losses).
- `ppo/`: Vanilla PPO training (with a critic, ref, and policy model).
- `on_policy_distillation/`: [On-policy distillation recipe](https://novasky-ai.notion.site/on-policy-distillation) that uses a teacher model to provide dense token-level rewards during training, reproducing results from the [Thinking Machines blog](https://thinkingmachines.ai/blog/on-policy-distillation/).
- `tis_correction/`: Applying [Flash-RL TIS](https://fengyao.notion.site/off-policy-rl) correction to improve off-policy stability.
- `turn_level_rewards/`: GSM8K multi-turn environment illustrating turn-level rewards and custom advantage estimators.

## Async RL

- `async/`: One-step off-policy GRPO with an asynchronous generator–trainer loop.
- `fully_async/`: Fully asynchronous (PipelineRL/AReal-style) GRPO training with in-flight weight updates. [See the docs for the full design and details](https://docs.skyrl.ai/docs/tutorials/one_step_off_async).

## Tasks

- `gsm8k/`: Basic GSM8K math word-problem dataset utilities and baseline training/generation scripts.
- `llm_as_a_judge/`: GSM8K training with an external LLM as a judge to produce rewards instead of strict exact-match grading.
- `multiply/`: Toy arithmetic environment for multiplying numbers, useful for quick sanity checks and debugging.
- `livecodebench/`: LiveCodeBench code-generation task setup and training scripts.
- `text_to_sql/`: [Text-to-SQL (SkyRL-SQL)](https://docs.skyrl.ai/docs/examples/multi_turn_text2sql) environment and training scripts for mapping natural-language questions to SQL queries.
- `step_wise/`: Step-wise training for chat-template-agnostic multi-turn RL training.
- `search/`: Multi-turn search agent training with the SearchR1 dataset, backed by a FAISS-based retriever server.

## Integrations

- `flash_rl/`: Integration with [FlashRL's](https://fengyao.notion.site/flash-rl) patched vLLM inference engine for high-throughput RL training.
- `harbor/`: Custom [Harbor](https://harborframework.com/) Generator for training agents to solve TerminalBench tasks.
- `mini_swe_agent/`: Integration with [Mini-SWE-Agent](https://github.com/SWE-agent/mini-swe-agent) to train coding agents on SWE-Bench via SkyRL.
- `../integrations/verifiers/`: Integration with PrimeIntellect's [Verifiers library](https://github.com/PrimeIntellect-ai/verifiers) and [Environments Hub](https://app.primeintellect.ai/dashboard/environments).
- `../integrations/openenv/`: Integration with the HuggingFace/Meta [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment framework.

## Large Scale Model Training

- `megatron/`: Examples for running SkyRL with the Megatron backend for 5D parallelism.
- `moe/`: Work-in-progress MoE training example used for developing and testing large-scale multi-node Mixture-of-Experts support.
- `gptoss/`: Training example for the GPT-OSS-20B model using patched attention to support attention sinks.

## Features and More

- `lora/`: LoRA RL fine-tuning recipes.
- `remote_inference_engine/`: Scripts for running remote vLLM/SGLang inference servers and connecting them to SkyRL.
- `training_backends/`: Runner scripts demonstrating how to use different training backends with SkyRL.
Lines changed: 61 additions & 0 deletions
#!/bin/bash
set -x

# Example of CISPO policy loss training
# Clipped Importance Sampling Weight Policy Optimization (CISPO) for better RL efficiency

# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/cispo/run_cispo_gsm8k.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb"  # change to "console" to print to stdout

# Configure CISPO parameters
POLICY_LOSS="cispo"
CISPO_EPS_CLIP_LOW=0
CISPO_EPS_CLIP_HIGH=5
USE_KL_LOSS=false

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  data.train_data="['$DATA_DIR/train.parquet']" \
  data.val_data="['$DATA_DIR/validation.parquet']" \
  trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
  trainer.algorithm.cispo.cispo_eps_clip_low=$CISPO_EPS_CLIP_LOW \
  trainer.algorithm.cispo.cispo_eps_clip_high=$CISPO_EPS_CLIP_HIGH \
  trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
  trainer.placement.colocate_all=true \
  trainer.strategy=fsdp2 \
  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
  generator.num_inference_engines=$NUM_GPUS \
  generator.inference_engine_tensor_parallel_size=1 \
  trainer.epochs=20 \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=true \
  trainer.eval_interval=5 \
  trainer.update_epochs_per_batch=1 \
  trainer.train_batch_size=1024 \
  trainer.policy_mini_batch_size=256 \
  trainer.micro_forward_batch_size_per_gpu=64 \
  trainer.micro_train_batch_size_per_gpu=64 \
  trainer.ckpt_interval=10 \
  trainer.max_prompt_length=512 \
  generator.sampling_params.max_generate_length=1024 \
  trainer.policy.optimizer_config.lr=1.0e-6 \
  trainer.algorithm.use_kl_loss=$USE_KL_LOSS \
  generator.backend=vllm \
  generator.run_engines_locally=true \
  generator.weight_sync_backend=nccl \
  generator.async_engine=true \
  generator.batched=true \
  environment.env_class=gsm8k \
  generator.n_samples_per_prompt=5 \
  generator.gpu_memory_utilization=0.8 \
  trainer.logger="$LOGGER" \
  trainer.project_name="cispo_gsm8k" \
  trainer.run_name="cispo_gsm8k_test" \
  trainer.resume_mode=null \
  trainer.ckpt_path="$HOME/ckpts/cispo_gsm8k_1.5B_ckpt" \
$@
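
For reference, a minimal sketch of a CISPO-style objective follows: the importance-sampling weight is clipped and detached, so every token (including clipped ones) still contributes a policy-gradient signal through its log-probability. The function name, the clip-bound convention for `cispo_eps_clip_low`/`cispo_eps_clip_high`, and the tensor shapes are assumptions for illustration, not the exact skyrl_train implementation.

```python
import torch

def cispo_policy_loss(log_probs, old_log_probs, advantages, response_mask,
                      eps_clip_low=0.0, eps_clip_high=5.0):
    """CISPO-style loss sketch: clip and detach the IS weight, keep the
    gradient flowing through log_probs for every response token."""
    ratio = torch.exp(log_probs - old_log_probs)
    # Clipped, gradient-stopped importance-sampling weight.
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_clip_low, 1.0 + eps_clip_high).detach()
    per_token_loss = -clipped_ratio * advantages * log_probs
    # Average over valid response tokens.
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```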
Lines changed: 74 additions & 0 deletions
# Clip-Cov and KL-Cov Policy Loss Examples

This directory contains examples for using **Clip-Cov** and **KL-Cov** policy loss functions, based on the implementation from [PRIME-RL/Entropy-Mechanism-of-RL](https://github.com/PRIME-RL/Entropy-Mechanism-of-RL).

## Overview

Both methods improve training stability through covariance-based token selection (a rough sketch of this selection appears after the configuration example below):

- **Clip-Cov**: Combines standard PPO clipping with covariance-based correction masking
- **KL-Cov**: Applies KL regularization to tokens selected based on covariance values

## Usage

### Prerequisites

1. Prepare GSM8K data:
```bash
uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
```

2. Set up Weights & Biases (optional):
```bash
export WANDB_API_KEY=<your_key_here>
```

### Running Clip-Cov

```bash
bash examples/algorithms/clip_cov_kl_cov/run_clip_cov.sh
```

**Key parameters:**
- `trainer.algorithm.policy_loss_type="clip_cov"`
- `trainer.algorithm.clip_cov.clip_ratio=0.0002` - fraction of tokens to clip based on covariance
- `trainer.algorithm.clip_cov.clip_cov_lb=1.0` - lower bound for covariance clipping
- `trainer.algorithm.clip_cov.clip_cov_ub=5.0` - upper bound for covariance clipping

### Running KL-Cov

```bash
bash examples/algorithms/clip_cov_kl_cov/run_kl_cov.sh
```

**Key parameters:**
- `trainer.algorithm.policy_loss_type="kl_cov"`
- `trainer.algorithm.kl_cov.kl_cov_frac=0.2` - fraction of tokens to apply KL regularization to (0.2 = 20%)
- `trainer.algorithm.kl_cov.ppo_kl_coef=1.0` - coefficient for the KL regularization term

## Configuration

Both methods are configured through the algorithm section of your config:

```yaml
trainer:
  algorithm:
    policy_loss_type: "clip_cov"  # or "kl_cov"

    # Clip-Cov specific parameters
    clip_cov:
      clip_ratio: 0.0002
      clip_cov_lb: 1.0
      clip_cov_ub: 5.0

    # KL-Cov specific parameters
    kl_cov:
      kl_cov_frac: 0.2
      ppo_kl_coef: 1.0
```
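
As a rough illustration of the covariance-based selection both methods share, the sketch below picks the top fraction of response tokens by the per-token covariance term between log-probability and advantage. The helper name, shapes, and the exact centering convention are assumptions for illustration, not the skyrl_train or PRIME-RL implementation. KL-Cov would then apply its KL penalty only on the `selected` tokens, while Clip-Cov would stop the gradient on them.

```python
import torch

def select_high_cov_tokens(log_probs, advantages, response_mask, frac=0.2):
    """Pick the top `frac` of valid response tokens by covariance between
    log-probability and advantage (hypothetical helper, for illustration)."""
    n_valid = response_mask.sum().clamp(min=1)
    # Center log-probs and advantages over valid tokens only.
    lp_c = log_probs - (log_probs * response_mask).sum() / n_valid
    adv_c = advantages - (advantages * response_mask).sum() / n_valid
    cov = (lp_c * adv_c).masked_fill(response_mask == 0, float("-inf"))
    k = max(1, int(frac * n_valid.item()))
    # Boolean mask marking the k highest-covariance tokens across the batch.
    _, top_idx = torch.topk(cov.flatten(), k)
    selected = torch.zeros(cov.numel(), dtype=torch.bool, device=cov.device)
    selected[top_idx] = True
    return selected.view_as(cov)
```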

## Reference

- Paper: https://arxiv.org/abs/2505.22617
- Code: https://github.com/PRIME-RL/Entropy-Mechanism-of-RL
Lines changed: 64 additions & 0 deletions
#!/bin/bash
set -x

# Example of Clip-Cov policy loss training
# Covariance-based clipping for improved training stability on GSM8K.
#
# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/clip_cov_kl_cov/run_clip_cov.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb"  # change to "console" to print to stdout

# Configure Clip-Cov parameters
POLICY_LOSS="clip_cov"
CLIP_COV_RATIO=0.0002
CLIP_COV_LB=1.0
CLIP_COV_UB=5.0

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  data.train_data="['$DATA_DIR/train.parquet']" \
  data.val_data="['$DATA_DIR/validation.parquet']" \
  trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
  trainer.algorithm.clip_cov.clip_ratio=$CLIP_COV_RATIO \
  trainer.algorithm.clip_cov.clip_cov_lb=$CLIP_COV_LB \
  trainer.algorithm.clip_cov.clip_cov_ub=$CLIP_COV_UB \
  trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
  trainer.placement.colocate_all=true \
  trainer.strategy=fsdp2 \
  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
  trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
  generator.num_inference_engines=$NUM_GPUS \
  generator.inference_engine_tensor_parallel_size=1 \
  trainer.epochs=20 \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=true \
  trainer.eval_interval=5 \
  trainer.update_epochs_per_batch=1 \
  trainer.train_batch_size=1024 \
  trainer.policy_mini_batch_size=256 \
  trainer.micro_forward_batch_size_per_gpu=64 \
  trainer.micro_train_batch_size_per_gpu=64 \
  trainer.ckpt_interval=10 \
  trainer.max_prompt_length=512 \
  generator.sampling_params.max_generate_length=1024 \
  trainer.policy.optimizer_config.lr=1.0e-6 \
  trainer.algorithm.use_kl_loss=true \
  trainer.algorithm.kl_loss_coef=0.001 \
  generator.backend=vllm \
  generator.run_engines_locally=true \
  generator.weight_sync_backend=nccl \
  generator.async_engine=true \
  generator.batched=true \
  environment.env_class=gsm8k \
  generator.n_samples_per_prompt=5 \
  generator.gpu_memory_utilization=0.8 \
  trainer.logger="$LOGGER" \
  trainer.project_name="clip_cov_gsm8k" \
  trainer.run_name="clip_cov_gsm8k_test" \
  trainer.resume_mode=null \
  trainer.ckpt_path="$HOME/ckpts/clip_cov_gsm8k_1.5B_ckpt" \
$@
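
A minimal sketch of a Clip-Cov-style loss is shown below, assuming the `selected` mask comes from a covariance-based top-fraction pick like the one sketched in the README above. Here `eps` is the usual PPO clip range (left at an assumed default, since this script only sets the covariance knobs); the function name and shapes are illustrative, not the skyrl_train implementation.

```python
import torch

def clip_cov_policy_loss(log_probs, old_log_probs, advantages, response_mask,
                         selected, eps=0.2):
    """Clip-Cov-style loss sketch: standard PPO clipped objective, with the
    gradient additionally stopped on covariance-selected outlier tokens."""
    ratio = torch.exp(log_probs - old_log_probs)
    pg = -torch.min(ratio * advantages,
                    torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    # Detach the selected high-covariance tokens so they no longer push updates.
    pg = torch.where(selected, pg.detach(), pg)
    return (pg * response_mask).sum() / response_mask.sum().clamp(min=1)
```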
Lines changed: 63 additions & 0 deletions
#!/bin/bash
set -x

# Example of KL-Cov policy loss training
# Uses covariance-based selection to apply KL regularization to a subset of tokens
# for improved training stability on GSM8K.
#
# Run data preparation first:
# uv run examples/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k
# export WANDB_API_KEY=<your_key_here>
# bash examples/algorithms/clip_cov_kl_cov/run_kl_cov.sh

DATA_DIR="$HOME/data/gsm8k"
NUM_GPUS=4
LOGGER="wandb"  # change to "console" to print to stdout

# Configure KL-Cov parameters
POLICY_LOSS="kl_cov"
KL_COV_FRAC=0.2
PPO_KL_COEF=1.0

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  data.train_data="['$DATA_DIR/train.parquet']" \
  data.val_data="['$DATA_DIR/validation.parquet']" \
  trainer.algorithm.policy_loss_type="$POLICY_LOSS" \
  trainer.algorithm.kl_cov.kl_cov_frac=$KL_COV_FRAC \
  trainer.algorithm.kl_cov.ppo_kl_coef=$PPO_KL_COEF \
  trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
  trainer.placement.colocate_all=true \
  trainer.strategy=fsdp2 \
  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
  trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
  generator.num_inference_engines=$NUM_GPUS \
  generator.inference_engine_tensor_parallel_size=1 \
  trainer.epochs=20 \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=true \
  trainer.eval_interval=5 \
  trainer.update_epochs_per_batch=1 \
  trainer.train_batch_size=1024 \
  trainer.policy_mini_batch_size=256 \
  trainer.micro_forward_batch_size_per_gpu=64 \
  trainer.micro_train_batch_size_per_gpu=64 \
  trainer.ckpt_interval=10 \
  trainer.max_prompt_length=512 \
  generator.sampling_params.max_generate_length=1024 \
  trainer.policy.optimizer_config.lr=1.0e-6 \
  trainer.algorithm.use_kl_loss=true \
  trainer.algorithm.kl_loss_coef=0.001 \
  generator.backend=vllm \
  generator.run_engines_locally=true \
  generator.weight_sync_backend=nccl \
  generator.async_engine=true \
  generator.batched=true \
  environment.env_class=gsm8k \
  generator.n_samples_per_prompt=5 \
  generator.gpu_memory_utilization=0.8 \
  trainer.logger="$LOGGER" \
  trainer.project_name="kl_cov_gsm8k" \
  trainer.run_name="kl_cov_gsm8k_test" \
  trainer.resume_mode=null \
  trainer.ckpt_path="$HOME/ckpts/kl_cov_gsm8k_1.5B_ckpt" \
$@
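
For intuition, a minimal KL-Cov-style loss sketch follows: a plain policy-gradient term everywhere, plus a KL penalty weighted by `ppo_kl_coef` on the covariance-selected tokens only. The quadratic (k2-style) KL estimate and the way the penalty combines with the policy-gradient term are assumptions for illustration; the PRIME-RL reference implementation and skyrl_train may differ in these details.

```python
import torch

def kl_cov_policy_loss(log_probs, old_log_probs, advantages, response_mask,
                       selected, ppo_kl_coef=1.0):
    """KL-Cov-style loss sketch: plain policy-gradient term everywhere, plus a
    KL penalty on the covariance-selected tokens only."""
    ratio = torch.exp(log_probs - old_log_probs)
    pg = -ratio * advantages
    # Simple quadratic (k2-style) per-token KL estimate between new and old policy.
    kl = 0.5 * (log_probs - old_log_probs) ** 2
    loss = torch.where(selected, pg + ppo_kl_coef * kl, pg)
    return (loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```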
Lines changed: 56 additions & 0 deletions
"""
uv run --isolated --extra vllm -m examples.algorithm.custom_advantage_estimator.main_custom_adv_est
"""

import ray
import hydra
import torch
import numpy as np
from omegaconf import DictConfig
from skyrl_train.utils import initialize_ray
from skyrl_train.entrypoints.main_base import BasePPOExp, config_dir, validate_cfg
from skyrl_train.utils.ppo_utils import AdvantageEstimatorRegistry


# Example of custom advantage estimator: "simple_baseline"
def compute_simple_baseline_advantage(
    token_level_rewards: torch.Tensor, response_mask: torch.Tensor, index: np.ndarray, **kwargs
):
    """
    A simple custom advantage estimator that uses response-level rewards
    and computes advantages against a simple baseline.

    This is just an example - replace with your own logic.
    """
    with torch.no_grad():
        response_rewards = (token_level_rewards * response_mask).sum(dim=-1, keepdim=True)

        # Simple baseline: use the mean reward across the batch
        baseline = response_rewards.mean()
        advantages = (response_rewards - baseline) * response_mask
        returns = advantages.clone()

    return advantages, returns


# Register the custom advantage estimator
AdvantageEstimatorRegistry.register("simple_baseline", compute_simple_baseline_advantage)


@ray.remote(num_cpus=1)
def skyrl_entrypoint(cfg: DictConfig):
    exp = BasePPOExp(cfg)
    exp.run()


@hydra.main(config_path=config_dir, config_name="ppo_base_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # validate the arguments
    validate_cfg(cfg)

    initialize_ray(cfg)
    ray.get(skyrl_entrypoint.remote(cfg))


if __name__ == "__main__":
    main()

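Since the registry above is the extension point, a hedged variant is sketched below: a per-prompt-group baseline that uses the `index` array (which groups samples generated from the same prompt) instead of a single batch-wide mean. The grouping logic and the `"group_baseline"` name are illustrative assumptions, not part of the example file; selecting the registered estimator at launch time would then go through the usual config override (the exact key may differ from what is assumed here).

```python
import numpy as np
import torch

# Hypothetical variant of compute_simple_baseline_advantage: baseline each
# response against the mean reward of all samples from the same prompt group.
def compute_group_baseline_advantage(token_level_rewards, response_mask, index, **kwargs):
    with torch.no_grad():
        response_rewards = (token_level_rewards * response_mask).sum(dim=-1, keepdim=True)
        baselines = torch.zeros_like(response_rewards)
        for group in np.unique(index):
            rows = torch.as_tensor(index == group, device=response_rewards.device)
            baselines[rows] = response_rewards[rows].mean()
        advantages = (response_rewards - baselines) * response_mask
    return advantages, advantages.clone()

# Registered the same way as the example above:
# AdvantageEstimatorRegistry.register("group_baseline", compute_group_baseline_advantage)
```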