57 changes: 57 additions & 0 deletions MIGRATION.md
@@ -0,0 +1,57 @@
# Migrating from TRL v0 to v1

This guide covers the breaking changes introduced in TRL v1 and how to update your code. Most structural changes (trainers moved to experimental, removed model classes, etc.) already shipped in v0.29 — if you're already on v0.29, this migration is minimal.

## Changed defaults

| Config | Parameter | v0 default | v1 default | Action needed |
|---|---|---|---|---|
| `GRPOConfig` | `vllm_mode` | `"server"` | `"colocate"` | If you use `use_vllm=True` without specifying `vllm_mode`, vLLM will now run in the same process instead of connecting to a separate server. Set `vllm_mode="server"` explicitly if you rely on server mode. |
| `RLOOConfig` | `vllm_mode` | `"server"` | `"colocate"` | Same as above. |
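
If a pipeline depends on the old behavior, the safest migration is to pin the mode explicitly. A minimal sketch (`output_dir` is a placeholder; the other training arguments are whatever your run already uses):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-run",  # placeholder
    use_vllm=True,
    # v1 defaults to "colocate"; pin "server" to keep connecting to a
    # separately launched `trl vllm-serve` process, as in v0.
    vllm_mode="server",
)
```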

## Already changed in v0.29

The following changes were introduced in v0.29 and are **not new in v1**. They are listed here for completeness if you are migrating from an earlier version.

<details>
<summary>Trainers moved to experimental</summary>

Several trainers were moved from the stable API to `trl.experimental`. They are no longer importable from `trl` directly (except KTO, which still has a compatibility shim with a deprecation warning).

| Trainer | New import |
|---|---|
| PPO | `from trl.experimental.ppo import PPOTrainer, PPOConfig` |
| CPO | `from trl.experimental.cpo import CPOTrainer, CPOConfig` |
| BCO | `from trl.experimental.bco import BCOTrainer, BCOConfig` |
| ORPO | `from trl.experimental.orpo import ORPOTrainer, ORPOConfig` |
| XPO | `from trl.experimental.xpo import XPOTrainer, XPOConfig` |
| Online DPO | `from trl.experimental.online_dpo import OnlineDPOTrainer, OnlineDPOConfig` |
| GKD | `from trl.experimental.gkd import GKDTrainer, GKDConfig` |
| Nash-MD | `from trl.experimental.nash_md import NashMDTrainer, NashMDConfig` |
| PRM | `from trl.experimental.prm import PRMTrainer, PRMConfig` |
| KTO | `from trl.experimental.kto import KTOTrainer, KTOConfig` |
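
A typical update, shown here for PPO (the same pattern applies to every row in the table):

```python
# v0 — no longer works for these trainers:
# from trl import PPOTrainer, PPOConfig

# v0.29 and later:
from trl.experimental.ppo import PPOTrainer, PPOConfig
```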

</details>

<details>
<summary>Removed model classes</summary>

| Class | New location |
|---|---|
| `AutoModelForCausalLMWithValueHead` | `trl.experimental.ppo` |
| `AutoModelForSeq2SeqLMWithValueHead` | `trl.experimental.ppo` |
| `PreTrainedModelWrapper` | `trl.experimental.ppo` |
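
The import update mirrors the trainer moves above; for instance (the model name is a placeholder):

```python
# v0:
# from trl import AutoModelForCausalLMWithValueHead

# v0.29 and later:
from trl.experimental.ppo import AutoModelForCausalLMWithValueHead

model = AutoModelForCausalLMWithValueHead.from_pretrained("Qwen/Qwen2.5-0.5B")
```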

</details>

<details>
<summary>Removed callbacks and utilities</summary>

| What | New location |
|---|---|
| `WinRateCallback` | `trl.experimental.winrate_callback` |
| Judges | `trl.experimental.judges` |
| `peft_module_casting_to_bf16` | `trl.experimental.utils` |
| `FDivergenceType` enum | Removed. Use string values (`"reverse_kl"`, `"js_divergence"`, `"alpha_divergence"`) directly. |
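
For the `FDivergenceType` removal, replace the enum member with its string value. A sketch assuming `DPOConfig`, whose `f_divergence_type` parameter accepted the enum in v0:

```python
from trl import DPOConfig

# v0:
# from trl import FDivergenceType
# training_args = DPOConfig(f_divergence_type=FDivergenceType.REVERSE_KL)

# v1 — pass the string value directly:
training_args = DPOConfig(f_divergence_type="reverse_kl")
```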

</details>
32 changes: 16 additions & 16 deletions docs/source/grpo_trainer.md
@@ -206,7 +206,20 @@ We support two ways of using vLLM during training: **server mode** and **colocate mode**
> [!TIP]
> By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling).

#### 🔌 Option 1: Server mode
#### Option 1: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True, # vllm_mode="colocate" by default
)
```

#### Option 2: Server mode

In this mode, vLLM runs in a separate process (on separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.

@@ -224,27 +237,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
vllm_mode="server",
)
```

> [!WARNING]
> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.

#### 🧩 Option 2: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```

> [!TIP]
> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors.
>
@@ -349,6 +348,7 @@ def main():
training_args = GRPOConfig(
per_device_train_batch_size=4,
use_vllm=True,
vllm_mode="server",
vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."), # from ip-X-X-X-X to X.X.X.X
)

32 changes: 16 additions & 16 deletions docs/source/rloo_trainer.md
@@ -161,7 +161,20 @@ pip install trl[vllm]

We support two ways of using vLLM during training: **server mode** and **colocate mode**.

#### 🔌 Option 1: Server mode
#### Option 1: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

```python
from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True, # vllm_mode="colocate" by default
)
```

#### Option 2: Server mode

In this mode, vLLM runs in a separate process (on separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.

@@ -179,27 +192,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
vllm_mode="server",
)
```

> [!WARNING]
> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.

#### 🧩 Option 2: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.

```python
from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```

> [!TIP]
> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors.
>
@@ -278,6 +277,7 @@ def main():
per_device_train_batch_size=4,
bf16=True,
use_vllm=True,
vllm_mode="server",
vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."), # from ip-X-X-X-X to X.X.X.X
)

6 changes: 3 additions & 3 deletions docs/source/speeding_up_training.md
@@ -27,7 +27,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl.experimental.online_dpo import OnlineDPOConfig

training_args = OnlineDPOConfig(..., use_vllm=True)
training_args = OnlineDPOConfig(..., use_vllm=True, vllm_mode="server")
```

</hfoption>
@@ -44,7 +44,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl import GRPOConfig

training_args = GRPOConfig(..., use_vllm=True)
training_args = GRPOConfig(..., use_vllm=True, vllm_mode="server")
```

You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
@@ -78,7 +78,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl import RLOOConfig

training_args = RLOOConfig(..., use_vllm=True)
training_args = RLOOConfig(..., use_vllm=True, vllm_mode="server")
```

You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
49 changes: 22 additions & 27 deletions docs/source/vllm_integration.md
@@ -52,7 +52,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
model="Qwen/Qwen2.5-7B",
args=GRPOConfig(use_vllm=True),
args=GRPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -72,7 +72,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = OnlineDPOTrainer(
model="Qwen/Qwen2.5-7B",
args=OnlineDPOConfig(use_vllm=True),
args=OnlineDPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -92,7 +92,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = NashMDTrainer(
model="Qwen/Qwen2.5-7B",
args=NashMDConfig(use_vllm=True),
args=NashMDConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -112,7 +112,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = XPOTrainer(
model="Qwen/Qwen2.5-7B",
args=XPOConfig(use_vllm=True),
args=XPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -132,7 +132,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = RLOOTrainer(
model="Qwen/Qwen2.5-7B",
args=RLOOConfig(use_vllm=True),
args=RLOOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -276,12 +276,12 @@ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/

### Modes of Using vLLM During Training

TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**.
TRL supports **two modes** for integrating vLLM during training: **colocate mode** (default) and **server mode**.

#### Server Mode
#### Colocate Mode

In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
This setup is ideal if you have GPUs dedicated to inference.
In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

Example configuration:

@@ -293,8 +293,7 @@ from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -306,8 +305,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig

training_args = OnlineDPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -319,8 +317,7 @@ from trl.experimental.nash_md import NashMDConfig

training_args = NashMDConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -332,8 +329,7 @@ from trl.experimental.xpo import XPOConfig

training_args = XPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -345,18 +341,17 @@ from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

</hfoption>
</hfoptions>

#### Colocate Mode
#### Server Mode

In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
This setup is ideal if you have GPUs dedicated to inference.

Example configuration:

@@ -369,7 +364,7 @@ from trl import GRPOConfig
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -382,7 +377,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
training_args = OnlineDPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -395,7 +390,7 @@ from trl.experimental.nash_md import NashMDConfig
training_args = NashMDConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -408,7 +403,7 @@ from trl.experimental.xpo import XPOConfig
training_args = XPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -421,7 +416,7 @@ from trl import RLOOConfig
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

5 changes: 3 additions & 2 deletions tests/experimental/test_online_dpo_trainer.py
@@ -241,7 +241,7 @@ def test_training_with_judge(self, config_name):
@require_torch_accelerator
@require_vllm
@pytest.mark.slow
def test_training_with_vllm(self, config_name):
def test_training_with_vllm_server(self, config_name):
def cleanup_vllm_communicator(trainer):
"""Clean up vLLM communicator to avoid conflicts between test runs"""
try:
@@ -258,6 +258,7 @@ def cleanup_vllm_communicator(trainer):
training_args = OnlineDPOConfig(
output_dir=self.tmp_dir,
use_vllm=True,
vllm_mode="server",
vllm_gpu_memory_utilization=0.2,
report_to="none",
)
@@ -351,7 +352,7 @@ def test_vllm_config_validation(self):

# Test default values
config = OnlineDPOConfig()
assert config.vllm_mode == "server"
assert config.vllm_mode == "colocate"
assert config.vllm_server_base_url is None
assert config.vllm_server_host == "0.0.0.0"
assert config.vllm_server_port == 8000