57 changes: 57 additions & 0 deletions MIGRATION.md
@@ -0,0 +1,57 @@
# Migrating from TRL v0 to v1

This guide covers the breaking changes introduced in TRL v1 and how to update your code. Most structural changes (trainers moved to experimental, removed model classes, etc.) already shipped in v0.29 — if you're already on v0.29, this migration is minimal.

## Changed defaults

| Config | Parameter | v0 default | v1 default | Action needed |
|---|---|---|---|---|
| `GRPOConfig` | `vllm_mode` | `"server"` | `"colocate"` | If you use `use_vllm=True` without specifying `vllm_mode`, vLLM will now run in the same process instead of connecting to a separate server. Set `vllm_mode="server"` explicitly if you rely on server mode. |
| `RLOOConfig` | `vllm_mode` | `"server"` | `"colocate"` | Same as above. |
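
If a pipeline depends on the old behavior, the safest migration is to pin the mode explicitly. A minimal sketch (`output_dir` is a placeholder; the other training arguments are whatever your run already uses):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="grpo-run",  # placeholder
    use_vllm=True,
    # v1 defaults to "colocate"; pin "server" to keep connecting to a
    # separately launched `trl vllm-serve` process, as in v0.
    vllm_mode="server",
)
```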

## Already changed in v0.29

The following changes were introduced in v0.29 and are **not new in v1**. They are listed here for completeness if you are migrating from an earlier version.

<details>
<summary>Trainers moved to experimental</summary>

Several trainers were moved from the stable API to `trl.experimental`. They are no longer importable from `trl` directly (except KTO, which still has a compatibility shim with a deprecation warning).

| Trainer | New import |
|---|---|
| PPO | `from trl.experimental.ppo import PPOTrainer, PPOConfig` |
| CPO | `from trl.experimental.cpo import CPOTrainer, CPOConfig` |
| BCO | `from trl.experimental.bco import BCOTrainer, BCOConfig` |
| ORPO | `from trl.experimental.orpo import ORPOTrainer, ORPOConfig` |
| XPO | `from trl.experimental.xpo import XPOTrainer, XPOConfig` |
| Online DPO | `from trl.experimental.online_dpo import OnlineDPOTrainer, OnlineDPOConfig` |
| GKD | `from trl.experimental.gkd import GKDTrainer, GKDConfig` |
| Nash-MD | `from trl.experimental.nash_md import NashMDTrainer, NashMDConfig` |
| PRM | `from trl.experimental.prm import PRMTrainer, PRMConfig` |
| KTO | `from trl.experimental.kto import KTOTrainer, KTOConfig` |
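
A typical update, shown here for PPO (the same pattern applies to every row in the table):

```python
# v0 — no longer works for these trainers:
# from trl import PPOTrainer, PPOConfig

# v0.29 and later:
from trl.experimental.ppo import PPOTrainer, PPOConfig
```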

</details>

<details>
<summary>Removed model classes</summary>

| Class | New location |
|---|---|
| `AutoModelForCausalLMWithValueHead` | `trl.experimental.ppo` |
| `AutoModelForSeq2SeqLMWithValueHead` | `trl.experimental.ppo` |
| `PreTrainedModelWrapper` | `trl.experimental.ppo` |
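
The import update mirrors the trainer moves above; for instance (the model name is a placeholder):

```python
# v0:
# from trl import AutoModelForCausalLMWithValueHead

# v0.29 and later:
from trl.experimental.ppo import AutoModelForCausalLMWithValueHead

model = AutoModelForCausalLMWithValueHead.from_pretrained("Qwen/Qwen2.5-0.5B")
```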

</details>

<details>
<summary>Removed callbacks and utilities</summary>

| What | New location |
|---|---|
| `WinRateCallback` | `trl.experimental.winrate_callback` |
| Judges | `trl.experimental.judges` |
| `peft_module_casting_to_bf16` | `trl.experimental.utils` |
| `FDivergenceType` enum | Removed. Use string values (`"reverse_kl"`, `"js_divergence"`, `"alpha_divergence"`) directly. |
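
For the `FDivergenceType` removal, replace the enum member with its string value. A sketch assuming `DPOConfig`, whose `f_divergence_type` parameter accepted the enum in v0:

```python
from trl import DPOConfig

# v0:
# from trl import FDivergenceType
# training_args = DPOConfig(f_divergence_type=FDivergenceType.REVERSE_KL)

# v1 — pass the string value directly:
training_args = DPOConfig(f_divergence_type="reverse_kl")
```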

</details>
32 changes: 16 additions & 16 deletions docs/source/grpo_trainer.md
@@ -206,7 +206,20 @@ We support two ways of using vLLM during training: **server mode** and **colocate mode**
> [!TIP]
> By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling).

#### 🔌 Option 1: Server mode
#### Option 1: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True, # vllm_mode="colocate" by default
)
```

#### Option 2: Server mode

In this mode, vLLM runs in a separate process (on separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.

@@ -224,27 +237,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
vllm_mode="server",
)
```

> [!WARNING]
> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.

#### 🧩 Option 2: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```

> [!TIP]
> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors.
>
@@ -349,6 +348,7 @@ def main():
training_args = GRPOConfig(
per_device_train_batch_size=4,
use_vllm=True,
vllm_mode="server",
vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."), # from ip-X-X-X-X to X.X.X.X
)

32 changes: 16 additions & 16 deletions docs/source/rloo_trainer.md
@@ -161,7 +161,20 @@ pip install trl[vllm]

We support two ways of using vLLM during training: **server mode** and **colocate mode**.

#### 🔌 Option 1: Server mode
#### Option 1: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

```python
from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True, # vllm_mode="colocate" by default
)
```

#### Option 2: Server mode

In this mode, vLLM runs in a separate process (on separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.

@@ -179,27 +192,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
vllm_mode="server",
)
```

> [!WARNING]
> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.

#### 🧩 Option 2: Colocate mode

In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.

```python
from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
)
```

> [!TIP]
> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors.
>
@@ -278,6 +277,7 @@ def main():
per_device_train_batch_size=4,
bf16=True,
use_vllm=True,
vllm_mode="server",
vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."), # from ip-X-X-X-X to X.X.X.X
)

6 changes: 3 additions & 3 deletions docs/source/speeding_up_training.md
@@ -27,7 +27,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl.experimental.online_dpo import OnlineDPOConfig

training_args = OnlineDPOConfig(..., use_vllm=True)
training_args = OnlineDPOConfig(..., use_vllm=True, vllm_mode="server")
```

</hfoption>
@@ -44,7 +44,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl import GRPOConfig

training_args = GRPOConfig(..., use_vllm=True)
training_args = GRPOConfig(..., use_vllm=True, vllm_mode="server")
```

You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
@@ -78,7 +78,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
```python
from trl import RLOOConfig

training_args = RLOOConfig(..., use_vllm=True)
training_args = RLOOConfig(..., use_vllm=True, vllm_mode="server")
```

You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
49 changes: 22 additions & 27 deletions docs/source/vllm_integration.md
@@ -52,7 +52,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
model="Qwen/Qwen2.5-7B",
args=GRPOConfig(use_vllm=True),
args=GRPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -72,7 +72,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = OnlineDPOTrainer(
model="Qwen/Qwen2.5-7B",
args=OnlineDPOConfig(use_vllm=True),
args=OnlineDPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -92,7 +92,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = NashMDTrainer(
model="Qwen/Qwen2.5-7B",
args=NashMDConfig(use_vllm=True),
args=NashMDConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -112,7 +112,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = XPOTrainer(
model="Qwen/Qwen2.5-7B",
args=XPOConfig(use_vllm=True),
args=XPOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -132,7 +132,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = RLOOTrainer(
model="Qwen/Qwen2.5-7B",
args=RLOOConfig(use_vllm=True),
args=RLOOConfig(use_vllm=True, vllm_mode="server"),
reward_funcs=accuracy_reward,
train_dataset=dataset,
)
@@ -276,12 +276,12 @@ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/

### Modes of Using vLLM During Training

TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**.
TRL supports **two modes** for integrating vLLM during training: **colocate mode** (default) and **server mode**.

#### Server Mode
#### Colocate Mode

In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
This setup is ideal if you have GPUs dedicated to inference.
In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.

Example configuration:

@@ -293,8 +293,7 @@ from trl import GRPOConfig

training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -306,8 +305,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig

training_args = OnlineDPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -319,8 +317,7 @@ from trl.experimental.nash_md import NashMDConfig

training_args = NashMDConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -332,8 +329,7 @@ from trl.experimental.xpo import XPOConfig

training_args = XPOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

@@ -345,18 +341,17 @@ from trl import RLOOConfig

training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="server", # default value, can be omitted
use_vllm=True, # vllm_mode="colocate" by default
)
```

</hfoption>
</hfoptions>

#### Colocate Mode
#### Server Mode

In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
This setup is ideal if you have GPUs dedicated to inference.

Example configuration:

@@ -369,7 +364,7 @@ from trl import GRPOConfig
training_args = GRPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -382,7 +377,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
training_args = OnlineDPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -395,7 +390,7 @@ from trl.experimental.nash_md import NashMDConfig
training_args = NashMDConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -408,7 +403,7 @@ from trl.experimental.xpo import XPOConfig
training_args = XPOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

@@ -421,7 +416,7 @@ from trl import RLOOConfig
training_args = RLOOConfig(
...,
use_vllm=True,
vllm_mode="colocate",
vllm_mode="server",
)
```

5 changes: 3 additions & 2 deletions tests/experimental/test_online_dpo_trainer.py
@@ -241,7 +241,7 @@ def test_training_with_judge(self, config_name):
@require_torch_accelerator
@require_vllm
@pytest.mark.slow
def test_training_with_vllm(self, config_name):
def test_training_with_vllm_server(self, config_name):
def cleanup_vllm_communicator(trainer):
"""Clean up vLLM communicator to avoid conflicts between test runs"""
try:
@@ -258,6 +258,7 @@ def cleanup_vllm_communicator(trainer):
training_args = OnlineDPOConfig(
output_dir=self.tmp_dir,
use_vllm=True,
vllm_mode="server",
vllm_gpu_memory_utilization=0.2,
report_to="none",
)
@@ -351,7 +352,7 @@ def test_vllm_config_validation(self):

# Test default values
config = OnlineDPOConfig()
assert config.vllm_mode == "server"
assert config.vllm_mode == "colocate"
assert config.vllm_server_base_url is None
assert config.vllm_server_host == "0.0.0.0"
assert config.vllm_server_port == 8000