diff --git a/MIGRATION.md b/MIGRATION.md
new file mode 100644
index 0000000000..6877b331c7
--- /dev/null
+++ b/MIGRATION.md
@@ -0,0 +1,20 @@
+# Migrating from TRL v0 to v1
+
+This guide covers the breaking changes introduced in TRL v1 and how to update your code. Most structural changes (trainers moved to experimental, removed model classes, etc.) already shipped in v0.29 — if you're already on v0.29, this migration is minimal.
+
+## Changed defaults
+
+| Config | Parameter | v0 default | v1 default | Action needed |
+| --- | --- | --- | --- | --- |
+| `GRPOConfig` | `vllm_mode` | `"server"` | `"colocate"` | If you use `use_vllm=True` without specifying `vllm_mode`, vLLM will now run in the same process instead of connecting to a separate server. Set `vllm_mode="server"` explicitly if you rely on server mode. |
+| `RLOOConfig` | `vllm_mode` | `"server"` | `"colocate"` | Same as above. |
+
+## Renamed options
+
+| Config | Parameter | v0 value | v1 value | Action needed |
+| --- | --- | --- | --- | --- |
+| `SFTConfig` | `packing` | `"bfd-requeue"` | `"bfd_split"` | Replace `packing="bfd-requeue"` with `packing="bfd_split"`. The old value will still be accepted for a few versions but will be removed in a future release. |
+
+## Migrating from an earlier version
+
+Depending on which version you're migrating from, refer to the [release notes](https://github.com/huggingface/trl/releases) for v0.29 and earlier for version-specific changes.
diff --git a/docs/source/grpo_trainer.md b/docs/source/grpo_trainer.md
index 7f3c095042..3e9fe6c048 100644
--- a/docs/source/grpo_trainer.md
+++ b/docs/source/grpo_trainer.md
@@ -206,7 +206,20 @@ We support two ways of using vLLM during training: **server mode** and **colocat
 
 > [!TIP]
 > By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`.
 > For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling)
 
-#### 🔌 Option 1: Server mode
+#### Option 1: Colocate mode
+
+In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(
+    ...,
+    use_vllm=True,  # vllm_mode="colocate" by default
+)
+```
+
+#### Option 2: Server mode
 
 In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 
@@ -224,27 +237,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and comm
 training_args = GRPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    vllm_mode="server",
 )
 ```
 
 > [!WARNING]
 > Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
 
-#### 🧩 Option 2: Colocate mode
-
-In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
-
-```python
-from trl import GRPOConfig
-
-training_args = GRPOConfig(
-    ...,
-    use_vllm=True,
-    vllm_mode="colocate",
-)
-```
-
 > [!TIP]
 > Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors.
 >
@@ -349,6 +348,7 @@ def main():
     training_args = GRPOConfig(
         per_device_train_batch_size=4,
         use_vllm=True,
+        vllm_mode="server",
         vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
     )
diff --git a/docs/source/rloo_trainer.md b/docs/source/rloo_trainer.md
index ef7db32d6a..817bb78e11 100644
--- a/docs/source/rloo_trainer.md
+++ b/docs/source/rloo_trainer.md
@@ -161,7 +161,20 @@ pip install trl[vllm]
 
 We support two ways of using vLLM during training: **server mode** and **colocate mode**.
 
-#### 🔌 Option 1: Server mode
+#### Option 1: Colocate mode
+
+In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
+
+```python
+from trl import RLOOConfig
+
+training_args = RLOOConfig(
+    ...,
+    use_vllm=True,  # vllm_mode="colocate" by default
+)
+```
+
+#### Option 2: Server mode
 
 In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
 
@@ -179,27 +192,13 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and comm
 training_args = RLOOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    vllm_mode="server",
 )
 ```
 
 > [!WARNING]
 > Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable.
 
-#### 🧩 Option 2: Colocate mode
-
-In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
-
-```python
-from trl import RLOOConfig
-
-training_args = RLOOConfig(
-    ...,
-    use_vllm=True,
-    vllm_mode="colocate",
-)
-```
-
 > [!TIP]
 > Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors.
 >
@@ -278,6 +277,7 @@ def main():
         per_device_train_batch_size=4,
         bf16=True,
         use_vllm=True,
+        vllm_mode="server",
         vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
     )
diff --git a/docs/source/speeding_up_training.md b/docs/source/speeding_up_training.md
index ec34a27454..c855cc0623 100644
--- a/docs/source/speeding_up_training.md
+++ b/docs/source/speeding_up_training.md
@@ -27,7 +27,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 
 ```python
 from trl.experimental.online_dpo import OnlineDPOConfig
 
-training_args = OnlineDPOConfig(..., use_vllm=True)
+training_args = OnlineDPOConfig(..., use_vllm=True, vllm_mode="server")
 ```
@@ -44,7 +44,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 
 ```python
 from trl import GRPOConfig
 
-training_args = GRPOConfig(..., use_vllm=True)
+training_args = GRPOConfig(..., use_vllm=True, vllm_mode="server")
 ```
 
 You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
@@ -78,7 +78,7 @@ Then, run the training script and pass `use_vllm=True` in the training arguments
 
 ```python
 from trl import RLOOConfig
 
-training_args = RLOOConfig(..., use_vllm=True)
+training_args = RLOOConfig(..., use_vllm=True, vllm_mode="server")
 ```
 
 You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration).
diff --git a/docs/source/vllm_integration.md b/docs/source/vllm_integration.md
index d725bbaad3..46eaf43496 100644
--- a/docs/source/vllm_integration.md
+++ b/docs/source/vllm_integration.md
@@ -52,7 +52,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = GRPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=GRPOConfig(use_vllm=True),
+    args=GRPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -72,7 +72,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = OnlineDPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=OnlineDPOConfig(use_vllm=True),
+    args=OnlineDPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -92,7 +92,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = NashMDTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=NashMDConfig(use_vllm=True),
+    args=NashMDConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -112,7 +112,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = XPOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=XPOConfig(use_vllm=True),
+    args=XPOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -132,7 +132,7 @@ dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
 
 trainer = RLOOTrainer(
     model="Qwen/Qwen2.5-7B",
-    args=RLOOConfig(use_vllm=True),
+    args=RLOOConfig(use_vllm=True, vllm_mode="server"),
     reward_funcs=accuracy_reward,
     train_dataset=dataset,
 )
@@ -276,12 +276,12 @@ CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/
 
 ### Modes of Using vLLM During Training
 
-TRL supports **two modes** for integrating vLLM during training: **server mode** and **colocate mode**.
+TRL supports **two modes** for integrating vLLM during training: **colocate mode** (default) and **server mode**.
 
-#### Server Mode
+#### Colocate Mode
 
-In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
-This setup is ideal if you have GPUs dedicated to inference.
+In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
+This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs. This is the default mode.
 
 Example configuration:
 
@@ -293,8 +293,7 @@ from trl import GRPOConfig
 
 training_args = GRPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -306,8 +305,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
 
 training_args = OnlineDPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -319,8 +317,7 @@ from trl.experimental.nash_md import NashMDConfig
 
 training_args = NashMDConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -332,8 +329,7 @@ from trl.experimental.xpo import XPOConfig
 
 training_args = XPOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
@@ -345,18 +341,17 @@ from trl import RLOOConfig
 
 training_args = RLOOConfig(
     ...,
-    use_vllm=True,
-    vllm_mode="server",  # default value, can be omitted
+    use_vllm=True,  # vllm_mode="colocate" by default
 )
 ```
 
-#### Colocate Mode
+#### Server Mode
 
-In **colocate mode**, vLLM runs inside the trainer process and shares GPU memory with the training model.
-This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
+In **server mode**, vLLM runs as a separate process on dedicated GPUs and communicates with the trainer via HTTP.
+This setup is ideal if you have GPUs dedicated to inference.
 
 Example configuration:
 
@@ -369,7 +364,7 @@ from trl import GRPOConfig
 
 training_args = GRPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -382,7 +377,7 @@ from trl.experimental.online_dpo import OnlineDPOConfig
 
 training_args = OnlineDPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -395,7 +390,7 @@ from trl.experimental.nash_md import NashMDConfig
 
 training_args = NashMDConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -408,7 +403,7 @@ from trl.experimental.xpo import XPOConfig
 
 training_args = XPOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
 
@@ -421,7 +416,7 @@ from trl import RLOOConfig
 
 training_args = RLOOConfig(
     ...,
     use_vllm=True,
-    vllm_mode="colocate",
+    vllm_mode="server",
 )
 ```
diff --git a/tests/experimental/test_online_dpo_trainer.py b/tests/experimental/test_online_dpo_trainer.py
index c3bd23cf41..e55263f4e1 100644
--- a/tests/experimental/test_online_dpo_trainer.py
+++ b/tests/experimental/test_online_dpo_trainer.py
@@ -241,7 +241,7 @@ def test_training_with_judge(self, config_name):
     @require_torch_accelerator
     @require_vllm
     @pytest.mark.slow
-    def test_training_with_vllm(self, config_name):
+    def test_training_with_vllm_server(self, config_name):
         def cleanup_vllm_communicator(trainer):
             """Clean up vLLM communicator to avoid conflicts between test runs"""
             try:
@@ -258,6 +258,7 @@ def cleanup_vllm_communicator(trainer):
         training_args = OnlineDPOConfig(
             output_dir=self.tmp_dir,
             use_vllm=True,
+            vllm_mode="server",
             vllm_gpu_memory_utilization=0.2,
             report_to="none",
         )
@@ -351,7 +352,7 @@ def test_vllm_config_validation(self):
         # Test default values
         config = OnlineDPOConfig()
 
-        assert config.vllm_mode == "server"
+        assert config.vllm_mode == "colocate"
         assert config.vllm_server_base_url is None
         assert config.vllm_server_host == "0.0.0.0"
         assert config.vllm_server_port == 8000
diff --git a/trl/experimental/gold/gold_config.py b/trl/experimental/gold/gold_config.py
index 41839318f7..26f98c4466 100644
--- a/trl/experimental/gold/gold_config.py
+++ b/trl/experimental/gold/gold_config.py
@@ -68,7 +68,7 @@ class GOLDConfig(SFTConfig):
             Whether to skip EOS token for teacher in ULD loss computation.
         use_vllm (`bool`, *optional*, defaults to `False`):
             Whether to use vLLM for generating completions from the student model. Requires `vllm` to be installed.
-        vllm_mode (`str`, *optional*, defaults to `"server"`):
+        vllm_mode (`str`, *optional*, defaults to `"colocate"`):
             Mode for student vLLM integration. Either `"server"` (connect to a running TRL vLLM server) or
             `"colocate"` (run vLLM in the same process).
         vllm_server_host (`str`, *optional*, defaults to `"0.0.0.0"`):
@@ -274,7 +274,7 @@ class GOLDConfig(SFTConfig):
         metadata={"help": "Whether to use vLLM for generating completions. Requires `vllm` to be installed."},
     )
     vllm_mode: str = field(
-        default="server",
+        default="colocate",
         metadata={
             "help": 'Mode for vLLM integration. Either "server" (connect to a running TRL vLLM server) or "colocate" (run vLLM in the same process).'
         },
diff --git a/trl/experimental/online_dpo/online_dpo_config.py b/trl/experimental/online_dpo/online_dpo_config.py
index 10bd72d120..aaffac55af 100644
--- a/trl/experimental/online_dpo/online_dpo_config.py
+++ b/trl/experimental/online_dpo/online_dpo_config.py
@@ -101,7 +101,7 @@ class may differ from those in [`~transformers.TrainingArguments`].
             Model implementation to use for vLLM. Must be one of `"transformers"` or `"vllm"`. `"transformers"`: Use
             the `transformers` backend for model implementation. `"vllm"`: Use the `vllm` library for model
             implementation.
-        vllm_mode (`str`, *optional*, defaults to `"server"`):
+        vllm_mode (`str`, *optional*, defaults to `"colocate"`):
             Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `"server"` or
             `"colocate"`.
@@ -303,7 +303,7 @@ class may differ from those in [`~transformers.TrainingArguments`].
         },
     )
     vllm_mode: str = field(
-        default="server",
+        default="colocate",
         metadata={
             "help": "Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `'server'` or "
             "`'colocate'`. `'server'`: The trainer will send generation requests to a separate vLLM server. Make sure "
diff --git a/trl/generation/vllm_generation.py b/trl/generation/vllm_generation.py
index 0e777811ae..f93d3c323f 100644
--- a/trl/generation/vllm_generation.py
+++ b/trl/generation/vllm_generation.py
@@ -118,13 +118,14 @@ class VLLMGeneration:
 
     > Parameters for vLLM:
 
-    mode (`str`, *optional*, defaults to `"server"`): vLLM mode. Must be one of `"server"` or
-        `"colocate"`.
+    mode (`str`, *optional*, defaults to `"colocate"`):
+        vLLM mode. Must be one of `"colocate"` or `"server"`.
 
-    - `"server"`: The trainer will send generation requests to a separate vLLM server. Make sure a TRL vLLM
-        server is running (start with `trl vllm-serve`).
     - `"colocate"`: vLLM will run in the same process and share the training GPUs. This avoids the need for a
         separate server but may cause resource contention with training.
+    - `"server"`: The trainer will send generation requests to a separate vLLM server. Make sure a TRL vLLM
+        server is running (start with `trl vllm-serve`).
+
     structured_outputs_regex (`str`, *optional*): Regex for vLLM structured outputs. If `None` (default),
         structured outputs is disabled.
@@ -219,7 +220,7 @@ def __init__(
         is_fsdp_enabled: bool,
         processing_class: PreTrainedTokenizerBase | ProcessorMixin,
         # vLLM configuration
-        mode: str = "server",
+        mode: str = "colocate",
         structured_outputs_regex: str | None = None,
         # Server mode configuration
         server_base_url: str | None = None,
diff --git a/trl/trainer/grpo_config.py b/trl/trainer/grpo_config.py
index b8c8549cf6..36147251f1 100644
--- a/trl/trainer/grpo_config.py
+++ b/trl/trainer/grpo_config.py
@@ -113,7 +113,7 @@ class GRPOConfig(_BaseConfig):
         use_vllm (`bool`, *optional*, defaults to `False`):
             Whether to use vLLM for generating completions. If set to `True`, the trainer will use vLLM for generation
             instead of the default model.generate(). Requires `vllm` to be installed.
-        vllm_mode (`str`, *optional*, defaults to `"server"`):
+        vllm_mode (`str`, *optional*, defaults to `"colocate"`):
             Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `"server"` or
             `"colocate"`.
@@ -484,7 +484,7 @@ class GRPOConfig(_BaseConfig):
         },
     )
     vllm_mode: str = field(
-        default="server",
+        default="colocate",
         metadata={
             "help": "Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `'server'` or "
             "`'colocate'`. `'server'`: The trainer will send generation requests to a separate vLLM server. Make sure "
diff --git a/trl/trainer/rloo_config.py b/trl/trainer/rloo_config.py
index 344af8833c..8cdb0335a5 100644
--- a/trl/trainer/rloo_config.py
+++ b/trl/trainer/rloo_config.py
@@ -108,7 +108,7 @@ class RLOOConfig(_BaseConfig):
         use_vllm (`bool`, *optional*, defaults to `False`):
             Whether to use vLLM for generating completions. If set to `True`, the trainer will use vLLM for generation
             instead of the default model.generate(). Requires `vllm` to be installed.
-        vllm_mode (`str`, *optional*, defaults to `"server"`):
+        vllm_mode (`str`, *optional*, defaults to `"colocate"`):
             Mode to use for vLLM integration when `use_vllm` is set to `True`.
             Must be one of `"server"` or `"colocate"`.
@@ -367,7 +367,7 @@ class RLOOConfig(_BaseConfig):
         },
     )
     vllm_mode: str = field(
-        default="server",
+        default="colocate",
         metadata={
             "help": "Mode to use for vLLM integration when `use_vllm` is set to `True`. Must be one of `'server'` or "
             "`'colocate'`. `'server'`: The trainer will send generation requests to a separate vLLM server. Make sure "