Due to the use of Hydra, verl's parameters are scattered throughout the framework, and the official parameter documentation lacks updates. Therefore, our SGLang RL team, in collaboration with the Amazon AGI SF Lab, has compiled a quick look at verl's parameters. Given the large number of parameters, it is difficult for us to guarantee that every interpretation is absolutely correct, but each has been repeatedly reviewed by us. After careful consideration, we decided to share this parameter quick look with the community, hoping it will be helpful to everyone. The contributors to this guide are:
Ji Li (Ant), Zhuoran Yin (CMU), Changyi Yang (CMU), Chengxi Li (CMU), Xinpeng Wei (Amazon), Chenyang Zhao (Amazon)
| Parameter Name | Detailed Explanation |
|---|---|
data.train_batch_size |
Function: Defines the number of samples sent to the Rollout Engine in a single training step. This is also the number of prompts sampled from the training dataset at the beginning of each PPO iteration. Detailed Explanation: This value is the fundamental sample count in RL training. For example, setting it to 1024 means that in one iteration: 1. 1024 prompts are randomly drawn from the dataset. 2. These 1024 prompts are sent to the current Rollout Engine to generate 1024 complete trajectories (prompt, response). 3. Next, these 1024 trajectories undergo experience calculation ("make experience"), which is subsequently used for updating the Actor and Critic models. Impact and Trade-offs: Affects the total number of samples trained. |
data.val_batch_size (Deprecated) |
Function: The batch size used during the validation phase. Detailed Explanation: Similar to train_batch_size, but only used for evaluating model performance and not for training. If set to null, the size of the validation set is used as the default. Note: This has been deprecated. It is recommended to set it to null. In this case, the entire validation dataset is sent to the SGLang engines at once, which will manage memory themselves. |
actor_rollout_ref.actor.ppo_mini_batch_size critic.ppo_mini_batch_size |
Function: Defines the mini-batch size for PPO training updates. Detailed Explanation: All the experience data collected with data.train_batch_size will be split into multiple mini-batches, with the size of each being ppo_mini_batch_size. The model performs a parameter update only after processing one mini-batch.For example, if train_batch_size = 1024 and ppo_mini_batch_size = 256, the model will perform 1024 / 256 = 4 parameter updates in one PPO Epoch.Impact and Trade-offs: Increasing the mini-batch size makes the gradients for a single update more stable, but it reduces the update frequency and the total number of updates. |
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu critic.ppo_micro_batch_size_per_gpu |
Function: Defines the size of data for a single forward/backward pass on a single GPU. Detailed Explanation: This is the core parameter for implementing gradient accumulation. A mini-batch is further divided into several micro-batches. For example, on a single GPU, if ppo_mini_batch_size = 256 and ppo_micro_batch_size_per_gpu = 32, the number of gradient accumulation steps is 256 / 32 = 8. This means the model will run 8 forward passes to get the loss, then backward passes to get the gradient, processing 32 samples each time, until the gradients for the entire mini-batch are accumulated. Then, the accumulated total gradient is used to update the model parameters once (optimizer.step()). This value must be strictly adjusted based on GPU memory size and is key to preventing OOM (Out of Memory) errors.Impact and Trade-offs: Increasing this value reduces the number of gradient accumulation steps, which can improve training throughput but increases GPU memory consumption. |
actor_rollout_ref.actor.ppo_micro_batch_size critic.ppo_micro_batch_size (Deprecated) |
Function: Deprecated. Replaced by the per_gpu version as it better accommodates distributed training environments. |
When sample lengths vary significantly, batching by the number of samples can lead to highly imbalanced computational loads across different batches. Controlling the batch size based on the total number of tokens is a solution to balance the training time for each batch.
| Parameter Name | Detailed Explanation |
|---|---|
actor_rollout_ref.actor.ppo_max_token_len_per_gpu critic.ppo_max_token_len_per_gpu |
Function: Defines the maximum total number of tokens that a single GPU can process in one PPO micro-batch. Detailed Explanation: This is an alternative to ppo_micro_batch_size_per_gpu and is used in conjunction with use_dynamic_bsz. The system automatically packs samples until the total token count (prompt_len + response_len) approaches this threshold, forming a dynamic micro-batch size. This helps stabilize computational efficiency; the computational load for each micro-batch remains relatively constant, regardless of sample length.For example, if actor_rollout_ref.actor.ppo_max_token_len_per_gpu = 16384, the system might pack 16 samples of length 1024 (16 * 1024 = 16384) or 64 samples of length 256 (64 * 256 = 16384).Impact and Trade-offs: Generally more efficient than fixed-sample-count micro-batches, leading to better utilization of computational resources and reducing GPU instability. Typically set to n * ({data.max_prompt_length} + {data.max_response_length}). |
reward_model.forward_max_token_len_per_gpu critic.forward_max_token_len_per_gpu actor_rollout_ref.ref.log_prob_max_token_len_per_gpu |
Function: The maximum number of tokens in a micro-batch for models that only perform forward computations. Detailed Explanation: Some models (Reward Model, Critic for value calculation, Reference Model for log probs) only perform forward passes during the "make experience" phase. At this point, the rollout engine has been offloaded, and the training engine has not yet started, resulting in very low GPU memory usage. Therefore, a larger batch size can be set for them to accelerate computation. These parameters are also part of use_dynamic_bsz and are used to optimize the execution efficiency of these specific tasks. |
critic.forward_micro_batch_size_per_gpu reward_model.micro_batch_size_per_gpu actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu |
Function: Similarly, sets the micro-batch size for models that only perform forward computations. Detailed Explanation: Same as the parameter in the row above. |
actor_rollout_ref.actor.use_dynamic_bsz critic.use_dynamic_bsz reward_model.use_dynamic_bsz |
Function: Whether to enable Dynamic Batch Size. Detailed Explanation: When this is True, the system ignores the sample-based micro_batch_size_per_gpu parameter and instead uses the token-based max_token_len_per_gpu parameter to construct batches. |
trainer.balance_batch |
Function: Whether to balance the batch size across different data parallel (dp) ranks in distributed training. Detailed Explanation: Reorders data on a single controller to ensure that each dp rank receives a similar number of tokens. |
| Parameter Name | Function and Explanation |
|---|---|
actor_rollout_ref.rollout.temperature |
A higher temperature value smooths the probability distribution, leading to more diverse and random generated results. A lower value sharpens the distribution, making the output more deterministic and conservative, favoring high-probability tokens. temperature=0 is usually equivalent to Greedy Decoding. |
actor_rollout_ref.rollout.top_k |
At each generation step, only the K most probable tokens are considered for sampling. For example, top_k=50 means selecting only from the top 50 most likely tokens.- To disable: Set to 0 or None in Hugging Face, or -1 in SGLang (which samples from the entire vocabulary). |
actor_rollout_ref.rollout.top_p |
Cumulatively sums the probabilities of the most likely tokens until the total probability reaches P, then samples from this nucleus set of tokens. It is a dynamic method for selecting the sampling range. top_p=1.0 means no restriction. |
actor_rollout_ref.rollout.use_fire_sampling |
Whether to use Fire Sampling, from a paper by ByteDance. |
actor_rollout_ref.rollout.n |
The number of responses generated for each prompt, also known as the group size in GRPO. |
actor_rollout_ref.rollout.ignore_eos |
Whether to ignore the EOS (End-of-Sentence) token. If True, generation continues until max_response_length is reached, even if the model produces an EOS token. |
| Parameter Name | Function and Explanation |
|---|---|
actor_rollout_ref.rollout.prompt_length |
The maximum prompt length. Prompts longer than this are truncated. |
actor_rollout_ref.rollout.response_length |
The maximum response length. The SGLang engine will return immediately upon reaching this length. |
actor_rollout_ref.rollout.dtype |
Model data type, e.g., bfloat16, float16. This needs to be aligned with the model type used in the training phase; otherwise, quantization will be required when updating model parameters. |
actor_rollout_ref.rollout.gpu_memory_utilization |
In SGLang, this is the proportion of GPU memory occupied by model parameters and the KV Cache. If using SGLang version 0.4.8.post1 or higher, this can be set to around 0.85. For older versions, it should be set to around 0.5. |
actor_rollout_ref.rollout.free_cache_engine |
Whether to free the engine cache after a rollout. Enabling this option in SGLang triggers the flush_cache() operation, which clears the KV cache pool and marks all slots as available. This releases the logical occupation of the KV Cache without freeing the physical GPU memory. For why flushing the KV cache is needed, see here. |
actor_rollout_ref.rollout.load_format |
Model weight loading mode. E.g., dummy_dtensor (randomly initialized weights for quick debugging), hf, safetensors (recommended for safety and efficiency). |
actor_rollout_ref.rollout.tensor_model_parallel_size (TP_SIZE) |
Tensor model parallel size, indicating how many GPUs are used to run a single SGLang engine. For example, TP_SIZE=4 means splitting a large model's weights into 4 parts, with 4 GPUs collaborating on inference. |
actor_rollout_ref.rollout.max_model_len |
The maximum total length (prompt + response) the model can handle. If not set, it is usually determined by the model's configuration. |
actor_rollout_ref.rollout.max_num_seqs |
The maximum number of requests the engine can process concurrently, or the maximum number of prompts being inferred simultaneously. |
actor_rollout_ref.rollout.enable_chunked_prefill |
Whether to enable Chunked Prefill. For very long prompts, this can split them into chunks for processing, which reduces peak memory usage at the cost of lower throughput. |
actor_rollout_ref.rollout.disable_log_stats |
Whether to disable the inference engine's statistical logs to reduce console output. |
| Parameter Name | Function and Explanation |
|---|---|
actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend |
The attention backend used by SGLang. Options like flashinfer, triton, flashmla, null are available to suit different graphics cards. |
These parameters are primarily for scenarios requiring multi-turn interactions, such as tool calling or continuous dialogue, supported by the SGLang Engine.
| Parameter Name | Function and Explanation |
|---|---|
actor_rollout_ref.rollout.multi_turn.enable |
Whether to enable multi-turn dialogue mode. |
actor_rollout_ref.rollout.multi_turn.max_turns |
The maximum number of tool calling rounds. If null, it defaults to max_model_len // 3 to prevent infinite dialogues. |
actor_rollout_ref.rollout.multi_turn.tool_config_path |
The path to the tool configuration file, which defines the external tools the model can call. |
actor_rollout_ref.rollout.multi_turn.completion_callback |
A custom callback function that can execute custom logic after each generation round. |
actor_rollout_ref.rollout.multi_turn.use_inference_chat_template |
Whether to use the model's chat template from the inference phase. True means following the inference-stage template format. False means using the template from pre-training, which may contain a complete token sequence with an additional thought process. For any model, it is crucial to ensure consistent templates are used during post-training and subsequent inference testing stages. |
actor_rollout_ref.rollout.multi_turn.enable_tokenization_sanity_check |
Whether to perform a tokenization sanity check, verifying that the result of tokenizing turn-by-turn is consistent with tokenizing the entire chat history at once. |
| Parameter Name | Function and Explanation |
|---|---|
actor_rollout_ref.rollout.val_kwargs.* |
Sampling parameters for the validation phase. This allows us to use different sampling parameters during post-training and validation. For example, during validation, it is common to set temperature=0 and do_sample=False for greedy decoding to obtain more stable evaluation results. |
| Parameter Name | Function and Explanation |
|---|---|
data.tokenizer |
The class or path of the tokenizer. If null, it will be automatically inferred from the model. |
data.use_shm |
Whether to use shared memory (SHM) to load data. |
data.train_files |
Training set Parquet files. Can be a list or a single file; paths can be local or HDFS paths. |
data.val_files |
Validation set Parquet files. Can be a list or a single file. |
data.prompt_key |
The field for the prompt in the dataset. Defaults to prompt. |
data.reward_fn_key |
The field used to select the reward function (if different reward functions are used for each sample). |
data.max_prompt_length |
Maximum prompt length. All prompts will be left-padded to this length. |
data.return_raw_input_ids |
Whether to return the raw input_ids without the chat template applied; used when the reward model's chat template differs from the policy model's. |
data.return_raw_chat |
Whether to return the raw response without the chat template applied. |
data.return_full_prompt |
Whether to return the full prompt with the chat template applied. |
data.shuffle |
Whether to shuffle the data in the DataLoader. |
data.validation_shuffle |
Whether to shuffle the validation set. |
data.filter_overlong_prompts |
Whether to filter out overly long prompts. |
data.filter_overlong_prompts_workers |
The number of worker processes for filtering overly long prompts. Use multiple processes for large datasets to speed up. Defaults to 1. |
data.truncation |
Truncate if input_ids or prompt exceeds the maximum length. |
data.image_key |
The field representing images in a multi-modal dataset. Defaults to images. |
data.video_key |
The field representing videos in a multi-modal dataset. |
data.trust_remote_code |
Whether to trust the local Hugging Face cache; note, this 'remote' is relative to Hugging Face, so this parameter considers "whether to trust local." |
data.custom_cls.path |
The file path containing the custom dataset class. If not specified, a pre-implemented default dataset will be used. |
data.custom_cls.name |
The name of the dataset class in the specified file. |
The parameters for Critic and Actor are very consistent and will not be repeated.
| Parameter Name | Description |
|---|---|
actor_rollout_ref.hybrid_engine |
Currently only supports hybrid engine, which places the actor and rollout models on the same resource group. |
actor_rollout_ref.model.path |
Hugging Face model path. Can be a local path or an HDFS path. |
actor_rollout_ref.model.use_shm |
Whether to use shared memory (SHM) to accelerate model weight loading. |
actor_rollout_ref.model.external_lib |
Additional Python packages for registering Hugging Face models/tokenizers. |
actor_rollout_ref.model.override_config |
Used to override the model's original configuration, mainly for dropout. |
actor_rollout_ref.model.enable_gradient_checkpointing |
Whether to recompute gradients during actor training, trading time for space. |
actor_rollout_ref.model.enable_activation_offload |
Whether to offload activations to the CPU during actor training. |
actor_rollout_ref.model.use_remove_padding |
Whether to remove padding tokens from the input during training. |
actor_rollout_ref.model.use_liger |
Whether to use the Liger kernel for linear layer fusion. |
actor_rollout_ref.model.use_fused_kernels |
Whether to use custom fused kernels (e.g., FlashAttention, fused MLP). |
actor_rollout_ref.model.fused_kernel_options.impl_backend |
The implementation backend for fused kernels, either triton or torch. Must be used with use_fused_kernels. |
actor_rollout_ref.model.trust_remote_code |
Whether to trust the local Hugging Face cache; note, this 'remote' is relative to Hugging Face, so this parameter considers "whether to trust local." |
actor_rollout_ref.actor.strategy |
Training backend: fsdp, fsdp2, or megatron. |
actor_rollout_ref.actor.grad_clip |
Gradient clipping for Actor updates. |
actor_rollout_ref.actor.clip_ratio |
PPO clipping ratio. |
actor_rollout_ref.actor.clip_ratio_low |
The lower bound for asymmetric clipping (for dual-clip PPO). |
actor_rollout_ref.actor.clip_ratio_high |
The upper bound for asymmetric clipping (for dual-clip PPO). |
actor_rollout_ref.actor.clip_ratio_c |
The constant C in dual-clip PPO; clipping occurs when advantage < -C. |
actor_rollout_ref.actor.loss_agg_mode |
Loss aggregation mode: token-mean, seq-mean-token-sum, or seq-mean-token-mean. |
actor_rollout_ref.actor.entropy_coeff |
The entropy regularization coefficient in the PPO loss. |
actor_rollout_ref.actor.use_kl_loss |
Whether to use KL loss instead of a KL reward penalty. True for GRPO. |
actor_rollout_ref.actor.use_torch_compile |
Whether to use torch.compile(). |
actor_rollout_ref.actor.kl_loss_coef |
The KL loss coefficient when use_kl_loss is enabled, used for GRPO. |
actor_rollout_ref.actor.kl_loss_type |
The type of KL divergence loss. Options: kl, abs, mse, low_var_kl, full. |
actor_rollout_ref.actor.ppo_epochs |
The number of PPO epochs. |
actor_rollout_ref.actor.shuffle |
Shuffle the training data. |
actor_rollout_ref.actor.ulysses_sequence_parallel_size |
The sequence parallel size for Ulysses-style parallelism. |
actor_rollout_ref.actor.entropy_from_logits_with_chunking |
Compute entropy via chunking to reduce peak memory usage. |
actor_rollout_ref.actor.entropy_checkpointing |
Whether to save entropy via checkpointing. |
actor_rollout_ref.actor.checkpoint.save_contents |
The contents to be included in the saved checkpoint. |
actor_rollout_ref.actor.checkpoint.load_contents |
The specific contents to load from a checkpoint. |
actor_rollout_ref.actor.optim.lr |
Learning rate. |
actor_rollout_ref.actor.optim.lr_warmup_steps |
Number of warmup steps; a negative value means it's determined by lr_warmup_steps_ratio. |
actor_rollout_ref.actor.optim.lr_warmup_steps_ratio |
The ratio of warmup steps (used when lr_warmup_steps is negative). |
actor_rollout_ref.actor.optim.min_lr_ratio |
The minimum learning rate ratio for the cosine scheduler. |
actor_rollout_ref.actor.optim.num_cycles |
The number of cosine cycles in the learning rate schedule. |
actor_rollout_ref.actor.optim.warmup_style |
Learning rate warmup style: constant or cosine. |
actor_rollout_ref.actor.optim.total_training_steps |
Total number of training steps. |
actor_rollout_ref.actor.optim.weight_decay |
Weight decay coefficient, controlling the strength of L2 regularization applied to weights during training. |
actor_rollout_ref.actor.fsdp_config.wrap_policy.min_num_params |
The minimum number of parameters to trigger FSDP wrapping for a layer. |
actor_rollout_ref.actor.fsdp_config.param_offload |
Whether to offload model parameters to the CPU (trading speed for memory). |
actor_rollout_ref.actor.fsdp_config.optimizer_offload |
Whether to offload optimizer states to the CPU. |
actor_rollout_ref.actor.fsdp_config.offload_policy |
For FSDP2 only: Offload parameters/gradients/optimizer during training. |
actor_rollout_ref.actor.fsdp_config.reshard_after_forward |
For FSDP2 only: Reshard after the forward pass to reduce memory usage. |
actor_rollout_ref.actor.fsdp_config.fsdp_size |
The number of GPUs in each FSDP sharding group; -1 means automatic. |
actor_rollout_ref.actor.fsdp_config.forward_prefetch |
For FSDP1 only: Prefetch the all-gather for the next forward pass before the current one completes. |
actor_rollout_ref.actor.profiler.discrete |
True means each task has its own database; False means all tasks share one. |
actor_rollout_ref.actor.profiler.all_ranks |
Whether to profile all ranks. |
actor_rollout_ref.actor.profiler.ranks |
The ranks to be profiled. null or [0,1,...]. |
actor_rollout_ref.ref.strategy |
FSDP configuration for the Reference model, same as the actor. |
actor_rollout_ref.ref.fsdp_config.param_offload |
Whether to offload parameters in FSDP. |
actor_rollout_ref.ref.fsdp_config.reshard_after_forward |
For FSDP2 only: Whether to reshard after the model's forward pass to save memory. |
actor_rollout_ref.ref.fsdp_config.forward_prefetch |
For FSDP1 only: Prefetch the all-gather for the next forward pass before the current one completes. |
actor_rollout_ref.ref.fsdp_config.wrap_policy.min_num_params |
The minimum number of parameters in an FSDP-wrapped module. |
actor_rollout_ref.ref.profiler.discrete |
True means each task has its own database; False means all tasks share one. |
actor_rollout_ref.ref.profiler.all_ranks |
Whether to profile all ranks. |
actor_rollout_ref.ref.profiler.ranks |
The ranks to be profiled. null or [0,1,...]. |
| Parameter Name | Description |
|---|---|
reward_model.enable |
Whether to enable the reward model. If False, rewards are calculated only using user-defined reward functions. |
reward_model.strategy |
FSDP strategy: fsdp, fsdp2, or megatron. |
reward_model.model.input_tokenizer |
Input tokenizer. Required if the reward model's chat template is inconsistent with the policy's. |
reward_model.model.path |
The HDFS or local path to the RM. Only AutoModelForSequenceClassification is supported. |
reward_model.model.use_shm |
Whether to use shared memory to load the model. |
reward_model.model.external_lib |
External model implementation (optional). |
reward_model.model.use_remove_padding |
Use remove padding optimization (saves computation). |
reward_model.model.use_fused_kernels |
Whether to use fused reward kernels for acceleration. |
reward_model.model.trust_remote_code |
Whether to allow loading models with remote code, defaults to False. |
reward_model.model.fsdp_config.wrap_policy.min_num_params |
The minimum number of parameters to trigger FSDP wrapping. |
reward_model.model.fsdp_config.param_offload |
Whether to offload model parameters to the CPU. |
reward_model.model.fsdp_config.reshard_after_forward |
For FSDP2 only: Reshard after the forward pass to reduce memory usage. |
reward_model.model.fsdp_config.fsdp_size |
The number of GPUs in each FSDP sharding group; -1 means automatic. |
reward_model.model.fsdp_config.forward_prefetch |
For FSDP1 only: Prefetch the all-gather for the next forward pass before the current one completes. |
reward_model.reward_manager |
Defines the mechanism for calculating rule-based rewards and handling different reward sources. |
reward_model.launch_reward_fn_async |
Whether to launch custom reward functions asynchronously during the log_prob phase. |
reward_model.sandbox_fusion.url |
The URL for remote reward functions. |
reward_model.sandbox_fusion.max_concurrent |
The maximum number of concurrent requests allowed to the sandbox. |
reward_model.profiler.discrete |
True means each task has its own database; False means all tasks share one. |
| Parameter Name | Description |
|---|---|
custom_reward_function.path |
The file path containing the custom reward function. |
custom_reward_function.name |
The name of the reward function in the specified file. Defaults to compute_score. |
| Parameter Name | Description |
|---|---|
algorithm.gamma |
Discount factor for future rewards. |
algorithm.lam |
The trade-off between bias and variance in the GAE estimator. |
algorithm.adv_estimator |
The type of advantage estimator: gae, grpo, reinforce_plus_plus, etc. |
algorithm.norm_adv_by_std_in_grpo |
Whether to normalize advantage by its standard deviation in GRPO. |
algorithm.use_kl_in_reward |
Whether to enable KL penalty in the reward. |
algorithm.kl_penalty |
How to estimate KL divergence: kl, abs, mse, low_var_kl, or full. |
algorithm.kl_ctrl.type |
KL control type: fixed or adaptive. |
algorithm.kl_ctrl.kl_coef |
The initial coefficient for the KL penalty. |
algorithm.kl_ctrl.horizon |
The horizon value for the adaptive controller (if enabled). |
algorithm.kl_ctrl.target_kl |
The target KL divergence (for the adaptive controller). |
algorithm.use_pf_ppo |
Whether to enable Preference-Feedback PPO. |
algorithm.pf_ppo.reweight_method |
Sample re-weighting method: pow, max_min, or max_random. |
algorithm.pf_ppo.weight_pow |
The power used for weight scaling in the pow method. |
| Parameter Name | Description |
|---|---|
trainer.balance_batch |
Whether to balance batch sizes across distributed worker nodes. |
trainer.total_epochs |
The total number of training epochs. |
trainer.total_training_steps |
Total training steps (can be set explicitly or derived from epochs). |
trainer.profile_steps |
The steps to be profiled. null means no profiling. |
trainer.controller_nsight_options.trace |
For the controller process, selects the APIs to trace (e.g., cuda, nvtx, cublas, etc.). |
trainer.controller_nsight_options.cuda-memory-usage |
For the controller process, whether to profile CUDA memory usage. Must be the string "true" or "false". |
trainer.controller_nsight_options.cuda-graph-trace |
For the controller process, whether CUDA graphs will be traced as a whole. |
trainer.worker_nsight_options.trace |
For worker processes, selects the APIs to trace. |
trainer.worker_nsight_options.cuda-memory-usage |
For worker processes, whether to profile CUDA memory usage. Must be the string "true" or "false". |
trainer.worker_nsight_options.cuda-graph-trace |
For worker processes, whether CUDA graphs will be traced as a whole. |
trainer.worker_nsight_options.capture-range |
Profile only within the torch.cuda.profiler.start and stop range. Default is cudaProfilerApi, do not change this setting. |
trainer.worker_nsight_options.capture-range-end |
Specifies the desired behavior when the capture range ends. |
trainer.worker_nsight_options.kill |
Sends a signal to the target application's process group. We let the program exit on its own. |
trainer.project_name |
The project name for experiment tracking (e.g., wandb). |
trainer.experiment_name |
The experiment name to identify the run in tracking tools. |
trainer.logger |
The logging backend to use: console, wandb, etc. |
trainer.log_val_generations |
The number of generations to log during validation. |
trainer.rollout_data_dir |
The directory to log rollout data; if null, data is not dumped. |
trainer.validation_data_dir |
The directory to log validation data; if null, data is not dumped. |
trainer.nnodes |
The number of nodes used in training. |
trainer.n_gpus_per_node |
The number of GPUs per node. |
trainer.save_freq |
The frequency of saving model checkpoints (in number of iterations). |
trainer.resume_mode |
Resume mode: auto, disable, or resume_path. |
trainer.resume_from_path |
Resume training from this path (used only if resume_mode is resume_path). |
trainer.val_before_train |
Whether to run validation before training starts. |
trainer.val_only |
Whether to run only validation. |
trainer.test_freq |
The validation frequency (in number of training iterations). |
trainer.critic_warmup |
The number of iterations to pre-warm the critic before updating the policy. |
trainer.default_hdfs_dir |
The default distributed file system path for saving checkpoints. |
trainer.del_local_ckpt_after_load |
Whether to delete local checkpoints after loading. |
trainer.default_local_dir |
The default local directory for saving checkpoints. |
trainer.max_actor_ckpt_to_keep |
The maximum number of actor checkpoints to keep. |
trainer.max_critic_ckpt_to_keep |
The maximum number of critic checkpoints to keep. |
trainer.ray_wait_register_center_timeout |
The timeout (in seconds) for Ray workers to wait for registration. |
trainer.device |
The device to run training on (e.g., cuda, cpu). |
| Parameter Name | Description |
|---|---|
ray_init.num_cpus |
The number of CPUs for Ray to use. A fixed number should be used instead of null when using SLURM. |
ray_init.timeline_json_file |
The path to save the Ray timeline JSON file for performance analysis. |