While running grpo_fast.py locally on my laptop (16 GB VRAM, 32 GB RAM), I found that offloading the optimizer states to CPU was essential for training to succeed in my setup.
I think making these options configurable via the CLI (and therefore easy to toggle from grpo_fast.sh) would be a small quality-of-life improvement.
Specifically:
--use_cpu_adam
--offload_optimizer_to_cpu
--offload_params_to_cpu (optional)
--zero_force_ds_cpu_optimizer (optional)
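For context, these flags are meant to toggle the standard DeepSpeed ZeRO offload settings (the actual wiring through get_train_ds_config is in the diff at the bottom). A rough sketch of the resulting config, using key names from the DeepSpeed config reference rather than anything specific to this repo:

# Illustrative sketch only: the ZeRO settings the new flags control.
zero_optimization = {
    "stage": 2,
    # --offload_optimizer_to_cpu: keep the Adam optimizer states in host RAM
    "offload_optimizer": {"device": "cpu", "pin_memory": False},
    # --offload_params_to_cpu: parameter offload only takes effect with ZeRO stage 3
    "offload_param": {"device": "none", "pin_memory": False},
}
# --use_cpu_adam swaps torch.optim.AdamW for deepspeed.ops.adam.DeepSpeedCPUAdam.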
I’ve implemented and tested this locally (full command and diff below) and confirmed training completes successfully. This isn’t a critical feature, but it would make local experimentation a bit easier.
Thanks for maintaining this project!
uv run python open_instruct/grpo_fast.py \
--dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 64 \
--dataset_mixer_list_splits train \
--dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 16 \
--dataset_mixer_eval_list_splits train \
--max_token_length 512 \
--max_prompt_token_length 256 \
--response_length 256 \
--pack_length 512 \
--per_device_train_batch_size 1 \
--num_unique_prompts_rollout 2 \
--num_samples_per_prompt_rollout 8 \
--model_name_or_path Qwen/Qwen3-0.6B \
--stop_strings "</answer>" \
--apply_r1_style_format_reward \
--apply_verifiable_reward \
--temperature 0.8 \
--chat_template_name r1_simple_chat_postpend_think \
--learning_rate 3e-7 \
--total_episodes 100000 \
--deepspeed_stage 2 \
--use_cpu_adam true \
--offload_optimizer_to_cpu true \
--offload_params_to_cpu false \
--num_epochs 2 \
--num_learners_per_node 1 \
--vllm_tensor_parallel_size 1 \
--beta 0.01 \
--clip_higher 0.28 \
--seed 3 \
--local_eval_every 150 \
--vllm_sync_backend gloo \
--vllm_gpu_memory_utilization 0.30 \
--gather_whole_model false \
--async_steps 1 \
--save_traces \
--vllm_enforce_eager \
--gradient_checkpointing \
--single_gpu_mode true \
--with_tracking \
--save_freq 0 \
--wandb_project grpo-qwen0.6b-gsm8k-v5
Here’s a quick diff snippet—just in case it’s useful.
@@ class Args:
     fused_optimizer: bool = False
     """Whether to use fused optimizer"""
+    use_cpu_adam: bool = False
+    """Whether to use DeepSpeedCPUAdam"""
+    offload_optimizer_to_cpu: bool = False
+    """Whether to offload optimizer state to CPU"""
+    offload_params_to_cpu: bool = False
+    """Whether to offload model parameters to CPU"""
+    zero_force_ds_cpu_optimizer: Optional[bool] = None
+    """Force DeepSpeed CPU optimizer when offloading; if None, use DeepSpeed's default"""
@@ PolicyTrainerRayProcess.from_pretrained(...):
-        ds_config = get_train_ds_config(offload=False, adam_offload=False,
-                                         stage=args.deepspeed_stage, bf16=True)
+        ds_config = get_train_ds_config(offload=args.offload_params_to_cpu,
+                                         adam_offload=args.offload_optimizer_to_cpu,
+                                         stage=args.deepspeed_stage, bf16=True)
+        if "zero_optimization" in ds_config:
+            ds_config["zero_optimization"].setdefault("offload_param", {})
+            ds_config["zero_optimization"].setdefault("offload_optimizer", {})
+            ds_config["zero_optimization"]["offload_param"]["device"] = (
+                "cpu" if args.offload_params_to_cpu else "none"
+            )
+            ds_config["zero_optimization"]["offload_param"].setdefault("pin_memory", False)
+            ds_config["zero_optimization"]["offload_optimizer"]["device"] = (
+                "cpu" if args.offload_optimizer_to_cpu else "none"
+            )
+            ds_config["zero_optimization"]["offload_optimizer"].setdefault("pin_memory", False)
+        if args.zero_force_ds_cpu_optimizer is not None:
+            # zero_force_ds_cpu_optimizer is a top-level DeepSpeed config key
+            ds_config["zero_force_ds_cpu_optimizer"] = args.zero_force_ds_cpu_optimizer
@@ PolicyTrainerRayProcess.from_pretrained(...):
-        # self.optimizer = AdamOptimizer(optim_params, lr=args.learning_rate)
-        self.optimizer = torch.optim.AdamW(optim_params, lr=args.learning_rate, fused=args.fused_optimizer)
+        from deepspeed.ops.adam import DeepSpeedCPUAdam
+        if args.use_cpu_adam:
+            self.optimizer = DeepSpeedCPUAdam(optim_params, lr=args.learning_rate)
+        else:
+            self.optimizer = torch.optim.AdamW(optim_params, lr=args.learning_rate, fused=args.fused_optimizer)
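One note on the optimizer swap: torch.optim.AdamW uses decoupled weight decay, and DeepSpeedCPUAdam defaults to adamw_mode=True, so the two paths should behave consistently. If you'd rather make that explicit, a sketch (reusing the names from the hunk above):

# Sketch: spell out the AdamW-style setting so the CPU and fused GPU paths stay comparable.
self.optimizer = DeepSpeedCPUAdam(
    optim_params,
    lr=args.learning_rate,
    adamw_mode=True,  # decoupled weight decay, matching torch.optim.AdamW
)

Also worth noting: when offload_optimizer is set to cpu, DeepSpeed by default refuses a client-provided torch optimizer; setting zero_force_ds_cpu_optimizer to False is what permits that combination, which is the case the optional --zero_force_ds_cpu_optimizer flag is there to cover.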
