Feature request: Add CLI toggles for CPU offloading in grpo_fast.py #1031

@ShotaKaji5207

Description

While running grpo_fast.py locally on my laptop (16 GB VRAM, 32 GB RAM), I found that offloading the optimizer state to CPU was essential for training to succeed in my setup.
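As a rough back-of-the-envelope check on why this matters with 16 GB of VRAM: under ZeRO-2 with Adam, the optimizer holds fp32 master weights plus two fp32 moment buffers per parameter, so offloading optimizer state moves roughly 12 bytes per parameter off the GPU. A quick sketch (the byte counts are the standard mixed-precision accounting, not measured from this repo):

```python
def adam_offload_savings_gb(num_params: int) -> float:
    # fp32 master weights (4 B) + Adam exp_avg (4 B) + exp_avg_sq (4 B)
    bytes_per_param = 4 + 4 + 4
    return num_params * bytes_per_param / 1024**3

# Qwen3-0.6B has roughly 0.6e9 parameters
print(f"{adam_offload_savings_gb(600_000_000):.1f} GB")  # ~6.7 GB
```

That is a sizable fraction of a 16 GB card, before activations and vLLM's share are even counted.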

Making these options configurable via CLI (and exposing them in grpo_fast.sh) would be a small quality-of-life improvement.

Specifically:

--use_cpu_adam
--offload_optimizer_to_cpu
--offload_params_to_cpu (optional)
--zero_force_ds_cpu_optimizer (optional)
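Roughly, the first three toggles map onto the `device` fields of DeepSpeed's ZeRO offload sub-configs. A minimal sketch of that mapping (field names follow DeepSpeed's config schema; `zero_offload_section` is a hypothetical helper, not a function in this repo):

```python
def zero_offload_section(offload_params: bool, offload_optimizer: bool) -> dict:
    # Each toggle selects the "device" of the corresponding ZeRO
    # offload sub-config; "none" keeps that state on the GPU.
    return {
        "offload_param": {
            "device": "cpu" if offload_params else "none",
            "pin_memory": False,
        },
        "offload_optimizer": {
            "device": "cpu" if offload_optimizer else "none",
            "pin_memory": False,
        },
    }

section = zero_offload_section(offload_params=False, offload_optimizer=True)
print(section["offload_optimizer"]["device"])  # cpu
```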

I’ve implemented and tested this locally and confirmed training completes successfully. This isn’t a critical feature, but it would make local experimentation a bit easier.

Thanks for maintaining this project!

uv run python open_instruct/grpo_fast.py \
    --dataset_mixer_list ai2-adapt-dev/rlvr_gsm8k_zs 64 \
    --dataset_mixer_list_splits train \
    --dataset_mixer_eval_list ai2-adapt-dev/rlvr_gsm8k_zs 16 \
    --dataset_mixer_eval_list_splits train \
    --max_token_length 512 \
    --max_prompt_token_length 256 \
    --response_length 256 \
    --pack_length 512 \
    --per_device_train_batch_size 1 \
    --num_unique_prompts_rollout 2 \
    --num_samples_per_prompt_rollout 8 \
    --model_name_or_path Qwen/Qwen3-0.6B \
    --stop_strings "</answer>" \
    --apply_r1_style_format_reward \
    --apply_verifiable_reward \
    --temperature 0.8 \
    --chat_template_name r1_simple_chat_postpend_think \
    --learning_rate 3e-7 \
    --total_episodes 100000 \
    --deepspeed_stage 2 \
    --use_cpu_adam true \
    --offload_optimizer_to_cpu true \
    --offload_params_to_cpu false \
    --num_epochs 2 \
    --num_learners_per_node 1 \
    --vllm_tensor_parallel_size 1 \
    --beta 0.01 \
    --clip_higher 0.28 \
    --seed 3 \
    --local_eval_every 150 \
    --vllm_sync_backend gloo \
    --vllm_gpu_memory_utilization 0.30 \
    --gather_whole_model false \
    --async_steps 1 \
    --save_traces \
    --vllm_enforce_eager \
    --gradient_checkpointing \
    --single_gpu_mode true \
    --with_tracking \
    --save_freq 0 \
    --wandb_project grpo-qwen0.6b-gsm8k-v5

Here's a quick diff snippet, in case it's useful.

@@ class Args:
     fused_optimizer: bool = False
     """Whether to use fused optimizer"""
+    use_cpu_adam: bool = False
+    """Whether to use DeepSpeedCPUAdam"""
+    offload_optimizer_to_cpu: bool = False
+    """Whether to offload optimizer state to CPU"""
+    offload_params_to_cpu: bool = False
+    """Whether to offload model parameters to CPU"""
+    zero_force_ds_cpu_optimizer: Optional[bool] = None
+    """Force DeepSpeed CPU optimizer when offloading; if None, use default"""

@@ class PolicyTrainerRayProcess(...).from_pretrained(...):
-        ds_config = get_train_ds_config(offload=False, adam_offload=False,
-                                        stage=args.deepspeed_stage, bf16=True)
+        ds_config = get_train_ds_config(offload=args.offload_params_to_cpu,
+                                        adam_offload=args.offload_optimizer_to_cpu,
+                                        stage=args.deepspeed_stage, bf16=True)
+        if "zero_optimization" in ds_config:
+            ds_config["zero_optimization"].setdefault("offload_param", {})
+            ds_config["zero_optimization"].setdefault("offload_optimizer", {})
+            ds_config["zero_optimization"]["offload_param"]["device"] = (
+                "cpu" if args.offload_params_to_cpu else "none"
+            )
+            ds_config["zero_optimization"]["offload_param"].setdefault("pin_memory", False)
+            ds_config["zero_optimization"]["offload_optimizer"]["device"] = (
+                "cpu" if args.offload_optimizer_to_cpu else "none"
+            )
+            ds_config["zero_optimization"]["offload_optimizer"].setdefault("pin_memory", False)
+        if args.zero_force_ds_cpu_optimizer is not None:
+            # DeepSpeed reads this key from the top level of the config,
+            # not from inside "zero_optimization".
+            ds_config["zero_force_ds_cpu_optimizer"] = args.zero_force_ds_cpu_optimizer

@@ class PolicyTrainerRayProcess(...).from_pretrained(...):
-        # self.optimizer = AdamOptimizer(optim_params, lr=args.learning_rate)
-        self.optimizer = torch.optim.AdamW(optim_params, lr=args.learning_rate, fused=args.fused_optimizer)
+        if args.use_cpu_adam:
+            # Lazy import: DeepSpeedCPUAdam requires DeepSpeed's compiled
+            # CPU-Adam op, so only import it when the flag is set.
+            from deepspeed.ops.adam import DeepSpeedCPUAdam
+            self.optimizer = DeepSpeedCPUAdam(optim_params, lr=args.learning_rate)
+        else:
+            self.optimizer = torch.optim.AdamW(optim_params, lr=args.learning_rate, fused=args.fused_optimizer)

Metadata

Labels: PR welcome! (This is not on the roadmap of the `open-instruct` team, but we welcome outside contributions.)