Hi, I am running into an error when I use the GRPO trainer with a `reward_func` that is a pretrained reward model instead of a custom reward function. It throws the following error:
```
assert padding_idx < weight.size(0), "Padding_idx must be within num_embeddings"
AssertionError: Padding_idx must be within num_embeddings
```
But I did check and made sure `padding_idx` is within `weight.size(0)` (i.e. `num_embeddings`). I was able to reproduce this error using the minimal example provided in the GRPO docs by replacing the custom reward function with a pretrained reward model (`trl-lib/Qwen2-0.5B-Reward`):
```python
# test_grpo.py
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Define the reward function, which rewards completions that are close to 20 characters
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(
    output_dir="data/Qwen2-0.5B-GRPO",
    logging_steps=10,
    per_device_train_batch_size=2,
    bf16=True,
    num_generations=2,
    max_prompt_length=128,
    max_completion_length=128,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs="trl-lib/Qwen2-0.5B-Reward",  # reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
I run the script with DeepSpeed using trl v0.16.0.