When I post-train with the code below, why is the KL always nan? Even if I explicitly add ref_model to GRPOTrainer, it still doesn't work.
And why is the loss negative?
The logs look like this:
| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | entropy | rewards / format_reward / mean | rewards / format_reward / std | rewards / sorted_events_reward / mean | rewards / sorted_events_reward / std | rewards / score_reward / mean | rewards / score_reward / std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.283600 | 13.437500 | 21.810942 | 1115.437500 | 253.000000 | 1600.000000 | 0.500000 | 630.875000 | 253.000000 | 1585.000000 | nan | 0 | 3.125000 | 4.787136 | 3.750000 | 8.062258 | 6.562500 | 19.036697 |
| 2 | -0.149100 | 6.931818 | 14.526671 | 1203.250000 | 235.000000 | 1600.000000 | 0.625000 | 542.000000 | 235.000000 | 1594.000000 | nan | No Log | 1.250000 | 3.415650 | 2.500000 | 6.831301 | 3.181818 | 12.727274 |
| 3 | -0.110500 | 6.250000 | 17.677670 | 1316.687500 | 255.000000 | 1600.000000 | 0.750000 | 466.750000 | 255.000000 | 1078.000000 | nan | No Log | 0.625000 | 2.500000 | 1.250000 | 5.000000 | 4.375000 | 17.500000 |
| 4 | -0.014500 | 6.422414 | 12.048451 | 1173.500000 | 131.000000 | 1600.000000 | 0.687500 | 235.199997 | 131.000000 | 468.000000 | nan | No Log | 0.000000 | 0.000000 | 2.500000 | 6.831301 | 3.922414 | 11.039421 |
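On the two questions above, here are two quick sanity checks that can be run outside the trainer. The reward numbers are made up, and the idea that the kl column depends on GRPOConfig's `beta` coefficient is my assumption, not something taken from the TRL docs:

```python
import numpy as np
from trl import GRPOConfig

# (1) Why the loss can be negative (my reading of the GRPO objective, not the
# TRL source): rewards are standardized within each group of num_generations
# completions, so the advantages are roughly zero-mean and about half of them
# are negative, which lets the policy-gradient surrogate loss go below zero.
group_rewards = np.array([13.4, 0.0, 25.0, 6.2, 0.0, 31.0, 3.1, 28.8])  # toy values
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-4)
print(advantages)         # mix of positive and negative values
print(advantages.mean())  # ~0, so the averaged loss can have either sign

# (2) The KL coefficient: GRPOConfig exposes a `beta` field; my guess is that
# the kl column is only meaningful when it is non-zero.
cfg = GRPOConfig(output_dir="/tmp/grpo_check", beta=0.04)
print(cfg.beta)
```

If that assumption about `beta` is right, the reported training loss would contain only the advantage-weighted policy term, which can be negative on its own.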
```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.01,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_generations=8,  # decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    max_grad_norm=0.1,
    # report_to="wandb",
    output_dir="/root/megrez-tmp/grpo_outputs2",
    overwrite_output_dir=True,
    # push_to_hub=False,
    # hub_model_id=new_model_id,
    # hub_strategy="every_save",
    save_strategy="steps",
    save_steps=50,
    save_total_limit=1,
    num_train_epochs=3,
)

trainer = GRPOTrainer(
    # model and ref_model are loaded by Unsloth
    model=model,
    ref_model=ref_model,  # 👈 key point: this must be added!
    processing_class=tokenizer,
    reward_funcs=[
        format_reward,
        sorted_events_reward,
        score_reward,
    ],
    args=training_args,
    train_dataset=ds,
    callbacks=[swanlab_callback],
)
trainer.train()
```
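For completeness: model, ref_model, tokenizer, ds, and swanlab_callback come from earlier cells (Unsloth model loading, dataset prep, SwanLab callback) that are not pasted here. The reward functions follow the standard TRL reward-function shape: they receive the generated completions plus keyword arguments and return one float per completion. A simplified hypothetical sketch of that shape (the `<think>` pattern and the 5.0 score are placeholders, not the exact logic used above):

```python
import re

# Hypothetical sketch of a TRL-style reward function (placeholder logic):
# GRPOTrainer calls it with the list of generated completions and expects
# one float per completion back.
def format_reward(completions, **kwargs):
    # Assumes completions are plain strings; with conversational datasets
    # each entry is a list of message dicts instead.
    pattern = re.compile(r"<think>.*?</think>", re.DOTALL)
    return [5.0 if pattern.search(c) else 0.0 for c in completions]
```

As far as I understand, when several reward functions are passed, their outputs are summed per completion (optionally weighted via GRPOConfig's reward_weights).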