
Why is kl = nan during GRPO training? #704

@uilstong

Description

When I post-train with the code below, why is the kl metric always nan? Even if I explicitly pass ref_model to GRPOTrainer, it still doesn't work.
And why is the loss negative?
The logs look like this:

| Step | Training Loss | reward | reward_std | kl | entropy |
|---|---|---|---|---|---|
| 1 | -0.283600 | 13.437500 | 21.810942 | nan | 0 |
| 2 | -0.149100 | 6.931818 | 14.526671 | nan | No Log |
| 3 | -0.110500 | 6.250000 | 17.677670 | nan | No Log |
| 4 | -0.014500 | 6.422414 | 12.048451 | nan | No Log |

| Step | Training Loss | reward | reward_std | completions/mean_length | completions/min_length | completions/max_length | completions/clipped_ratio | completions/mean_terminated_length | completions/min_terminated_length | completions/max_terminated_length | kl | entropy | rewards/format_reward/mean | rewards/format_reward/std | rewards/sorted_events_reward/mean | rewards/sorted_events_reward/std | rewards/score_reward/mean | rewards/score_reward/std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.283600 | 13.437500 | 21.810942 | 1115.437500 | 253.000000 | 1600.000000 | 0.500000 | 630.875000 | 253.000000 | 1585.000000 | nan | 0 | 3.125000 | 4.787136 | 3.750000 | 8.062258 | 6.562500 | 19.036697 |
| 2 | -0.149100 | 6.931818 | 14.526671 | 1203.250000 | 235.000000 | 1600.000000 | 0.625000 | 542.000000 | 235.000000 | 1594.000000 | nan | No Log | 1.250000 | 3.415650 | 2.500000 | 6.831301 | 3.181818 | 12.727274 |
| 3 | -0.110500 | 6.250000 | 17.677670 | 1316.687500 | 255.000000 | 1600.000000 | 0.750000 | 466.750000 | 255.000000 | 1078.000000 | nan | No Log | 0.625000 | 2.500000 | 1.250000 | 5.000000 | 4.375000 | 17.500000 |
| 4 | -0.014500 | 6.422414 | 12.048451 | 1173.500000 | 131.000000 | 1600.000000 | 0.687500 | 235.199997 | 131.000000 | 468.000000 | nan | No Log | 0.000000 | 0.000000 | 2.500000 | 6.831301 | 3.922414 | 11.039421 |
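
On the negative loss: my reading (treat the exact form below as an assumption taken from the GRPO objective described in the TRL docs and the DeepSeekMath paper, not from this trainer's source) is that the reported loss is the negative of a clipped surrogate objective built from group-normalized advantages, so it is not bounded below by zero:

    \mathcal{L}_{\mathrm{GRPO}}
      = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
        \left[\min\!\left(\rho_{i,t}\hat{A}_i,\ \mathrm{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\right)
        - \beta\,\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right)\right],
    \qquad
    \hat{A}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}

Since the advantages are standardized around zero, the bracketed objective can be positive, and its negative (the logged loss) is then negative; a negative loss by itself is expected and not an error.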


from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.01,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_generations=8,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    max_grad_norm=0.1,
    # report_to="wandb",
    output_dir="/root/megrez-tmp/grpo_outputs2",
    overwrite_output_dir=True,
    # push_to_hub=False,
    # hub_model_id=new_model_id,
    # hub_strategy="every_save",
    save_strategy="steps",
    save_steps=50,
    save_total_limit=1,
    num_train_epochs=3,
)
trainer = GRPOTrainer(
    # model and ref_model are loaded with unsloth earlier in the script
    model=model,
    ref_model=ref_model,  # 👈 crucial! must be added!
    processing_class=tokenizer,
    reward_funcs=[
        format_reward,
        sorted_events_reward,
        score_reward,
    ],
    args=training_args,
    train_dataset=ds,
    callbacks=[swanlab_callback],
)
trainer.train()
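
One thing I would check (an assumption on my side, not a confirmed diagnosis): GRPOConfig has a `beta` parameter for the KL coefficient, and in recent TRL versions it defaults to 0.0, in which case no KL penalty enters the loss, which could leave the kl column as nan. Also, in the TRL versions I have looked at, GRPOTrainer takes no ref_model argument and manages the reference policy internally. A minimal sketch of what I mean, reusing the variables from the script above (model, tokenizer, ds, the reward functions) and a hypothetical output path:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=1e-5,
    beta=0.04,  # KL coefficient; 0.0 disables the KL penalty (and, I assume, its logging)
    logging_steps=1,
    per_device_train_batch_size=8,
    num_generations=8,
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    output_dir="/root/megrez-tmp/grpo_outputs_beta_test",  # hypothetical path for this sketch
)

trainer = GRPOTrainer(
    model=model,  # same unsloth-loaded model as above
    reward_funcs=[format_reward, sorted_events_reward, score_reward],
    args=training_args,
    train_dataset=ds,
    processing_class=tokenizer,
    # note: no ref_model kwarg here; GRPOTrainer sets up its own reference policy when beta > 0
)
trainer.train()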
