
Why is kl = nan during GRPO training? #704

@uilstong

Description

When I post-train with the code below, why is the kl metric always nan? Even if I explicitly pass ref_model to GRPOTrainer, it still doesn't work.
And why is the loss negative?
The logs look like this:

| Step | Training Loss | reward | reward_std | kl | entropy |
|---|---|---|---|---|---|
| 1 | -0.283600 | 13.437500 | 21.810942 | nan | 0 |
| 2 | -0.149100 | 6.931818 | 14.526671 | nan | No Log |
| 3 | -0.110500 | 6.250000 | 17.677670 | nan | No Log |
| 4 | -0.014500 | 6.422414 | 12.048451 | nan | No Log |

| Step | Training Loss | reward | reward_std | completions/mean_length | completions/min_length | completions/max_length | completions/clipped_ratio | completions/mean_terminated_length | completions/min_terminated_length | completions/max_terminated_length | kl | entropy | rewards/format_reward/mean | rewards/format_reward/std | rewards/sorted_events_reward/mean | rewards/sorted_events_reward/std | rewards/score_reward/mean | rewards/score_reward/std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.283600 | 13.437500 | 21.810942 | 1115.437500 | 253.000000 | 1600.000000 | 0.500000 | 630.875000 | 253.000000 | 1585.000000 | nan | 0 | 3.125000 | 4.787136 | 3.750000 | 8.062258 | 6.562500 | 19.036697 |
| 2 | -0.149100 | 6.931818 | 14.526671 | 1203.250000 | 235.000000 | 1600.000000 | 0.625000 | 542.000000 | 235.000000 | 1594.000000 | nan | No Log | 1.250000 | 3.415650 | 2.500000 | 6.831301 | 3.181818 | 12.727274 |
| 3 | -0.110500 | 6.250000 | 17.677670 | 1316.687500 | 255.000000 | 1600.000000 | 0.750000 | 466.750000 | 255.000000 | 1078.000000 | nan | No Log | 0.625000 | 2.500000 | 1.250000 | 5.000000 | 4.375000 | 17.500000 |
| 4 | -0.014500 | 6.422414 | 12.048451 | 1173.500000 | 131.000000 | 1600.000000 | 0.687500 | 235.199997 | 131.000000 | 468.000000 | nan | No Log | 0.000000 | 0.000000 | 2.500000 | 6.831301 | 3.922414 | 11.039421 |
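
On the negative loss: my reading (treat the exact form below as an assumption taken from the GRPO objective described in the TRL docs and the DeepSeekMath paper, not from this trainer's source) is that the reported loss is the negative of a clipped surrogate objective built from group-normalized advantages, so it is not bounded below by zero:

    \mathcal{L}_{\mathrm{GRPO}}
      = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
        \left[\min\!\left(\rho_{i,t}\hat{A}_i,\ \mathrm{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\right)
        - \beta\,\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\right)\right],
    \qquad
    \hat{A}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}

Since the advantages are standardized around zero, the bracketed objective can be positive, and its negative (the logged loss) is then negative; a negative loss by itself is expected and not an error.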


from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=1e-5,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.01,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    num_generations=8,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    max_grad_norm=0.1,
    # report_to="wandb",
    output_dir="/root/megrez-tmp/grpo_outputs2",
    overwrite_output_dir=True,
    # push_to_hub=False,
    # hub_model_id=new_model_id,
    # hub_strategy="every_save",
    save_strategy="steps",
    save_steps=50,
    save_total_limit=1,
    num_train_epochs=3,
)
trainer = GRPOTrainer(
    # model and ref_model are loaded with unsloth earlier in the script
    model=model,
    ref_model=ref_model,  # 👈 crucial! must be added!
    processing_class=tokenizer,
    reward_funcs=[
        format_reward,
        sorted_events_reward,
        score_reward,
    ],
    args=training_args,
    train_dataset=ds,
    callbacks=[swanlab_callback],
)
trainer.train()
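
One thing I would check (an assumption on my side, not a confirmed diagnosis): GRPOConfig has a `beta` parameter for the KL coefficient, and in recent TRL versions it defaults to 0.0, in which case no KL penalty enters the loss, which could leave the kl column as nan. Also, in the TRL versions I have looked at, GRPOTrainer takes no ref_model argument and manages the reference policy internally. A minimal sketch of what I mean, reusing the variables from the script above (model, tokenizer, ds, the reward functions) and a hypothetical output path:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=1e-5,
    beta=0.04,  # KL coefficient; 0.0 disables the KL penalty (and, I assume, its logging)
    logging_steps=1,
    per_device_train_batch_size=8,
    num_generations=8,
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    output_dir="/root/megrez-tmp/grpo_outputs_beta_test",  # hypothetical path for this sketch
)

trainer = GRPOTrainer(
    model=model,  # same unsloth-loaded model as above
    reward_funcs=[format_reward, sorted_events_reward, score_reward],
    args=training_args,
    train_dataset=ds,
    processing_class=tokenizer,
    # note: no ref_model kwarg here; GRPOTrainer sets up its own reference policy when beta > 0
)
trainer.train()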
