Hi maintainers of open-r1. I wanted to comment on the recently added repetition penalty after talking with one of the authors of the Demystifying Long CoT paper.
It looks like the wrong repetition penalty was imported from the Demystifying Long CoT code. The repetition penalty actually used in the paper is the class RepetitionDensePenalty.
This may be a problem because the two implementations differ in granularity: RepetitionDensePenalty applies a penalty at the token level, while get_repetition_penalty computes a single global penalty for the whole sequence.
I had an LLM generate a feature comparison:
Feature | RepetitionDensePenalty | get_repetition_penalty
---|---|---
Granularity | Token-level penalties | Global sequence penalty |
Penalty Type | Fixed value per token | Scaled based on repetition |
Implementation | Modifies token rewards directly | Computes a single penalty score |
Use Case | Reinforcement learning (RL) reward models | Global repetition control in scoring |
Effect on Rewards | Directly affects token-level reward tensor | Adjusts overall sequence score |
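
To make the distinction in the table concrete, here is a minimal sketch of the two styles of penalty. It is not the code from either repository: the function names, the n-gram definition of "repetition", and the default values are all assumptions chosen for illustration. It only shows that one approach returns a single score per sequence while the other returns one value per token.

```python
from typing import List, Sequence


def global_repetition_penalty(
    tokens: Sequence[str], ngram_size: int = 3, max_penalty: float = -1.0
) -> float:
    """Hypothetical sequence-level penalty: one scalar for the whole completion.

    The penalty is scaled by the fraction of n-grams that are repeats.
    """
    if len(tokens) < ngram_size:
        return 0.0
    ngrams = [
        tuple(tokens[i : i + ngram_size])
        for i in range(len(tokens) - ngram_size + 1)
    ]
    repeated_fraction = 1.0 - len(set(ngrams)) / len(ngrams)
    return repeated_fraction * max_penalty


def dense_repetition_penalty(
    tokens: Sequence[str], ngram_size: int = 3, penalty: float = -0.05
) -> List[float]:
    """Hypothetical token-level penalty: one value per token.

    Every token that belongs to an n-gram already seen earlier in the
    sequence receives a fixed penalty; all other tokens receive 0.0.
    """
    rewards = [0.0] * len(tokens)
    seen = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i : i + ngram_size])
        if ngram in seen:
            for j in range(i, i + ngram_size):
                rewards[j] = penalty
        seen.add(ngram)
    return rewards


if __name__ == "__main__":
    example = "a b c a b c d".split()
    print(global_repetition_penalty(example))  # a single float
    print(dense_repetition_penalty(example))   # a list with one entry per token
```

With a per-token reward, the penalty lands on the positions where the repetition occurs, whereas the scalar version spreads the same signal over the entire sequence. That is the difference this issue is asking about.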