
Reproducing JustRL #4

@lee-junjie

Description


Thank you for your excellent work on JustRL. I really appreciate the simplicity of the approach and am very interested in reproducing the results using VERL.

However, I have encountered several issues while attempting to reproduce the reported results.

Following your recommended settings in this comment:
#3 (comment)

I tried both VERL 0.2.0 and 0.2.0.post2 and ran the provided training script. During execution, I encountered multiple configuration validation errors: more than ten configuration keys referenced in the script do not exist in VERL 0.2.0. One example error is shown below:

Could not override 'algorithm.use_kl_in_reward'.
To append to your config use +algorithm.use_kl_in_reward=False
Key 'use_kl_in_reward' is not in struct
    full_key: algorithm.use_kl_in_reward
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
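For reference, the full stack trace can be obtained exactly as the error message suggests; the script name below is a placeholder for whichever launch script is actually used:

```shell
# Placeholder script name; substitute the actual JustRL training script.
HYDRA_FULL_ERROR=1 bash run_justrl_train.sh
```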

The following configuration options do not appear to exist in VERL 0.2.0. To pass configuration validation, I had to remove them from the script; however, this may result in inconsistencies with the setup you used:

  • algorithm.use_kl_in_reward
  • data.filter_overlong_prompts
  • data.truncation
  • actor_rollout_ref.actor.optim.lr_warmup_steps
  • actor_rollout_ref.actor.optim.weight_decay
  • actor_rollout_ref.actor.use_kl_loss
  • actor_rollout_ref.actor.clip_ratio_low
  • actor_rollout_ref.actor.clip_ratio_high
  • actor_rollout_ref.actor.clip_ratio_c
  • actor_rollout_ref.rollout.val_kwargs.do_sample
  • actor_rollout_ref.rollout.val_kwargs.n
  • actor_rollout_ref.rollout.val_kwargs.temperature
  • actor_rollout_ref.rollout.val_kwargs.top_p
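For completeness, the alternative to deleting these keys is the one Hydra's own error message suggests: prefixing each unknown key with `+` so it is appended to the config rather than treated as an override. A sketch (the `verl.trainer.main_ppo` entry point and the values shown are illustrative, and whether VERL 0.2.0 actually consumes the appended keys is a separate question):

```shell
# Sketch only: '+' tells Hydra to append keys absent from the base config
# instead of failing struct-mode validation. Values here are placeholders.
python -m verl.trainer.main_ppo \
    +algorithm.use_kl_in_reward=False \
    +data.filter_overlong_prompts=True \
    +actor_rollout_ref.actor.use_kl_loss=False
```

This silences the validation error, but keys that the 0.2.0 trainer never reads would simply be ignored, which could mask real setup differences.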

Could you please clarify whether these configuration options were implemented as custom modifications on top of VERL 0.2.0, or whether a different VERL version was used for your experiments?

Additionally, in your replies on both GitHub and Zhihu, you mentioned that the JustRL code is largely the same as VERL and that the results should be reproducible across multiple VERL versions.

Have you tested JustRL with more recent versions of VERL? If so, could you please share which VERL and vLLM versions you would recommend for reproduction? I would be happy to try them on my side. If not, could you please share additional details about the original environment, such as the Docker image, exact configurations, and code modifications required to reproduce the results?
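When comparing environments across reproduction attempts, a small sketch like the following (plain `importlib.metadata`, no VERL-specific APIs) can report exactly which VERL and vLLM versions are installed:

```python
# Report the installed versions of the packages relevant to reproduction.
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions


print(installed_versions(["verl", "vllm"]))
```

Including this output alongside a training script would make it much easier for others to match environments exactly.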

Finally, it would be greatly appreciated if other teams who have successfully reproduced the results could share their VERL/vLLM versions and relevant training scripts.

Thank you very much for your time and help.
