Reproduce Jan models with Verl codebase

We need to learn Verl codebase and try to reproduce the result of Jan-v1, Jan-nano to ensure consistency.

- Tools
- multi turn Agent loop
- roll out 
- custom reward with chat history
- wandb logger for custom reward
Verl Documentation: https://verl.readthedocs.io/en/latest/sglang_multiturn/multiturn.html#multi-turn-rollout-support