- Qwen 2.5-0.5B Sokoban PPO Training ✅ Completed (yuxuan)
- Qwen 2.5-7B Sokoban PPO Training ✅ Completed (mingjia)
Location: agents/*
- [✅] Handle ad‑hoc message format fixes in `get_llm_prompts()`
- [✅] Abstract base agent class for reusability
- Move common parts to `base_agent.py` and simplify specific agents
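The base-agent abstraction above could look like the following minimal sketch. Class and method names (`BaseAgent`, `parse_action`, the message handling) are illustrative, not the repo's actual API:

```python
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    """Shared multi-turn agent logic; env-specific agents only override the hooks."""

    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def get_llm_prompts(self) -> list[dict]:
        # Common message-format fixes live here once, instead of in every agent.
        return [m for m in self.messages if m["content"]]

    def observe(self, obs: str) -> None:
        self.messages.append({"role": "user", "content": obs})

    def act(self, llm_output: str) -> str:
        self.messages.append({"role": "assistant", "content": llm_output})
        return self.parse_action(llm_output)

    @abstractmethod
    def parse_action(self, llm_output: str) -> str:
        """Env-specific: extract the action string from raw LLM output."""

class SokobanAgent(BaseAgent):
    def parse_action(self, llm_output: str) -> str:
        # Illustrative parser: treat the last word as the move, e.g. "... I go Up" -> "up"
        return llm_output.strip().split()[-1].lower()
```

Each environment then subclasses `BaseAgent` and implements only `parse_action`, instead of duplicating the message bookkeeping.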
Location: rollout/sync_multi_turn_rollout.py
- [✅] Debug early stop logic in multi‑turn rollout
- [✅] Optimize reward computation (loss_mask, reward_mask)
- Replace `tokenizer.encode()` with `verl_F.tokenize_and_postprocess_data()`
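The `loss_mask`/`reward_mask` optimization above amounts to building two per-token masks: the policy loss applies only to response tokens, and the turn reward lands on the last token of each response span. A minimal sketch with assumed inputs (role tags per token and one scalar reward per assistant turn; the real code operates on tensors):

```python
def build_masks(token_roles: list[str], rewards_per_turn: list[float]):
    """Build per-token masks for multi-turn PPO.

    token_roles: a role tag per token, "prompt" or "response" (illustrative encoding).
    rewards_per_turn: one scalar reward per assistant turn.

    loss_mask: 1 on response tokens, the only positions the policy loss trains on.
    reward:    each turn's reward placed on the last token of that response span.
    """
    loss_mask = [1 if r == "response" else 0 for r in token_roles]
    reward = [0.0] * len(token_roles)
    turn = 0
    for i, r in enumerate(token_roles):
        # A response span ends where the next token is not a response token.
        is_turn_end = r == "response" and (
            i + 1 == len(token_roles) or token_roles[i + 1] != "response"
        )
        if is_turn_end:
            reward[i] = rewards_per_turn[turn]
            turn += 1
    return loss_mask, reward
```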
Location: trainer/agent_trainer.py
- [✅] Add hyperparameter for validation agent number
- [✅] Debug `_validate()` vs. mingjia's ragen implementation
- [✅] Checkpoint saving frequency settings
- [✅] Fix `is_action_valid` metric issue
- Integrate turn‑based loss mask
- Add extra metrics and LLM generation logging to Weights & Biases
- [✅] Correct unstable validation curve
- [✅] Test general ability from simple Sokoban to large Sokoban
- [✅] Integrate more envs
- [✅] gsm8k & blocksworld
- [✅] Tetris
- [✅] Align env parameters and message printout
- [✅] Agentic WebShop and BIRD
- [✅] Test general ability across all envs
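For the extra metrics and W&B logging mentioned above, per-rollout stats such as `is_action_valid` need to be reduced to scalars before logging. A sketch with assumed key names (`action_valid`, `reward` are illustrative, not the repo's actual schema):

```python
def aggregate_rollout_metrics(rollouts: list[dict]) -> dict:
    """Reduce per-episode stats to scalars suitable for a single wandb.log() call.

    Each rollout dict is assumed to carry per-turn 'action_valid' flags and a
    final scalar 'reward'; the key names are hypothetical.
    """
    flags = [f for r in rollouts for f in r["action_valid"]]
    return {
        "val/is_action_valid": sum(flags) / len(flags) if flags else 0.0,
        "val/mean_reward": sum(r["reward"] for r in rollouts) / len(rollouts),
        "val/num_turns": sum(len(r["action_valid"]) for r in rollouts) / len(rollouts),
    }
```

Logging one aggregated dict per validation step (rather than per episode) also keeps the validation curve smoother and cheaper to plot.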
- JAX PPO trainer integration (Tunix Integration Plan)
- [✅] write tunix_sync_multi_turn_rollout.py
- [✅] finish tunix multi turn rollout part
  - [✅] verify the final result ids
  - [✅] integrate it with tunix_agent_trainer.py
- [✅] test the training workflow in tunix_train.py
- [✅] draft a runnable tunix multi-turn rl training
- [✅] wandb metric visualization
- [✅] validation implementation
- [✅] draft validation rollout
  - [✅] understand tunix training and validation logic for better integration
- [✅] solve metric logging problem
- [✅] align with hyperparameters
- [✅] research ppo update
- verify verl wandb logging implementation
- try critic model automated surgery again
- wrap up tunix training code and write instruction
- [✅] critic model building + critic tpu allocation
- [✅] reward score allocation
- [✅] prompt ids and completions ids from input ids (pattern analysis)
- [✅] fsdp + tp to reduce memory
  - try cpu_offload
- calculate memory consumption
- abstract a uniform yaml config file
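The memory-consumption calculation above can be roughed out with simple arithmetic. This is a back-of-the-envelope sketch, not the repo's actual accounting: it assumes bf16 weights, fp32 gradients, fp32 Adam moments, and ignores activations and KV cache:

```python
def ppo_memory_gb(n_params: float, fsdp_shards: int = 1,
                  cpu_offload_optimizer: bool = False) -> float:
    """Rough per-device parameter/gradient/optimizer memory for one model, in GB."""
    param_bytes = 2 * n_params                                   # bf16 weights
    grad_bytes = 4 * n_params                                    # fp32 gradients
    optim_bytes = 0 if cpu_offload_optimizer else 8 * n_params   # Adam m and v, fp32
    return (param_bytes + grad_bytes + optim_bytes) / fsdp_shards / 1e9
```

For example, a 7B policy sharded 8 ways lands around 12 GB per device without offload, which makes clear why `cpu_offload` and FSDP+TP are on the list.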
- Abstract the framework to integrate different trainers
- Implement `uv` installation for faster package management
- [✅] Package as `grl` via `pyproject.toml`
- Convert `env_setup.sh` into an open and useful script
- Remove submodule and wrap VERL as a monkey patch
- Vision modality support for multi‑turn RL PPO training
- SFT (Supervised Fine‑Tuning) trainer
- Asynchronous multi‑turn rollout system
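The "wrap VERL as a monkey patch" item above means overriding the few functions we change at import time, instead of carrying a forked submodule. A generic sketch of the pattern, using a stand-in module rather than actual verl symbols:

```python
import types

# Stand-in for a third-party module we cannot edit directly (e.g. a verl internal).
upstream = types.ModuleType("upstream")
upstream.compute_score = lambda x: x  # original behaviour

def apply_patches(module: types.ModuleType) -> None:
    """Replace upstream functions in place; every importer then sees the patched version."""
    original = module.compute_score

    def patched_compute_score(x):
        # Wrap rather than rewrite: call the original, then apply our fix on top.
        return original(x) + 1

    module.compute_score = patched_compute_score

apply_patches(upstream)
```

Calling `apply_patches` once at package import keeps the delta against upstream VERL in one auditable file instead of a diverging submodule.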
When working on any of these items:
- Create a feature branch from main
- Follow the existing code style and patterns
- Add appropriate tests and documentation
- Submit a pull request with clear description of changes
- Priority should be given to completing the 7B model performance reproduction
- Codebase improvements should focus on maintainability and performance
- New features should be developed incrementally with proper testing