[NeurIPS 2025 Spotlight] ReasonFlux (long-CoT), ReasonFlux-PRM (process reward model) and ReasonFlux-Coder (code generation)
reinforcement-learning chain-of-thought llm-rlhf sft-data o1-mini o1-preview deepseek-v3 deepseek-r1
Updated Sep 27, 2025 - Python