This repository contains scripts for training and evaluating models on the MATH dataset.
Our experiments were run on 8×80GB A100 GPUs with CUDA 12.4.
To reproduce the environment, run the following commands:
conda env create -f environment.yml
conda activate MATH
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2/flashinfer_python-0.2.2+cu124torch2.6-cp38-abi3-linux_x86_64.whl#sha256=5e1cdb2fb7c0e9e9a2a2241becc52b771dc0093dd5f54e10f8bf612e46ef93a9
After setting up the environment, you can start training with:
bash examples/Qwen2_5_MATH_1_5_b_CCGSPG.sh
During training, models are continuously evaluated, and all experiment logs are automatically tracked via Weights & Biases (wandb).
We provide pretrained checkpoints for reproducibility:
- Download the model from this link.
- Place the checkpoints folder in the following path: MATH_Code/checkpoints/MATH/NEW_qwen2_5_MATH_1_5b_ccpo_bce_beta_0.5
- Run the script:
bash examples/Qwen2_5_MATH_1_5_b_CCGSPG.sh
You can then view the results in wandb or check the log file at:
MATH_Code/checkpoints/MATH/NEW_qwen2_5_MATH_1_5b_ccpo_bce_beta_0.5/training_process.log
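If wandb is not at hand, the log file above can also be inspected directly. The following is a minimal sketch, not part of this repository: the helper name `tail` is hypothetical, and it assumes the log is a plain-text file at the path given above.

```python
from collections import deque
from pathlib import Path

# Log path as given above; adjust it if your checkpoints live elsewhere.
LOG_PATH = Path(
    "MATH_Code/checkpoints/MATH/NEW_qwen2_5_MATH_1_5b_ccpo_bce_beta_0.5/training_process.log"
)


def tail(path, n=20):
    """Return the last n lines of a text file (hypothetical helper).

    The file is streamed line by line, and only the most recent n lines
    are kept in memory at any time.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


if __name__ == "__main__":
    if LOG_PATH.exists():
        for line in tail(LOG_PATH):
            print(line)
    else:
        print(f"Log file not found: {LOG_PATH}")
```

Reading the file as text with `errors="replace"` is a defensive choice, so that a stray undecodable byte in the log does not abort the read.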
To compute evaluation metrics such as Accuracy, Expected Calibration Error (ECE), and Brier Score (BS), simply specify the path to your generated outputs and run:
python cal_metric.py
- This repository is built upon verl. We thank the verl team for open-sourcing such a powerful RL4LLMs framework.
- We sincerely acknowledge the datasets and reward functions provided by DeepScaleR and AdaRFT.
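For reference, the evaluation metrics named earlier (Accuracy, Expected Calibration Error, and Brier Score) can be sketched in a few lines of plain Python. This is an illustration of the standard definitions, not the contents of cal_metric.py: the function names, the input format (one confidence in [0, 1] and one 0/1 correctness label per sample), and the choice of 10 equal-width bins for ECE are all assumptions.

```python
def accuracy(correct):
    """Fraction of samples answered correctly (labels are 0 or 1)."""
    return sum(correct) / len(correct)


def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    total = sum((p - y) ** 2 for p, y in zip(confidences, correct))
    return total / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (assumed binning scheme).

    Each bin contributes |bin accuracy - bin mean confidence|, weighted by
    the fraction of samples that fall into that bin.
    """
    n = len(correct)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        avg_acc = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_acc - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from this repository.
    confidences = [0.95, 0.85, 0.35, 0.65]
    correct = [1, 1, 0, 0]
    print("Accuracy:", accuracy(correct))
    print("Brier Score:", brier_score(confidences, correct))
    print("ECE:", expected_calibration_error(confidences, correct))
```

Lower ECE and Brier Score indicate better-calibrated confidences. Other ECE variants, such as equal-mass bins or a different bin count, give different values, so compare numbers only when they are computed the same way.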