MATH Training and Evaluation

This repository contains scripts for training and evaluating models on the MATH dataset.


Environment Setup

Our experiments were run on 8 × 80 GB A100 GPUs with CUDA 12.4. The prebuilt wheels below target Linux x86_64 and PyTorch 2.6, and the flash-attn wheel is built for CPython 3.10.
To reproduce the environment, run the following commands:

conda env create -f environment.yml
conda activate MATH
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.2/flashinfer_python-0.2.2+cu124torch2.6-cp38-abi3-linux_x86_64.whl#sha256=5e1cdb2fb7c0e9e9a2a2241becc52b771dc0093dd5f54e10f8bf612e46ef93a9
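Before launching a long run, it can help to confirm that the setup commands above took effect. The following is a minimal sketch, not part of this repository: the helper names `missing_packages` and `python_matches_wheel` are hypothetical, and the expected Python version (3.10) is inferred from the `cp310` tag in the flash-attn wheel name.

```python
import importlib.util
import sys

# Import names of the packages installed by the setup commands
# (assumption: torch itself comes from environment.yml).
REQUIRED = ("torch", "flash_attn", "flashinfer")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_matches_wheel(version_info=sys.version_info, expected=(3, 10)):
    """Check that the interpreter's major.minor version matches the cp310 wheel."""
    return tuple(version_info[:2]) == expected


if __name__ == "__main__":
    if not python_matches_wheel():
        print("Warning: the flash-attn wheel is built for Python 3.10, found",
              ".".join(str(part) for part in sys.version_info[:3]))
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

The check only looks packages up without importing them, so it runs quickly and does not need a GPU.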

Running the Code

After setting up the environment, you can start training with:

bash examples/Qwen2_5_MATH_1_5_b_CCGSPG.sh

During training, models are continuously evaluated, and all experiment logs are automatically tracked via Weights & Biases (wandb).


Using Pretrained Models

We provide pretrained checkpoints for reproducibility:

  1. Download the model from this link.

  2. Place the checkpoints folder in the following path:

    MATH_Code/checkpoints/MATH/NEW_qwen2_5_MATH_1_5b_ccpo_bce_beta_0.5
    
  3. Run the script:

    bash examples/Qwen2_5_MATH_1_5_b_CCGSPG.sh

You can then view the results in wandb or in the log file at:

MATH_Code/checkpoints/MATH/NEW_qwen2_5_MATH_1_5b_ccpo_bce_beta_0.5/training_process.log
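If wandb is not available, the log file can be inspected directly. The snippet below is a small sketch that prints the last lines of the log; the `tail` helper is hypothetical and the format of the log contents is not assumed.

```python
from collections import deque
from pathlib import Path

# Log path as given in this README, relative to the directory containing MATH_Code.
LOG_PATH = Path(
    "MATH_Code/checkpoints/MATH/"
    "NEW_qwen2_5_MATH_1_5b_ccpo_bce_beta_0.5/training_process.log"
)


def tail(path, n=20):
    """Return the last `n` lines of a text file, without trailing newlines."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # deque with maxlen keeps only the most recent n lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


if __name__ == "__main__":
    if LOG_PATH.exists():
        for line in tail(LOG_PATH):
            print(line)
    else:
        print("Log file not found:", LOG_PATH)
```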

Evaluation Metrics

To compute Accuracy, Expected Calibration Error (ECE), and Brier Score (BS), specify the path to your generated outputs and run:

python cal_metric.py
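For reference, the three metrics can be sketched as follows. This is an illustration of the standard definitions, not the code in cal_metric.py: it assumes each sample provides a confidence in [0, 1] and a 0/1 correctness label, and it uses 10 equal-width bins for ECE, which may differ from the script's settings.

```python
def accuracy(correct):
    """Fraction of samples whose answer is correct (labels are 0 or 1)."""
    return sum(correct) / len(correct)


def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    total = sum((c - y) ** 2 for c, y in zip(confidences, correct))
    return total / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(c * n_bins), n_bins - 1)
        bins[index].append((c, y))

    n_samples = len(correct)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        mean_acc = sum(y for _, y in members) / len(members)
        ece += (len(members) / n_samples) * abs(mean_acc - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy values for illustration only, not results from this repository.
    conf = [0.95, 0.85, 0.35, 0.65]
    labels = [1, 1, 0, 0]
    print("Accuracy:", accuracy(labels))
    print("Brier score:", brier_score(conf, labels))
    print("ECE:", expected_calibration_error(conf, labels))
```

Lower ECE and Brier Score indicate better-calibrated confidence; both are 0 for a model that is always correct with confidence 1 and always wrong with confidence 0.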

Acknowledgements

  • This repository is built upon verl. We thank the verl team for open-sourcing such a powerful RL4LLMs framework.
  • We thank the DeepScaleR and AdaRFT projects for the datasets and reward functions they provide.