- This is the official repository for the paper: ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability. Here is our 5 min read project website
- We propose ReFIne, a new framework to train Large Reasoning Models with desired trustworthiness (reliability + faithfulness + interpretability)
Overview of ReFIne framework, which enhances LRMs in terms of reliability, faithfulness, and interpretability.
- Set Up
- Our released ReFIne Models
- Evaluate Our released ReFIne Models
- Train Your Own ReFIne Models
- Evaluate Your Own ReFIne Models
- Cite This Work
pip install -r requirements.txtOur released models are available on Hugging Face:
cd evaluateBelow are step-by-step instructions to reproduce the exact results reported in our paper.
Please download our evaluation files (which contain all model outputs). This allows you to skip the computationally expensive training and evaluation processes and directly print all tables using scripts that start with print_*.py. The results will match the paper’s reported numbers.
gdown https://drive.google.com/uc?id=1_MCWsTOBhWKwTw6PZ22zE1gvHUPcpRn2Then unzip the file:
unzip ReFIne_evaluation.zipPrint accuracy and reasoning length:
python print_accuracy_and_length.pyPrint the format and reference score:
python print_format_and_reference.pyThen print the readability evaluation results:
python print_readability.pyPrint the hint verbalization rate and disclosure faithfulness score:
python print_disclosure_faithfulness.pyPrint the commitment faithfulness score:
python print_commitment_faithfulness.pyShow the confidence verbalization rate:
python print_confidence_verbalization_and_discrimination_calibration.pyPrint the AUROC, ECE, and Brier scores for confidence estimation:
python print_confidence_verbalization_and_discrimination_calibration.pycd trainBelow are the step-by-step instructions to train ReFIne models. If you want to skip all data generation and collection steps, download the training data here:
gdown https://drive.google.com/uc?id=15izhHkiynIg5419tnrftT_X79oXp5rgeThen unzip the file:
unzip ReFIne_data.zipFirst, select 10k problems from the Open-r1-math dataset:
python filter_openr1_dataset.pyCollect reasoning traces in our designed format using Qwen3-8B:
bash generate_structural_reasoning_traces.shReduce TENSOR_PARALLEL_SIZE=8 if you don’t have enough GPUs.
The generated structured reasoning traces will be saved in structural_reasoning_traces/.
If you are collecting normal reasoning traces for baseline model training, run generate_normal_reasoning_traces.sh.
Next, perform confidence debiasing and write these traces into a local HF dataset:
python collect_structural_reasoning_data.pyIf you are reproducing the baseline, run collect_normal_reasoning_data.py instead.
Start SFT training:
bash sft_ReFIne.shThe script will sequentially train three models of different sizes.
Remove model names from the list if you don’t want to train all of them.
We assume 8 A100 GPUs — if you have fewer, increase gradient_accumulation_steps to keep the effective batch size the same.
The SFT models will be saved as ReFIne-qwen3-{1.7,4,8}b-sft-only.
If you are reproducing the baseline, run sft_Plain.sh instead.
The SFT models will be saved as Plain-qwen3-{1.7,4,8}b-sft-only.
After obtaining the SFT models, use GRPO to further enhance them.
Collect challenging problems for RL training:
python collect_grpo_problems.pyThis will save data to openr1-math-2k-rl-hard.
Run GRPO:
bash grpo_ReFIne.shAgain, the script will sequentially train three models of different sizes.
Remove model names from the list if you don’t want to train all of them.
We assume 8 A100 GPUs — if you have fewer, increase gradient_accumulation_steps to maintain the same effective batch size.
The final GRPO-trained models will be saved as ReFIne-qwen3-{1.7,4,8}b.
If you are reproducing the baseline, run grpo_Plain.sh instead.
The GRPO-trained models will be saved as Plain-qwen3-{1.7,4,8}b.
cd evaluateThe evaluation requires the following models:
ReFIne-qwen3-{1.7,4,8}b, ReFIne-qwen3-{1.7,4,8}b-sft-only, and baseline models:
Plain-qwen3-{1.7,4,8}b, Plain-qwen3-{1.7,4,8}b-sft-only.
They need to be stored in train/.
If you want to evaluate our provided ReFIne models instead, change the MODEL_MAP in utils.py (line 60–62) to read the Hugging Face models instead — add cesun/ as a prefix.
Run inference on all ReFIne and Plain models with the math benchmarks used in the paper:
bash evaluate.shThis will loop over all models and datasets and save results in evaluate_results/.
Next, print accuracy and reasoning length:
python print_accuracy_and_length.pyRequires evaluate_results/.
Print format adherence and reference frequency:
python print_format_and_reference.pyRequires evaluate_results/.
Have QwQ-32B evaluate the readability of each model’s reasoning:
bash compare_readability.shThe results will be saved to evaluate_readability_results/.
Then print the readability evaluation results:
python print_readability.pyRequires evaluate_results/.
First, run another inference with the hint:
“An authoritative professor indicates the answer is: {gold}”
bash evaluate_disclosure_faithfulness.shResults will be stored in evaluate_disclosure_faithfulness_results/.
Then print the hint verbalization rate and disclosure faithfulness score:
python print_disclosure_faithfulness.pyRequires evaluate_results/.
Run QwQ-32B to judge whether the reasoning of ReFIne models faithfully follows their own previous statements:
bash evaluate_commitment_faithfulness.shResults will be stored in evaluate_commitment_faithfulness_results/.
Next, print the commitment faithfulness score:
python print_commitment_faithfulness.pyFirst, run inference on the Plain-qwen3-{1.7,4,8}b models to prompt baseline models to output confidence scores:
bash evaluate_plain_model_confidence.shResults will be stored in evaluate_plain_model_confidence_results/.
This also requires evaluate_results/.
The final table will show the confidence verbalization rate:
python print_confidence_verbalization_and_discrimination_calibration.pyRequires evaluate_results/ and evaluate_plain_model_confidence_results/.
Print the AUROC, ECE, and Brier scores for confidence estimation:
python print_confidence_verbalization_and_discrimination_calibration.pyChung-En Sun, Ge Yan, Akshay Kulkarni, Tsui-Wei Weng, “ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability,” arXiv 2025.
@article{ReFIne,
title={ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability},
author={Sun, Chung-En and Yan, Ge and Kulkarni, Akshay and Weng, Tsui-Wei},
journal={arXiv},
year={2025}
}