Authors: Shrey Pandit*, Austin Xu*, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, Shafiq Joty
This repository contains code for running evaluation on Hard2Verify.
- Paper: https://arxiv.org/abs/2510.13744
- Dataset: https://huggingface.co/datasets/Salesforce/Hard2Verify
This dataset was generated using GPT, Gemini, and Claude and should not be used to develop competing products.
Run the following to set up the inference environment.
conda create -n h2v python=3.12
conda activate h2v
pip install uv
uv pip install -r requirements.txt
This also installs vLLM for locally hosted models. Please install flash-attn separately, if desired.
run_eval.py supports three modes of inference: OpenAI API, Together API, and a locally hosted model server (e.g., via vLLM or SGLang). The proper endpoint is selected based on the name of the model passed into the script. See the ModelFactory class in model.py, which is controlled by the OPENAI_MODELS and TOGETHER_MODELS variables in utils.py.
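As a rough illustration of name-based routing, the sketch below picks a backend from the model name. It is not the repository's ModelFactory: the function select_backend, the example model names, the returned dictionary layout, and the local base URL are all assumptions made for this example.

```python
# Hypothetical model lists; the real ones live in utils.py and may differ.
OPENAI_MODELS = {"example-openai-model"}
TOGETHER_MODELS = {"example-together-model"}


def select_backend(model_name: str, port: int = 8000) -> dict:
    """Describe which endpoint a model name would be routed to (illustrative)."""
    if model_name in OPENAI_MODELS:
        return {"backend": "openai", "api_key_env": "OPENAI_API_KEY"}
    if model_name in TOGETHER_MODELS:
        return {"backend": "together", "api_key_env": "TOGETHER_API_KEY"}
    # Any other name is assumed to be a locally hosted server (e.g., vLLM or SGLang).
    # The URL shape below is an assumption, not taken from the repository.
    return {"backend": "local", "base_url": f"http://localhost:{port}/v1"}


if __name__ == "__main__":
    print(select_backend("example-openai-model"))
    print(select_backend("my-local-model", port=8000))
```

Under this scheme, a model missing from both lists falls through to the local server, which is why --port only matters for local inference.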
Sampling parameters: In our evaluation, we use each model's recommended sampling parameters if they exist. Otherwise, we use greedy decoding. The script defaults to the recommended sampling parameters, as implemented in the get_sampling_params function in utils.py. Users can override this if desired with --override_default_sampling_params.
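The fallback logic described above can be sketched as follows. This is a minimal illustration, not the actual get_sampling_params: the function resolve_sampling_params, the RECOMMENDED table, and every numeric value are hypothetical, and treating unspecified override values as greedy defaults is an assumption.

```python
from typing import Optional

# Hypothetical per-model recommendations; the values are illustrative only.
RECOMMENDED = {
    "example-model": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
}

# Greedy decoding, used when no recommendation exists.
GREEDY = {"temperature": 0.0, "top_p": 1.0, "top_k": -1}


def resolve_sampling_params(
    model_name: str,
    override: bool = False,
    user_params: Optional[dict] = None,
) -> dict:
    """Pick sampling parameters: user override, else recommended, else greedy."""
    if override:
        # Assumption: anything the user leaves unset falls back to greedy values.
        return {**GREEDY, **(user_params or {})}
    return dict(RECOMMENDED.get(model_name, GREEDY))


if __name__ == "__main__":
    print(resolve_sampling_params("example-model"))
    print(resolve_sampling_params("unknown-model"))
    print(resolve_sampling_params("example-model", override=True,
                                  user_params={"temperature": 0.3}))
```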
Set the OPENAI_API_KEY and/or TOGETHER_API_KEY environment variables if running inference with API models.
python run_eval.py \
--task_type step_level \ <-- This is the eval task (step_level or error_id)
--model_name name_of_hosted_model \ <-- Based on model name, will route to one of the three inference modes (OpenAI API, Together API, or local server)
--output_directory path/to/output/ \ <-- Parent output directory. Results will save as path/to/output/model_name/
--force_rerun \ <-- Script checks if results exist. Overwrite existing w/ --force_rerun
--debug \ <-- Only runs the first 10 samples
--port 8000 \ <-- Port of locally hosted model, unused for non-local inference
--override_default_sampling_params \ <-- Override per-model default sampling parameters with sampling parameters below
--temperature 0.0 \
--top_p 1.0 \
--max_tokens 32768 \
--top_k -1 \
--reasoning_effort {low, medium, high} \ <-- gpt-oss/GPT-5 models
--enable_thinking \ <-- For Qwen3 models
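The interaction between --output_directory and --force_rerun in the command above can be sketched as below. This is an assumption-laden illustration rather than the script's actual check: the function should_run and the results file name results.json are hypothetical.

```python
from pathlib import Path


def should_run(
    output_directory: str,
    model_name: str,
    force_rerun: bool = False,
    results_file: str = "results.json",  # hypothetical file name
) -> bool:
    """Return True if evaluation should run, False if existing results can be reused."""
    # Results are assumed to live under <output_directory>/<model_name>/.
    results_path = Path(output_directory) / model_name / results_file
    return force_rerun or not results_path.exists()


if __name__ == "__main__":
    print(should_run("path/to/output", "name_of_hosted_model"))
```

With this layout, rerunning the same model into the same parent directory is skipped unless --force_rerun is passed.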
@misc{pandit2025hard,
title={Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math},
author={Pandit, Shrey and Xu, Austin and Nguyen, Xuan-Phi and Ming, Yifei and Xiong, Caiming and Joty, Shafiq},
year={2025},
journal={arXiv preprint arXiv:2510.13744},
}