SalesforceAIResearch/Hard2Verify

Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

Authors: Shrey Pandit*, Austin Xu*, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, Shafiq Joty

This repository contains code for running evaluation on Hard2Verify.

⚠️ To prevent leakage and/or contamination, the data we uploaded to Hugging Face is encrypted. The evaluation script automatically decrypts it to a human-readable format. Please do not upload unencrypted versions of the dataset to Hugging Face!

This dataset was generated using GPT, Gemini, and Claude and should not be used to develop competing products.

Setup

Run the following to set up the inference environment.

conda create -n h2v python=3.12
conda activate h2v
pip install uv
uv pip install -r requirements.txt

This also installs vLLM for locally hosted models. Please install flash-attn separately, if desired.

Run evaluation

run_eval.py supports three modes of inference: the OpenAI API, the Together API, and a locally hosted model server (e.g., via vLLM or SGLang). The script selects the endpoint based on the model name passed to it. See the ModelFactory class in model.py; routing is controlled by the OPENAI_MODELS and TOGETHER_MODELS variables in utils.py.
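The routing described above can be pictured with a minimal sketch. This is not the repository's ModelFactory: the function name `select_endpoint`, the placeholder model names, the returned dictionary layout, and the local base URL are all assumptions made for illustration. Only the variable names OPENAI_MODELS and TOGETHER_MODELS and the environment variable names come from this README.

```python
# Hypothetical model lists; the real ones live in utils.py and will differ.
OPENAI_MODELS = {"example-openai-model"}
TOGETHER_MODELS = {"example-org/example-together-model"}


def select_endpoint(model_name: str, port: int = 8000) -> dict:
    """Describe which backend a model name would be routed to (illustrative only)."""
    if model_name in OPENAI_MODELS:
        return {"backend": "openai", "api_key_env": "OPENAI_API_KEY"}
    if model_name in TOGETHER_MODELS:
        return {"backend": "together", "api_key_env": "TOGETHER_API_KEY"}
    # Any other name is assumed to be a locally hosted server (e.g., vLLM or SGLang).
    # The URL layout below is an assumption, not taken from the repository.
    return {"backend": "local", "base_url": f"http://localhost:{port}/v1"}


if __name__ == "__main__":
    for name in ("example-openai-model", "some-local-model"):
        print(name, "->", select_endpoint(name))
```

Under this sketch, adding a model to one of the two sets would be enough to change where it is routed, which matches the role the README gives to the variables in utils.py.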

Sampling parameters: In our evaluation, we use each model's recommended sampling parameters where they exist; otherwise, we use greedy decoding. The script defaults to the recommended sampling parameters, as implemented in the get_sampling_params function in utils.py. Users can override this if desired with --override_default_sampling_params.
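As a rough illustration of the fallback logic described above, the sketch below resolves sampling parameters in three tiers: user override, per-model recommendation, then greedy decoding. It is not the repository's get_sampling_params; the function name `resolve_sampling_params`, the `RECOMMENDED` table, and its placeholder values are hypothetical.

```python
from typing import Optional

# Hypothetical per-model recommendations; placeholder values, not real settings.
RECOMMENDED = {"example-model": {"temperature": 0.6, "top_p": 0.95}}

# Greedy decoding, matching the --temperature 0.0 / --top_p 1.0 values in the command.
GREEDY = {"temperature": 0.0, "top_p": 1.0}


def resolve_sampling_params(
    model_name: str,
    override: bool = False,
    user_params: Optional[dict] = None,
) -> dict:
    """Pick sampling parameters: user override, else recommended, else greedy."""
    if override:
        # Mirrors the intent of --override_default_sampling_params.
        return dict(user_params) if user_params else dict(GREEDY)
    return dict(RECOMMENDED.get(model_name, GREEDY))


if __name__ == "__main__":
    print(resolve_sampling_params("example-model"))
    print(resolve_sampling_params("unknown-model"))
    print(resolve_sampling_params("example-model", override=True,
                                  user_params={"temperature": 1.0, "top_p": 0.9}))
```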

⚠️ Make sure you set environment variables OPENAI_API_KEY and/or TOGETHER_API_KEY if running inference with API models!
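A quick pre-flight check along these lines can catch a missing key before a long run starts. This is a standalone sketch, not part of run_eval.py; the helper name `missing_api_keys` and the backend labels "openai", "together", and "local" are assumptions, while the environment variable names are the ones given above.

```python
import os
from typing import Mapping, Optional


def missing_api_keys(backend: str, env: Optional[Mapping[str, str]] = None) -> list:
    """Return the names of required API key variables that are unset or empty."""
    env = os.environ if env is None else env
    required = {
        "openai": ["OPENAI_API_KEY"],
        "together": ["TOGETHER_API_KEY"],
    }.get(backend, [])  # Locally hosted models are assumed to need no API key.
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    for backend in ("openai", "together", "local"):
        missing = missing_api_keys(backend)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{backend}: {status}")
```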

python run_eval.py \
    --task_type step_level \                        <-- This is the eval task (step_level or error_id)
    --model_name name_of_hosted_model \             <-- Based on model name, will route to one of the three inference modes above
    --output_directory path/to/output/ \            <-- Parent output directory. Results are saved to path/to/output/model_name/
    --force_rerun \                                 <-- Script checks if results exist. Overwrite existing w/ --force_rerun
    --debug \                                       <-- Only runs the first 10 samples
    --port 8000 \                                   <-- Port of locally hosted model, unused for non-local inference
    --override_default_sampling_params \            <-- Override per-model default sampling parameters with sampling parameters below
    --temperature 0.0 \
    --top_p 1.0 \
    --max_tokens 32768 \
    --top_k -1 \
    --reasoning_effort {low, medium, high} \        <-- gpt-oss/GPT-5 models
    --enable_thinking                               <-- For Qwen3 models

Citation

@misc{pandit2025hard,
    title={Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math}, 
    author={Pandit, Shrey and Xu, Austin and Nguyen, Xuan-Phi and Ming, Yifei and Xiong, Caiming and Joty, Shafiq},
    year={2025},
    journal={arXiv preprint arXiv:2510.13744},
}
