Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

Authors: Shrey Pandit*, Austin Xu*, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, Shafiq Joty

This repository contains the code for running the Hard2Verify evaluation.

Paper: https://arxiv.org/abs/2510.13744
Dataset: https://huggingface.co/datasets/Salesforce/Hard2Verify

⚠️ To prevent leakage and/or contamination, the data we uploaded to Hugging Face is encrypted. The evaluation script automatically decrypts it into a human-readable format. Please do not upload unencrypted versions of the dataset to Hugging Face!

This dataset was generated using GPT, Gemini, and Claude and should not be used to develop competing products.

Setup

Run the following to set up the inference environment.

conda create -n h2v python=3.12
conda activate h2v
pip install uv
uv pip install -r requirements.txt

This also installs vLLM for locally hosted models. Please install flash-attn separately, if desired.

Run evaluation

run_eval.py supports three modes of inference: the OpenAI API, the Together API, and a locally hosted model server (e.g., via vLLM or SGLang). The script selects the proper endpoint based on the model name passed to it. See the ModelFactory class in model.py, which is controlled by the OPENAI_MODELS and TOGETHER_MODELS variables in utils.py.
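As a rough illustration of name-based routing, the sketch below picks a backend from the model name. It is not the repository's ModelFactory: the function `select_backend`, the backend labels, and the registry contents are hypothetical placeholders, and the real OPENAI_MODELS and TOGETHER_MODELS in utils.py may be structured differently.

```python
# Hypothetical stand-ins for the registries defined in utils.py.
OPENAI_MODELS = {"example-openai-model"}
TOGETHER_MODELS = {"example-org/example-together-model"}


def select_backend(model_name: str) -> str:
    """Return which inference mode a model name would be routed to."""
    if model_name in OPENAI_MODELS:
        return "openai"
    if model_name in TOGETHER_MODELS:
        return "together"
    # Anything not registered as an API model is assumed to be served locally
    # (e.g., via vLLM or SGLang).
    return "local"
```

Under this assumed scheme, any model that is not listed in either registry falls through to the locally hosted server.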

Sampling parameters: In our evaluation, we use a model's recommended sampling parameters if they exist; otherwise, we use greedy decoding. The script defaults to the recommended sampling parameters, as implemented in the get_sampling_params function in utils.py. Users can override these defaults by passing --override_default_sampling_params together with the desired sampling flags.
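The fallback logic described here can be sketched as follows. This is an assumption-laden illustration, not the actual get_sampling_params: the `RECOMMENDED` table, its example entry, and the function `resolve_sampling_params` are hypothetical, and greedy decoding is represented by temperature 0.0 with no top-p or top-k truncation.

```python
# Greedy decoding expressed with the sampling flags that run_eval.py exposes.
GREEDY = {"temperature": 0.0, "top_p": 1.0, "top_k": -1}

# Hypothetical per-model recommendations; the values are illustrative only.
RECOMMENDED = {
    "example-model": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
}


def resolve_sampling_params(model_name, override=None):
    """Pick sampling parameters: user override, then recommended, then greedy."""
    if override is not None:
        # Mirrors the effect of --override_default_sampling_params.
        return dict(override)
    return dict(RECOMMENDED.get(model_name, GREEDY))
```

A model without a recommended entry resolves to the greedy settings, while an explicit override always wins.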

⚠️ Make sure you set environment variables OPENAI_API_KEY and/or TOGETHER_API_KEY if running inference with API models!
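A small pre-flight check along these lines can catch a missing key before a long run starts. The helper `missing_api_keys` and the backend labels are hypothetical; only the two environment variable names come from this README.

```python
import os


def missing_api_keys(backend, env=None):
    """List the API key variables that a backend needs but that are unset."""
    env = os.environ if env is None else env
    required = {
        "openai": ["OPENAI_API_KEY"],
        "together": ["TOGETHER_API_KEY"],
        "local": [],  # locally hosted servers need no API key here
    }
    return [name for name in required[backend] if not env.get(name)]
```

For example, calling `missing_api_keys("openai")` in a shell without OPENAI_API_KEY set returns a one-element list naming that variable.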

python run_eval.py \
    --task_type step_level \                        <-- This is the eval task (step_level or error_id)
    --model_name name_of_hosted_model \             <-- Based on model name, routes to the OpenAI API, Together API, or a local server
    --output_directory path/to/output/ \            <-- Parent output directory. Results are saved to path/to/output/model_name/
    --force_rerun \                                 <-- Script checks if results exist. Overwrite existing w/ --force_rerun
    --debug \                                       <-- Only runs the first 10 samples
    --port 8000 \                                   <-- Port of locally hosted model, unused for non-local inference
    --override_default_sampling_params \            <-- Override per-model default sampling parameters with sampling parameters below
    --temperature 0.0 \
    --top_p 1.0 \
    --max_tokens 32768 \
    --top_k -1 \
    --reasoning_effort {low, medium, high} \        <-- For gpt-oss/GPT-5 models
    --enable_thinking                               <-- For Qwen3 models
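To run both evaluation tasks for one model, the command can be assembled programmatically. The sketch below only builds the argument list and uses flags shown above; the helper `build_command` is hypothetical, and it assumes --debug and --force_rerun are plain on/off switches, as their usage above suggests.

```python
import shlex


def build_command(task_type, model_name, output_directory,
                  debug=False, force_rerun=False):
    """Build the run_eval.py argument list for one evaluation task."""
    if task_type not in ("step_level", "error_id"):
        raise ValueError(f"unknown task_type: {task_type}")
    cmd = [
        "python", "run_eval.py",
        "--task_type", task_type,
        "--model_name", model_name,
        "--output_directory", output_directory,
    ]
    if debug:
        cmd.append("--debug")  # only runs the first 10 samples
    if force_rerun:
        cmd.append("--force_rerun")  # overwrite existing results
    return cmd


if __name__ == "__main__":
    for task in ("step_level", "error_id"):
        # Print the shell-quoted command; pass the list to subprocess.run to execute it.
        print(shlex.join(build_command(task, "name_of_hosted_model", "path/to/output/")))
```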

Citation

@misc{pandit2025hard,
    title={Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math}, 
    author={Pandit, Shrey and Xu, Austin and Nguyen, Xuan-Phi and Ming, Yifei and Xiong, Caiming and Joty, Shafiq},
    year={2025},
    journal={arXiv preprint arXiv:2510.13744},
}
