EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking
EquiBench is a comprehensive benchmark designed to evaluate the code reasoning capabilities of Large Language Models (LLMs) through equivalence checking tasks. This framework helps researchers and developers assess how well different LLMs understand code semantics, reason about program functionality, and determine when two code snippets are functionally equivalent despite syntactic differences.
- Diverse Test Cases: Includes 2400 program pairs (equivalent and inequivalent) across six distinct categories (`DCE` for C programs, `STOKE` for x86-64 programs, `TVM` for CUDA programs, and `OJ_A`, `OJ_V`, `OJ_VA` for Python competitive programming problems)
- Multiple Prompting Strategies: Support for zero-shot, few-shot, and chain-of-thought variants to evaluate different reasoning approaches
- Wide Model Support: Compatible with leading LLMs from OpenAI, Anthropic, Meta, Mistral AI, Qwen, and DeepSeek
- Standardized Methodology: Consistent evaluation framework enabling fair comparison across different model architectures
- Overview
- Steps
- Details
- HuggingFace Dataset
- Evaluation Results
- Citation
- License
- Clone the repository and navigate to the directory:

  ```bash
  git clone https://github.com/Anjiang-Wei/EquiBench.git
  cd EquiBench
  ```

- Use Python version 3.12 or higher. You can use `pyenv` to manage Python versions (prerequisite: install pyenv). Install Python 3.12 with pyenv:

  ```bash
  pyenv install 3.12
  pyenv local 3.12
  ```

- Create a virtual environment and activate it:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Update pip and install the required packages:

  ```bash
  pip install --upgrade pip
  pip install .
  ```

- Set up API keys in a `.env` file. Create an empty `.env` file:

  ```bash
  touch .env
  ```

  Then add the following API keys to your `.env` file (a quick way to check that they are picked up is sketched right after this list):

  ```
  OPENAI_API_KEY=<your OpenAI key here>
  ANTHROPIC_API_KEY=<your Anthropic key here>
  TOGETHER_API_KEY=<your Together key here>
  HF_TOKEN=<your HuggingFace access token here>
  ```
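The evaluation scripts read these keys from the process environment. As an optional sanity check, the sketch below assumes the `python-dotenv` package is available (install it with `pip install python-dotenv` if it is not already pulled in as a dependency); it reports which keys are set without printing their values:

```python
# Minimal sketch: verify that the keys in .env are visible to Python.
# Assumes python-dotenv; this check is not part of EquiBench itself.
import os

from dotenv import load_dotenv

load_dotenv()  # load key=value pairs from .env into the environment

for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "TOGETHER_API_KEY", "HF_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```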
When returning to work on EquiBench:
- Navigate to the repository directory:

  ```bash
  cd EquiBench
  ```

- Activate the virtual environment:

  ```bash
  source .venv/bin/activate
  ```
- Obtain a `read` or `write` access token from HuggingFace.

- Log in using the access token:

  Option A: Log in via the command line and verify access:

  ```bash
  huggingface-cli login
  huggingface-cli whoami
  ```

  Option B: Add your token directly to the `.env` file as the `HF_TOKEN` environment variable.

- Download the datasets:

  ```bash
  python step1_data.py data
  ```

  This downloads all 2400 program pairs from the EquiBench-Datasets repository on HuggingFace.
Execute the evaluation script with your desired configuration. The example below runs a zero-shot evaluation on three different models with a sample limit of 1 for each category:
```bash
python step2_eval.py data result/eval \
    --prompt_types ZERO \
    --models \
        openai/gpt-4o-mini-2024-07-18 \
        anthropic/claude-3-5-sonnet-20241022 \
        Qwen/Qwen2.5-7B-Instruct-Turbo \
    --limit 1
```
The evaluation script supports several command-line options:
- `--models`: List of models to evaluate (see Supported Models)
- `--limit`: Number of test pairs to evaluate per category (omit to evaluate all 400 pairs per category)
- `--prompt_types`: Types of prompting strategies to use (see Supported Prompt Types)
- `--categories`: Select specific categories for evaluation. Choices: `DCE`, `STOKE`, `TVM`, `OJ_A`, `OJ_V`, `OJ_VA`
- `--prompt_path`: Path to custom prompt templates; defaults to `prompts.toml`
- `--log_level`: Set logging verbosity; defaults to `INFO`. Choices: `DEBUG`, `INFO`, `WARNING`, `ERROR`
```bash
# Evaluate all models on a single category with few-shot prompting
python step2_eval.py data result/eval --prompt_types FEW --categories OJ_A --limit 10

# Evaluate one model on all categories with chain-of-thought reasoning
python step2_eval.py data result/eval --prompt_types ZERO_COT --models openai/gpt-4o-2024-11-20

# Custom output directory
python step2_eval.py data custom_results --prompt_types ZERO FEW
```
EquiBench contains 2400 pairs of programs across six distinct categories of code equivalence tasks:
- DCE (Dead Code Elimination for C programs): Code pairs that differ by the removal of dead or live code
- STOKE (Superoptimizer for x86-64 programs): Assembly code pairs optimized using the STOKE framework
- TVM (Compiler Scheduling for CUDA programs): CUDA kernel pairs produced by different TVM compiler schedules for tensor operations
- OJ_A (Python Competitive Programming - Algorithm): Different algorithmic solutions to the same programming problem
- OJ_V (Python Competitive Programming - Variable Renaming): Code pairs with variable renaming transformations
- OJ_VA (Python Competitive Programming - Variables + Algorithms): Code pairs with both variable renaming and algorithmic differences
Each category contains 400 pairs of programs (200 equivalent and 200 inequivalent), providing a diverse range of challenges for LLMs to reason about code semantics.
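To make the task concrete, the sketch below shows a hypothetical pair in the style of the `OJ_V` category (illustrative only, not drawn from the dataset): the two Python functions differ only by variable renaming, so they are equivalent.

```python
# Illustrative OJ_V-style pair (hypothetical example, not taken from EquiBench):
# both functions compute the same value, differing only in identifier names.

def solve_a(numbers):
    total = 0
    for value in numbers:
        total += value * value
    return total

def solve_b(xs):
    acc = 0
    for x in xs:
        acc += x * x
    return acc

# A model is asked to judge equivalence from the source code alone; here the
# answer is "equivalent" because the transformation is a pure renaming.
assert solve_a([1, 2, 3]) == solve_b([1, 2, 3])
```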
EquiBench evaluates models using four different prompting strategies:
- `ZERO`: Zero-shot prompting (directly asking the model without examples)
- `FEW`: Few-shot prompting (providing example problems and solutions)
- `ZERO_COT`: Zero-shot chain of thought (encouraging step-by-step reasoning)
- `FEW_COT`: Few-shot chain of thought (examples with step-by-step reasoning)
Each strategy tests different aspects of a model's reasoning capabilities, from basic understanding to advanced reasoning chains.
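The exact wording of each strategy is defined in `prompts.toml`. As a rough illustration only (not the benchmark's actual templates), a zero-shot prompt and its chain-of-thought variant might differ as follows:

```python
# Hypothetical illustration of how ZERO and ZERO_COT prompts could differ;
# the real templates used by EquiBench live in prompts.toml.

def build_prompt(program_a: str, program_b: str, chain_of_thought: bool) -> str:
    question = (
        "Are the following two programs semantically equivalent?\n\n"
        f"Program 1:\n{program_a}\n\nProgram 2:\n{program_b}\n\n"
    )
    if chain_of_thought:
        # ZERO_COT-style: ask for step-by-step reasoning before the verdict
        return question + "Let's think step by step, then answer 'equivalent' or 'inequivalent'."
    # ZERO-style: ask for the verdict directly
    return question + "Answer 'equivalent' or 'inequivalent'."
```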
EquiBench supports evaluation across a diverse range of LLMs:
- `openai/o1-mini-2024-09-12`
- `openai/gpt-4o-2024-11-20`
- `openai/gpt-4o-mini-2024-07-18`
- `openai/o3-mini-2025-01-31`
- `anthropic/claude-3-5-sonnet-20241022`
- `meta-llama/Llama-3.2-3B-Instruct-Turbo`
- `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`
- `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`
- `meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo`
- `mistralai/Mistral-7B-Instruct-v0.3`
- `mistralai/Mixtral-8x7B-Instruct-v0.1`
- `mistralai/Mixtral-8x22B-Instruct-v0.1`
- `Qwen/Qwen2.5-7B-Instruct-Turbo`
- `Qwen/Qwen2.5-72B-Instruct-Turbo`
- `Qwen/QwQ-32B-Preview`
- `deepseek-ai/DeepSeek-R1`
- `deepseek-ai/DeepSeek-V3`
Additional models from OpenAI, Anthropic, and together.ai platforms are also supported.
The EquiBench dataset is hosted on HuggingFace as anjiangwei/EquiBench-Datasets.
Category | Language | Equivalent Pairs | Inequivalent Pairs | Total |
---|---|---|---|---|
DCE | C | 200 | 200 | 400 |
STOKE | x86-64 | 200 | 200 | 400 |
TVM | CUDA | 200 | 200 | 400 |
OJ_A | Python | 200 | 200 | 400 |
OJ_V | Python | 200 | 200 | 400 |
OJ_VA | Python | 200 | 200 | 400 |
Total | | 1200 | 1200 | 2400 |
You can directly access the dataset using the HuggingFace `datasets` library:
```python
from datasets import load_dataset

# Define dataset path
hf_path = "anjiangwei/EquiBench-Datasets"

# Load specific categories
dce_dataset = load_dataset(path=hf_path, name="DCE")
stoke_dataset = load_dataset(path=hf_path, name="STOKE")
tvm_dataset = load_dataset(path=hf_path, name="TVM")
oj_a_dataset = load_dataset(path=hf_path, name="OJ_A")
oj_v_dataset = load_dataset(path=hf_path, name="OJ_V")
oj_va_dataset = load_dataset(path=hf_path, name="OJ_VA")

# Example: Access the first pair in the OJ_A category
print(oj_a_dataset["train"][0])
```
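Building on the snippet above, you can inspect a split's size and field names without assuming a particular record schema (whether the split is called `train` follows the example above):

```python
# Inspect the OJ_A split loaded above: number of pairs and available fields.
split = oj_a_dataset["train"]
print(len(split))          # number of program pairs in this split
print(split.column_names)  # field names of each record
```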
Below is a summary of performance across different models and prompting strategies based on our paper experiments:
Model | DCE | CUDA | x86-64 | OJ_A | OJ_V | OJ_VA | Overall Accuracy |
---|---|---|---|---|---|---|---|
Random Baseline | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
Llama-3.2-3B-Instruct-Turbo | 50.0 | 49.8 | 50.0 | 51.5 | 51.5 | 51.5 | 50.7 |
Llama-3.1-8B-Instruct-Turbo | 41.8 | 49.8 | 50.5 | 57.5 | 75.5 | 56.8 | 55.3 |
Mistral-7B-Instruct-v0.3 | 51.0 | 57.2 | 73.8 | 50.7 | 50.5 | 50.2 | 55.6 |
Mixtral-8x7B-Instruct-v0.1 | 50.2 | 47.0 | 64.2 | 59.0 | 61.5 | 55.0 | 56.1 |
Mixtral-8x22B-Instruct-v0.1 | 46.8 | 49.0 | 62.7 | 63.5 | 76.0 | 62.7 | 60.1 |
Llama-3.1-70B-Instruct-Turbo | 47.5 | 50.0 | 58.5 | 66.2 | 72.0 | 67.5 | 60.3 |
QwQ-32B-Preview | 48.2 | 50.5 | 62.7 | 65.2 | 71.2 | 64.2 | 60.3 |
Qwen2.5-7B-Instruct-Turbo | 50.5 | 49.2 | 58.0 | 62.0 | 80.8 | 63.0 | 60.6 |
gpt-4o-mini-2024-07-18 | 46.8 | 50.2 | 56.8 | 64.5 | 91.2 | 64.0 | 62.2 |
Qwen2.5-72B-Instruct-Turbo | 42.8 | 56.0 | 64.8 | 72.0 | 76.5 | 70.8 | 63.8 |
Llama-3.1-405B-Instruct-Turbo | 40.0 | 49.0 | 75.0 | 72.2 | 74.5 | 72.8 | 63.9 |
DeepSeek-V3 | 41.0 | 50.7 | 69.2 | 73.0 | 83.5 | 72.5 | 65.0 |
gpt-4o-2024-11-20 | 43.2 | 49.5 | 65.2 | 71.0 | 87.0 | 73.8 | 65.0 |
claude3.5-sonnet-2024-10-22 | 38.5 | 62.3 | 70.0 | 71.2 | 78.0 | 73.5 | 65.6 |
o1-mini-2024-09-12 | 55.8 | 50.7 | 74.2 | 80.0 | 89.8 | 78.8 | 71.5 |
DeepSeek-R1 | 52.2 | 61.0 | 78.2 | 79.8 | 91.5 | 78.0 | 73.5 |
o3-mini-2025-01-31 | 68.8 | 59.0 | 84.5 | 84.2 | 88.2 | 83.2 | 78.0 |
Mean | 47.9 | 52.4 | 65.8 | 67.3 | 76.4 | 67.0 | 62.8 |
Table: Accuracy of 17 models on EquiBench under 0-shot prompting.
We report accuracy for each of the six equivalence categories along with the overall accuracy.
Note: Because each category contributes the same number of pairs, the overall accuracy is the average of the six per-category accuracies. For detailed results, please refer to our paper.
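For example, taking the `o3-mini-2025-01-31` row from the table above, the overall figure is simply the mean of the six per-category accuracies:

```python
# Sanity check: overall accuracy is the mean of the six per-category accuracies
# (values taken from the o3-mini-2025-01-31 row above).
per_category = [68.8, 59.0, 84.5, 84.2, 88.2, 83.2]
overall = sum(per_category) / len(per_category)
print(round(overall, 1))  # 78.0, matching the Overall Accuracy column
```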
If you use EquiBench in your research, please cite our paper:
```bibtex
@article{wei2025equibench,
  title={EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking},
  author={Wei, Anjiang and Cao, Jiannan and Li, Ran and Chen, Hongyu and Zhang, Yuhui and Wang, Ziheng and Sun, Yaofeng and Liu, Yuan and Teixeira, Thiago S. F. X. and Yang, Diyi and Wang, Ke and Aiken, Alex},
  journal={arXiv preprint arXiv:2502.12466},
  year={2025}
}
```
Apache License 2.0. See the LICENSE file for details.