EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking
EquiBench is a comprehensive benchmark designed to evaluate the code reasoning capabilities of Large Language Models (LLMs) through equivalence checking tasks. This framework helps researchers and developers assess how well different LLMs understand code semantics, reason about program functionality, and determine when two code snippets are functionally equivalent despite syntactic differences.
- Diverse Test Cases: Includes 2400 program pairs (equivalent and inequivalent) across six distinct categories (`DCE` for C programs, `STOKE` for x86-64 programs, `TVM` for CUDA programs, and `OJ_A`, `OJ_V`, `OJ_VA` for Python competitive programming problems)
- Multiple Prompting Strategies: Support for zero-shot, few-shot, and chain-of-thought variants to evaluate different reasoning approaches
- Wide Model Support: Compatible with leading LLMs from OpenAI, Anthropic, Meta, Mistral AI, Qwen, and DeepSeek
- Standardized Methodology: Consistent evaluation framework enabling fair comparison across different model architectures
- Overview
- Steps
- Details
- HuggingFace Dataset
- Evaluation Results
- Citation
- License
- Clone the repository and navigate to the directory:

  ```bash
  git clone https://github.com/Anjiang-Wei/EquiBench.git
  cd EquiBench
  ```

- Use Python version 3.12 or higher. You can use `pyenv` to manage Python versions (prerequisite: install pyenv). Install Python 3.12 with pyenv:

  ```bash
  pyenv install 3.12
  pyenv local 3.12
  ```

- Create a virtual environment and activate it:

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Update pip and install the required packages:

  ```bash
  pip install --upgrade pip
  pip install .
  ```

- Set up API keys in a `.env` file. Create an empty `.env` file:

  ```bash
  touch .env
  ```

  Then add the following API keys to your `.env` file (a quick way to check that they are picked up is sketched right after this list):

  ```
  OPENAI_API_KEY=<your OpenAI key here>
  ANTHROPIC_API_KEY=<your Anthropic key here>
  TOGETHER_API_KEY=<your Together key here>
  HF_TOKEN=<your HuggingFace access token here>
  ```
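The evaluation scripts read these keys from the process environment. As an optional sanity check, the sketch below assumes the `python-dotenv` package is available (install it with `pip install python-dotenv` if it is not already pulled in as a dependency); it reports which keys are set without printing their values:

```python
# Minimal sketch: verify that the keys in .env are visible to Python.
# Assumes python-dotenv; this check is not part of EquiBench itself.
import os

from dotenv import load_dotenv

load_dotenv()  # load key=value pairs from .env into the environment

for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "TOGETHER_API_KEY", "HF_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing'}")
```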
When returning to work on EquiBench:
- Navigate to the repository directory:

  ```bash
  cd EquiBench
  ```

- Activate the virtual environment:

  ```bash
  source .venv/bin/activate
  ```
- Obtain a `read` or `write` access token from HuggingFace.

- Log in using the access token:

  Option A: Log in via the command line and verify access:

  ```bash
  huggingface-cli login
  huggingface-cli whoami
  ```

  Option B: Add your token directly to the `.env` file as the `HF_TOKEN` environment variable.

- Download the datasets:

  ```bash
  python step1_data.py data
  ```

  This downloads all 2400 program pairs from the EquiBench-Datasets repository on HuggingFace.
Execute the evaluation script with your desired configuration. The example below runs a zero-shot evaluation on three different models with a sample limit of 1 for each category:
```bash
python step2_eval.py data result/eval \
    --prompt_types ZERO \
    --models \
        openai/gpt-4o-mini-2024-07-18 \
        anthropic/claude-3-5-sonnet-20241022 \
        Qwen/Qwen2.5-7B-Instruct-Turbo \
    --limit 1
```
The evaluation script supports several command-line options:
- `--models`: List of models to evaluate (see Supported Models)
- `--limit`: Number of test pairs to evaluate per category (omit to evaluate all 400 pairs per category)
- `--prompt_types`: Types of prompting strategies to use (see Supported Prompt Types)
- `--categories`: Select specific categories for evaluation. Choices: `DCE`, `STOKE`, `TVM`, `OJ_A`, `OJ_V`, `OJ_VA`
- `--prompt_path`: Path to custom prompt templates; defaults to `prompts.toml`
- `--log_level`: Set logging verbosity; defaults to `INFO`. Choices: `DEBUG`, `INFO`, `WARNING`, `ERROR`
```bash
# Evaluate all models on a single category with few-shot prompting
python step2_eval.py data result/eval --prompt_types FEW --categories OJ_A --limit 10

# Evaluate one model on all categories with chain-of-thought reasoning
python step2_eval.py data result/eval --prompt_types ZERO_COT --models openai/gpt-4o-2024-11-20

# Custom output directory
python step2_eval.py data custom_results --prompt_types ZERO FEW
```
EquiBench contains 2400 pairs of programs across six distinct categories of code equivalence tasks:
- DCE (Dead Code Elimination for C programs): Code pairs that differ by the removal of dead or live code
- STOKE (Superoptimizer for x86-64 programs): Assembly code pairs optimized using the STOKE framework
- TVM (Compiler Scheduling for CUDA programs): CUDA kernel pairs produced by different TVM compiler schedules for tensor operations
- OJ_A (Python Competitive Programming - Algorithm): Different algorithmic solutions to the same programming problem
- OJ_V (Python Competitive Programming - Variable Renaming): Code pairs with variable renaming transformations
- OJ_VA (Python Competitive Programming - Variables + Algorithms): Code pairs with both variable renaming and algorithmic differences
Each category contains 400 pairs of programs (200 equivalent and 200 inequivalent), providing a diverse range of challenges for LLMs to reason about code semantics.
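To make the task concrete, the sketch below shows a hypothetical pair in the style of the `OJ_V` category (illustrative only, not drawn from the dataset): the two Python functions differ only by variable renaming, so they are equivalent.

```python
# Illustrative OJ_V-style pair (hypothetical example, not taken from EquiBench):
# both functions compute the same value, differing only in identifier names.

def solve_a(numbers):
    total = 0
    for value in numbers:
        total += value * value
    return total

def solve_b(xs):
    acc = 0
    for x in xs:
        acc += x * x
    return acc

# A model is asked to judge equivalence from the source code alone; here the
# answer is "equivalent" because the transformation is a pure renaming.
assert solve_a([1, 2, 3]) == solve_b([1, 2, 3])
```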
EquiBench evaluates models using four different prompting strategies:
- `ZERO`: Zero-shot prompting (directly asking the model without examples)
- `FEW`: Few-shot prompting (providing example problems and solutions)
- `ZERO_COT`: Zero-shot chain of thought (encouraging step-by-step reasoning)
- `FEW_COT`: Few-shot chain of thought (examples with step-by-step reasoning)
Each strategy tests different aspects of a model's reasoning capabilities, from basic understanding to advanced reasoning chains.
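The exact wording of each strategy is defined in `prompts.toml`. As a rough illustration only (not the benchmark's actual templates), a zero-shot prompt and its chain-of-thought variant might differ as follows:

```python
# Hypothetical illustration of how ZERO and ZERO_COT prompts could differ;
# the real templates used by EquiBench live in prompts.toml.

def build_prompt(program_a: str, program_b: str, chain_of_thought: bool) -> str:
    question = (
        "Are the following two programs semantically equivalent?\n\n"
        f"Program 1:\n{program_a}\n\nProgram 2:\n{program_b}\n\n"
    )
    if chain_of_thought:
        # ZERO_COT-style: ask for step-by-step reasoning before the verdict
        return question + "Let's think step by step, then answer 'equivalent' or 'inequivalent'."
    # ZERO-style: ask for the verdict directly
    return question + "Answer 'equivalent' or 'inequivalent'."
```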
EquiBench supports evaluation across a diverse range of LLMs:
- `openai/o1-mini-2024-09-12`
- `openai/gpt-4o-2024-11-20`
- `openai/gpt-4o-mini-2024-07-18`
- `openai/o3-mini-2025-01-31`
- `anthropic/claude-3-5-sonnet-20241022`
- `meta-llama/Llama-3.2-3B-Instruct-Turbo`
- `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo`
- `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`
- `meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo`
- `mistralai/Mistral-7B-Instruct-v0.3`
- `mistralai/Mixtral-8x7B-Instruct-v0.1`
- `mistralai/Mixtral-8x22B-Instruct-v0.1`
- `Qwen/Qwen2.5-7B-Instruct-Turbo`
- `Qwen/Qwen2.5-72B-Instruct-Turbo`
- `Qwen/QwQ-32B-Preview`
- `deepseek-ai/DeepSeek-R1`
- `deepseek-ai/DeepSeek-V3`
Additional models from OpenAI, Anthropic, and together.ai platforms are also supported.
The EquiBench dataset is hosted on HuggingFace as anjiangwei/EquiBench-Datasets.
Category | Language | Equivalent Pairs | Inequivalent Pairs | Total |
---|---|---|---|---|
DCE | C | 200 | 200 | 400 |
STOKE | x86-64 | 200 | 200 | 400 |
TVM | CUDA | 200 | 200 | 400 |
OJ_A | Python | 200 | 200 | 400 |
OJ_V | Python | 200 | 200 | 400 |
OJ_VA | Python | 200 | 200 | 400 |
Total | | 1200 | 1200 | 2400 |
You can directly access the dataset using the HuggingFace `datasets` library:
```python
from datasets import load_dataset

# Define dataset path
hf_path = "anjiangwei/EquiBench-Datasets"

# Load specific categories
dce_dataset = load_dataset(path=hf_path, name="DCE")
stoke_dataset = load_dataset(path=hf_path, name="STOKE")
tvm_dataset = load_dataset(path=hf_path, name="TVM")
oj_a_dataset = load_dataset(path=hf_path, name="OJ_A")
oj_v_dataset = load_dataset(path=hf_path, name="OJ_V")
oj_va_dataset = load_dataset(path=hf_path, name="OJ_VA")

# Example: Access the first pair in the OJ_A category
print(oj_a_dataset["train"][0])
```
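Building on the snippet above, you can inspect a split's size and field names without assuming a particular record schema (whether the split is called `train` follows the example above):

```python
# Inspect the OJ_A split loaded above: number of pairs and available fields.
split = oj_a_dataset["train"]
print(len(split))          # number of program pairs in this split
print(split.column_names)  # field names of each record
```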
Below is a summary of performance across different models and prompting strategies based on our paper experiments:
Model | DCE | CUDA | x86-64 | OJ_A | OJ_V | OJ_VA | Overall Accuracy |
---|---|---|---|---|---|---|---|
Random Baseline | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
Llama-3.2-3B-Instruct-Turbo | 50.0 | 49.8 | 50.0 | 51.5 | 51.5 | 51.5 | 50.7 |
Llama-3.1-8B-Instruct-Turbo | 41.8 | 49.8 | 50.5 | 57.5 | 75.5 | 56.8 | 55.3 |
Mistral-7B-Instruct-v0.3 | 51.0 | 57.2 | 73.8 | 50.7 | 50.5 | 50.2 | 55.6 |
Mixtral-8x7B-Instruct-v0.1 | 50.2 | 47.0 | 64.2 | 59.0 | 61.5 | 55.0 | 56.1 |
Mixtral-8x22B-Instruct-v0.1 | 46.8 | 49.0 | 62.7 | 63.5 | 76.0 | 62.7 | 60.1 |
Llama-3.1-70B-Instruct-Turbo | 47.5 | 50.0 | 58.5 | 66.2 | 72.0 | 67.5 | 60.3 |
QwQ-32B-Preview | 48.2 | 50.5 | 62.7 | 65.2 | 71.2 | 64.2 | 60.3 |
Qwen2.5-7B-Instruct-Turbo | 50.5 | 49.2 | 58.0 | 62.0 | 80.8 | 63.0 | 60.6 |
gpt-4o-mini-2024-07-18 | 46.8 | 50.2 | 56.8 | 64.5 | 91.2 | 64.0 | 62.2 |
Qwen2.5-72B-Instruct-Turbo | 42.8 | 56.0 | 64.8 | 72.0 | 76.5 | 70.8 | 63.8 |
Llama-3.1-405B-Instruct-Turbo | 40.0 | 49.0 | 75.0 | 72.2 | 74.5 | 72.8 | 63.9 |
DeepSeek-V3 | 41.0 | 50.7 | 69.2 | 73.0 | 83.5 | 72.5 | 65.0 |
gpt-4o-2024-11-20 | 43.2 | 49.5 | 65.2 | 71.0 | 87.0 | 73.8 | 65.0 |
claude3.5-sonnet-2024-10-22 | 38.5 | 62.3 | 70.0 | 71.2 | 78.0 | 73.5 | 65.6 |
o1-mini-2024-09-12 | 55.8 | 50.7 | 74.2 | 80.0 | 89.8 | 78.8 | 71.5 |
DeepSeek-R1 | 52.2 | 61.0 | 78.2 | 79.8 | 91.5 | 78.0 | 73.5 |
o3-mini-2025-01-31 | 68.8 | 59.0 | 84.5 | 84.2 | 88.2 | 83.2 | 78.0 |
Mean | 47.9 | 52.4 | 65.8 | 67.3 | 76.4 | 67.0 | 62.8 |
Table: Accuracy of 17 models on EquiBench under 0-shot prompting.
We report accuracy for each of the six equivalence categories along with the overall accuracy.
Note: Because each category contributes the same number of pairs, the overall accuracy is the average of the six per-category accuracies. For detailed results, please refer to our paper.
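For example, taking the `o3-mini-2025-01-31` row from the table above, the overall figure is simply the mean of the six per-category accuracies:

```python
# Sanity check: overall accuracy is the mean of the six per-category accuracies
# (values taken from the o3-mini-2025-01-31 row above).
per_category = [68.8, 59.0, 84.5, 84.2, 88.2, 83.2]
overall = sum(per_category) / len(per_category)
print(round(overall, 1))  # 78.0, matching the Overall Accuracy column
```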
If you use EquiBench in your research, please cite our paper:
```bibtex
@article{wei2025equibench,
  title={EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking},
  author={Wei, Anjiang and Cao, Jiannan and Li, Ran and Chen, Hongyu and Zhang, Yuhui and Wang, Ziheng and Sun, Yaofeng and Liu, Yuan and Teixeira, Thiago S. F. X. and Yang, Diyi and Wang, Ke and Aiken, Alex},
  journal={arXiv preprint arXiv:2502.12466},
  year={2025}
}
```
Apache License 2.0. See the LICENSE file for details.