An Inspect evaluation that tests LLMs' chess ability against Stockfish at various skill levels. Models are accessed via OpenRouter, so any model available there can be evaluated with no code changes.
| File | Description |
|---|---|
| `main.py` | Stateless — each turn sends a standalone prompt with the current PGN |
| `main_in_context.py` | In-context — maintains the full conversation history so the model can build on its prior reasoning |
| `main_image.py` | Image — like in-context, but each turn also includes a rendered board image (requires a vision-capable model) |
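The practical difference between the first two variants is how the message list sent to the model is assembled each turn; roughly (an illustrative sketch, not the actual code):

```python
def stateless_messages(pgn: str) -> list[dict]:
    # main.py: one fresh prompt per turn, no memory of earlier turns
    return [{"role": "user",
             "content": f"PGN so far:\n{pgn}\nReply with your next move in SAN."}]

def in_context_messages(history: list[dict], pgn: str) -> list[dict]:
    # main_in_context.py: the running conversation is kept and extended each turn
    history.append({"role": "user",
                    "content": f"PGN so far:\n{pgn}\nReply with your next move in SAN."})
    return history
```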
For each model, the eval plays chess games across a matrix of:
- Stockfish skill levels (0-20, configurable)
- LLM color (white and black)
- Repeated games (via `--epochs`)
And tracks:
| Metric | Description |
|---|---|
| Win/loss/draw rate | `accuracy` in Inspect (1.0 = win, 0.5 = draw, 0.0 = loss) |
| Game length | Average full moves per game (longer against stronger Stockfish = better play) |
| Invalid move rate | Average invalid move attempts per game (lower = better notation understanding) |
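In scorer terms, each finished game might be reduced to a `Score` along these lines (a minimal sketch assuming Inspect's `Score` type; the `metadata` field names are illustrative, not necessarily those in `main.py`):

```python
from inspect_ai.scorer import Score

def score_game(result: str, full_moves: int, invalid_attempts: int) -> Score:
    """Map a finished game to an Inspect score.

    result is "win", "draw", or "loss" from the LLM's perspective.
    """
    return Score(
        value={"win": 1.0, "draw": 0.5, "loss": 0.0}[result],
        metadata={
            "full_moves": full_moves,              # feeds the game-length metric
            "invalid_attempts": invalid_attempts,  # feeds the invalid-move rate
        },
    )
```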
- Python 3.10+
- A Stockfish binary (e.g. `brew install stockfish` on macOS)
- An OpenRouter API key
- For `main_image.py`: no extra system dependencies (uses fentoboardimage + Pillow)
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

```bash
export OPENROUTER_API_KEY=your-openrouter-api-key
```

Run an evaluation with `inspect eval`:
```bash
# Stateless (main.py)
inspect eval main.py --model openrouter/openai/gpt-4o --epochs 3

# In-context conversation (main_in_context.py)
inspect eval main_in_context.py --model openrouter/openai/gpt-4o --epochs 3

# With board images (main_image.py) — requires a vision-capable model
inspect eval main_image.py --model openrouter/openai/gpt-4o --epochs 3
```

Customise Stockfish levels and path with `-T`:
```bash
inspect eval main.py \
  --model openrouter/openai/gpt-4o \
  -T stockfish_levels=[1,5,10,20] \
  -T stockfish_path=/opt/homebrew/bin/stockfish \
  --epochs 3
```

| Parameter | Default | Description |
|---|---|---|
| `stockfish_levels` | `[1, 5, 10, 20]` | List of Stockfish skill levels (0-20) |
| `stockfish_path` | `/opt/homebrew/bin/stockfish` | Path to the Stockfish binary |
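The `-T` flags map one-to-one onto the task's keyword parameters. As a rough sketch of the shape (assuming Inspect's `@task` decorator; the real dataset/solver/scorer wiring in `main.py` is elided here):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample

@task
def chess(
    stockfish_levels: list[int] = [1, 5, 10, 20],
    stockfish_path: str = "/opt/homebrew/bin/stockfish",
) -> Task:
    # Each -T name=value on the command line overrides one of these defaults.
    samples = [
        Sample(
            id=f"level-{level}-{color}",
            input=f"Play as {color} against Stockfish at skill level {level}.",
            metadata={"level": level, "color": color,
                      "stockfish_path": stockfish_path},
        )
        for level in stockfish_levels
        for color in ("white", "black")
    ]
    return Task(dataset=samples)  # solver and scorer omitted in this sketch
```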
Run the same eval with different `--model` flags:

```bash
inspect eval main.py --model openrouter/openai/gpt-4o --epochs 3
inspect eval main.py --model openrouter/anthropic/claude-sonnet-4-20250514 --epochs 3
inspect eval main.py --model openrouter/google/gemini-2.0-flash-001 --epochs 3
inspect eval main.py --model openrouter/deepseek/deepseek-chat --epochs 3
```

Inspect includes a web UI for browsing results:
```bash
inspect view
```

This shows per-game scores, metrics, full LLM conversation transcripts, and metadata.
You can filter by sample ID (e.g. `level-10-white`) and compare across model runs.
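Logs can also be read programmatically with Inspect's log API; something like this should print per-sample scores (assuming logs are in the default log directory):

```python
from inspect_ai.log import list_eval_logs, read_eval_log

# Pick one log from the default log directory (here, the first listed).
log = read_eval_log(list_eval_logs()[0])
for sample in log.samples or []:
    scores = {name: s.value for name, s in (sample.scores or {}).items()}
    print(sample.id, scores)
```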
Each eval run creates `len(stockfish_levels) * 2` samples (one per level per color).
With `--epochs N`, each sample is played N times, so the total number of games is
`len(stockfish_levels) * 2 * N`. With the default four levels and `--epochs 3`, for example, that is 4 * 2 * 3 = 24 games.
The game loop (see the sketch after this list):
- A Stockfish engine is opened and configured to the sample's skill level
- Moves alternate between Stockfish and the LLM based on the sample's color
- On each LLM turn, the full PGN is sent as a prompt and the response is parsed as SAN
- Invalid moves are retried up to 5 times (with feedback about which moves are invalid)
- The game ends on checkmate, stalemate, draw, or if the LLM fails to produce a valid move
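In code, the core of that loop might look as follows (a condensed sketch using python-chess; `ask_model_for_san` is a hypothetical stand-in for the OpenRouter model call, and time limits and retry details differ in the real implementation):

```python
import chess
import chess.engine
import chess.pgn

MAX_RETRIES = 5

def play_game(stockfish_path: str, level: int, llm_plays_white: bool) -> chess.Board:
    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        engine.configure({"Skill Level": level})  # UCI option, 0-20
        while not board.is_game_over():
            if (board.turn == chess.WHITE) == llm_plays_white:
                feedback, move = "", None
                for _ in range(MAX_RETRIES):
                    pgn = str(chess.pgn.Game.from_board(board))
                    san = ask_model_for_san(pgn, feedback)  # hypothetical LLM call
                    try:
                        move = board.parse_san(san)
                        break
                    except ValueError:
                        feedback = f"'{san}' is not a legal move in this position."
                if move is None:
                    break  # no valid move after MAX_RETRIES: game ends as a loss
                board.push(move)
            else:
                result = engine.play(board, chess.engine.Limit(time=0.1))
                board.push(result.move)
    return board
```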
This project started as a simple script pitting GPT-4 against Stockfish (see git history). It has since been rewritten as a proper eval harness using Inspect to test multiple frontier models systematically.